Comments (7)
I barely know a thing about programming, let alone coding, but couldn't you use spaCy instead of jieba? It supports 64 languages.
I am using simplemma for lemmatization, which is simply a big text lookup program. There is no current need for spacy for this project. It's a fairly big and complicated framework for many things (including natural language understanding, tagging, classification, etc.) and requires dynamically downloading resources.
from vocabsieve.
I originally wanted to implement such a feature, but I couldn't quite afford to commit the time and take on the maintenance burden needed for it. At one point I even implemented a Japanese parser, but Yomichan does a much better job, and the parser added too many dependencies. I didn't know of a dictionary-based way of splitting words before.
However, for Chinese this can be pretty useful.
If you are willing to contribute code to make this happen, feel free to do so! I would be glad to accept a patch/PR for this.
Here are a few points to keep in mind:
- What kind of dependencies would this introduce? Is it a static binary file plus some code to parse it? cxfreeze, which is used to create binary exes for Mac and Windows, is somewhat tricky to deal with when it comes to static files, but this should be solvable. However, I think it should still be preferable if the dictionary is to be imported by the user.
- The splitting function should be called in the preprocess_clipboard function in https://github.com/FreeLanguageTools/vocabsieve/blob/master/vocabsieve/dictionary.py . If the dictionary-loading and splitting is simple enough, it can simply be implemented as another function in that file. Otherwise, you can create a new file for it.
- If possible, I think the dictionary should simply be incorporated into the existing dictionary infrastructure (the database), and then the database can be queried for the splitting part.
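For anyone picking this up, a dictionary-based splitter can be sketched roughly as below. This is only an illustration of greedy forward maximum matching, not code from vocabsieve: the `split_words` name, the `max_len` parameter, and the in-memory `vocabulary` set (standing in for a query against the dictionary database) are all assumptions.

```python
def split_words(text, vocabulary, max_len=4):
    """Greedy forward maximum matching.

    At each position, take the longest substring (up to max_len characters)
    found in `vocabulary`; if nothing matches, emit a single character.
    `vocabulary` is a hypothetical stand-in for a dictionary lookup.
    """
    words = []
    i = 0
    while i < len(text):
        # Try the longest candidate first, then shrink down to one character.
        for size in range(min(max_len, len(text) - i), 0, -1):
            candidate = text[i:i + size]
            if size == 1 or candidate in vocabulary:
                words.append(candidate)
                i += size
                break
    return words


vocab = {"我", "喜欢", "学习", "中文"}
print(split_words("我喜欢学习中文", vocab))
```

A function like this could be called on the clipboard text before lookup. It will not handle words missing from the dictionary, such as proper names, which simply fall apart into single characters.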
If you have any questions about the architecture of the program, feel free to ask!
Also, I was actually considering parsing the sentences with something like jieba. That uses a more sophisticated algorithm to split the words and may work for words not covered (proper names).
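As a rough idea of what that could look like, here is a small sketch that wraps a segmenter behind one function. The `segment` name and the `cutter` parameter are made up for illustration; `jieba.lcut` is jieba's list-returning cut function, and jieba is a third-party package that would have to be installed separately.

```python
def segment(text, cutter=None):
    """Split text into words with a pluggable segmenter.

    `cutter` is any callable mapping a string to a list of tokens.
    When omitted, jieba.lcut is used (needs `pip install jieba`).
    """
    if cutter is None:
        import jieba  # third-party, imported lazily so it stays optional
        cutter = jieba.lcut
    # Drop tokens that are only whitespace.
    return [token for token in cutter(text) if token.strip()]
```

Keeping the segmenter pluggable means other languages could use a different tool without touching the calling code.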
In addition, I can also implement support for cedict as a dictionary format.
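For reference, a CC-CEDICT entry is a single line of the form `Traditional Simplified [pin1 yin1] /gloss 1/gloss 2/`, and lines starting with `#` are comments. A minimal parsing sketch, assuming that layout, might look like this; the `parse_cedict_line` name and the returned tuple shape are illustrative and not part of vocabsieve.

```python
import re

# Traditional, Simplified, [pinyin], then slash-delimited glosses.
CEDICT_LINE = re.compile(r"^(\S+)\s+(\S+)\s+\[([^\]]*)\]\s+/(.*)/\s*$")


def parse_cedict_line(line):
    """Return (traditional, simplified, pinyin, glosses), or None for
    comments, blank lines, and lines that do not match the layout."""
    line = line.strip()
    if not line or line.startswith("#"):
        return None
    match = CEDICT_LINE.match(line)
    if match is None:
        return None
    traditional, simplified, pinyin, glosses = match.groups()
    return traditional, simplified, pinyin, glosses.split("/")


print(parse_cedict_line("中國 中国 [Zhong1 guo2] /China/Middle Kingdom/"))
```

The headwords parsed this way could also double as the word list for dictionary-based splitting.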
I barely know a thing about programming, let alone coding, but you're talking about using spaCy, right? It supports 64 languages, so I guess that would work for all the other languages vocabsieve supports.
In fact, not only Chinese but also Japanese and Korean have this problem. In Vietnamese, spaces are used to separate syllables; in Thai and Lao, spaces are used to separate sentences.
There is a list of such tools:
https://polyglotclub.com/wiki/Language/Multiple-languages/Culture/Text-Processing-Tools#Word_Segmentation
@GrimPixel @BenMueller
So, is either of you going to actually implement this?
I just knew about tools for word segmentation and saw you needed them. I have no experience with them.