Comments (3)
This is the current expected behaviour of BoTrie when the inflect_n_add method is used.
The idea is that a tag differs from a POS in that it contains more information than the part of speech per se (which is nothing more than a primary UD tag).
Simply put, suppose you had inflected the entry while creating the Trie, and the input string you were tokenizing contained 'ཤིའོ་'. You would end up with the following:
" ཤིའོ་"/VERBᛃoᛃ2ᛃFalse,
which means that your token is inflected in the terminal case (o), and that in order to reconstruct the unaffixed token, you need to delete 2 chars from the cleaned syllable and not add a འ (False) at the end of the token.
The extra information here pertains to syllables that have affixed case particles. In order to correctly tokenize affixed words, we want the info required to reconstruct the unaffixed word and the full version of the case particle as well.
This is embedded in three fields delimited by the characters you are referring to.
The content of these fields is produced here.
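To make the field layout concrete, here is a minimal sketch of how such a tag could be split and used to rebuild the unaffixed syllable. It is not botok's own code: `AffixInfo`, `parse_tag`, and `unaffix` are hypothetical names, and the field meanings (affix type, number of chars to delete, whether to append a འ) are assumed from the example above.

```python
from dataclasses import dataclass

SEP = "ᛃ"  # delimiter character seen in the tags quoted in this thread


@dataclass
class AffixInfo:
    pos: str         # primary POS, e.g. "VERB"
    affix_type: str  # e.g. "o"; empty when the token is not affixed
    affix_len: int   # number of chars to delete from the cleaned syllable
    add_aa: bool     # whether a འ must be appended after deleting


def parse_tag(tag: str) -> AffixInfo:
    """Split a tag such as 'VERBᛃoᛃ2ᛃFalse' into the POS and its three extra fields."""
    pos, affix_type, affix_len, add_aa = tag.split(SEP)
    return AffixInfo(
        pos=pos,
        affix_type=affix_type,
        affix_len=int(affix_len) if affix_len else 0,
        add_aa=(add_aa == "True"),
    )


def unaffix(cleaned_syl: str, info: AffixInfo) -> str:
    """Rebuild the unaffixed syllable (cleaned, i.e. without the trailing tsek)."""
    base = cleaned_syl[:-info.affix_len] if info.affix_len else cleaned_syl
    return base + "འ" if info.add_aa else base


info = parse_tag("VERBᛃoᛃ2ᛃFalse")
print(info.pos, unaffix("ཤིའོ", info))
```

With empty fields, as in 'NOUNᛃᛃᛃ', the same sketch returns the syllable unchanged, since nothing is deleted and no འ is added.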
Anyhow, I will be documenting all this, so you will have a clearer idea of what is happening and if you want to modify the behaviour I coded or not.
from botok.
Coming back to this... maybe it could be opened to discuss the best way to reach the goal you have. At the moment, the extra characters in the token attribute 'tag' seem redundant (they are always exactly the same). Is it possible to remove them? They will cause a lot of confusion (even if documented) and give the impression that something is broken.
'NOUNᛃᛃᛃ',
'NOUNᛃᛃᛃ',
'ADPᛃᛃᛃ',
'NOUNᛃᛃᛃ',
'VERBᛃᛃᛃ',
'NOUNᛃᛃᛃ',
'VERBᛃᛃᛃ',
'ADPᛃᛃᛃ',
Ok, I take it back... I can now see examples where it's not redundant. Nevertheless, the question of what would be the cleanest way to achieve what you want to achieve stands.