Comments (6)
First of all, pybo does nothing fancy when attributing POS tags. The content of this file is read into a dict mapping each word to its POS. Then, once the tokenizer has produced a token, it checks whether the dict has a POS tag in store for it. That's all. You could do it the way I described.
edit: I had forgotten to mention that every entry entering the trie is inflected with the affixed particles, so it's a bit more than filling a dict. pybo does inflection where required and then attributes the same POS to all inflected versions.
Then, if your tokens are pybo tokens, they should have the POS tags in them by default. If not, you might want to change your tokenizer profile.
It would be wonderful to have pybo do smarter things to attribute POS, but for the moment, that's all it does.
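The mapping-table lookup described above can be sketched in a few lines. This is only an illustration of the idea, not pybo's actual code; the table entries and the tag function are toy examples:

```python
# Sketch of the lookup described above: the word list is loaded into a
# dict mapping word -> POS, then each token is looked up in it.
# The entries below are toy examples, not pybo's actual data.
pos_table = {
    'བཀྲ་ཤིས་': 'NOUN',
    'ཤི་': 'VERB',
}

def tag(token, table, default='oov'):
    # fall back to a default tag for words not found in the table
    return table.get(token, default)

print(tag('ཤི་', pos_table))   # VERB
print(tag('abc', pos_table))   # oov
```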
from botok.
In my use case, if the user of my program provides a Tibetan text of running words, my program will use pybo to first tokenize it and then do lemmatization and POS-tagging if necessary. If the user provides a Tibetan text that has already been tokenized (space-delimited) by other libraries or tools, my program will simply split the text into tokens on spaces.
So in the latter case, after splitting the text into tokens, I would have to join the list of tokens back into a string and feed it into pybo. But there's a catch: when pybo tokenizes the text a second time, the results might not match the original list (there are already spaces between the Tibetan words, and I'm not sure how pybo behaves in that case).
If pybo does POS-tagging simply by using a mapping table, it would be easy for me to write one myself.
And I'm also trying to figure out how to lemmatize a list of tokens that have already been produced by another tokenizer (not necessarily pybo).
Another question: the POS tags that pybo may assign to a token include "oov", "non-word", "non-bo", "syl", etc., which can't be found in the file you mentioned. Is there any other reference for these tags?
Even though the solution is not entirely satisfactory, here is what is possible without rewriting and/or subclassing big parts of pybo:
from pybo import BoSyl, Config, PyBoTrie


def is_tibetan_letter(char):
    """
    :param char: character to check
    :return: True or False
    """
    return 'ༀ' <= char <= '༃' or 'ཀ' <= char <= 'ྼ'


def add_tseks(word):
    TSEK = '་'
    if word and is_tibetan_letter(word[-1]) and word[-1] != TSEK:
        return word + TSEK
    return word


# prepare the tokens
in_str = 'ཤི་ མཐའི་ བཀྲ་ཤིས་ tr བདེ་ལེགས ། བཀྲ་ཤིས་ བདེ་ལེགས་ ཀཀ'
tokens = in_str.split(' ')
tokens = [add_tseks(t) for t in tokens]  # ending tseks are not in the trie

# initialize the trie
bt = PyBoTrie(BoSyl(), 'GMD', config=Config("pybo.yaml"))

# find the data stored in the trie about each token
with_pos = [(t, bt.has_word(t)) for t in tokens]
for num, w in enumerate(with_pos):
    print(num, w)

# 0 ('ཤི་', {'exists': True, 'data': 'VERBᛃᛃᛃ'})
# 1 ('མཐའི་', {'exists': True, 'data': 'NOUNᛃgiᛃ2ᛃaa'})
# 2 ('བཀྲ་ཤིས་', {'exists': True, 'data': 'NOUNᛃᛃᛃ'})
# 3 ('tr', {'exists': False})
# 4 ('བདེ་ལེགས་', {'exists': True, 'data': 'NOUNᛃᛃᛃ'})
# 5 ('།', {'exists': False})
# 6 ('བཀྲ་ཤིས་', {'exists': True, 'data': 'NOUNᛃᛃᛃ'})
# 7 ('བདེ་ལེགས་', {'exists': True, 'data': 'NOUNᛃᛃᛃ'})
# 8 ('ཀཀ་', {'exists': False})
As you can see, the data retrieved from the trie is strangely formatted, and you will need to do a little cleanup to keep only the POS tags. This is an ugly part of pybo that is in the process of being cleaned up and improved by @10zinten .
In token 1, 'NOUNᛃgiᛃ2ᛃaa' contains the POS tag (NOUN), the type of affixed particle (gi), the number of chars in the token pertaining to that affixed particle (2), and finally whether the token without the particle should end with a འ or not (aa). None of this extra information is relevant for your use case, so you might just want to strip it off using the delimiter ᛃ.
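Stripping the extra fields off with that delimiter is a one-liner. A minimal sketch (pos_of is a hypothetical helper name, not part of pybo's API):

```python
def pos_of(trie_data):
    # the POS tag is the first of the ᛃ-separated fields
    return trie_data.split('ᛃ')[0]

print(pos_of('NOUNᛃgiᛃ2ᛃaa'))  # NOUN
print(pos_of('VERBᛃᛃᛃ'))       # VERB
```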
As for the other values of Token.pos (oov, non-word, etc.), these are not POS tags per se. They give information about the type of token, produced by pybo's preprocessing and tokenizing steps, so none of it is available in what I proposed here: it is dynamically generated, not stored in the trie.
Hope that helps. Doing as you proposed, removing all spaces and then retokenizing, will definitely break up the original tokenization, which you don't want.
edit:
Pros: this second solution is able to deal with inflected words, since inflection is handled while creating the trie. The simple mapping-table approach doesn't support that.
Cons: words are expected to be well formed, and all the preprocessing that happens before tokenizing is not available.
Thanks, it's quite useful. I'll take a look and try to understand the snippet.
And is there any convenient way to directly get the lemmas of a list of tokens?
Two distinct ways of obtaining lemmas are implemented in pybo.
The primary strategy is to unaffix an inflected form. Just as cat is the lemma of cats, removing the affixed morpheme is sufficient. The cleaned and unaffixed version of the word is what ends up in Token.lemma.
In order to do that, one needs the information that is dynamically generated while creating the trie: is there an affixed particle? If so, how many chars does it take, and does the word require the addition of an འ to reconstruct the unaffixed form? Starting from external tokens won't give this information, so this type of lemma can't be derived.
This is implemented in this property of the Token class.
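Given the trie data shown earlier, the reconstruction itself can be sketched as follows. Note that unaffix is a hypothetical helper written for illustration; the real logic lives in the Token.lemma property and may differ in its details:

```python
def unaffix(token, affix_len, needs_aa):
    # hypothetical reconstruction of the unaffixed form from the trie data:
    # drop the trailing tsek, remove the particle's chars, optionally
    # restore the elided འ, then put the tsek back
    base = token.rstrip('་')[:-affix_len]
    if needs_aa:
        base += 'འ'
    return base + '་'

# 'མཐའི་' carried 'NOUNᛃgiᛃ2ᛃaa': a 2-char 'gi'-type particle, འ needed
print(unaffix('མཐའི་', 2, True))  # མཐའ་
```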
The second strategy is to retrieve a lemma from a mapping table. It is what allows retrieving lemmas such as mouse from mice. The second strategy is actually applied after the first one has been executed, so the mapping table is kept as small as possible. This is implemented in LemmatizeTokens().
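The mapping-table part is indeed easy to externalize. A minimal sketch, with toy English entries standing in for pybo's actual particle list:

```python
# toy external lemma table; pybo's real table covers the case particles
lemma_table = {'mice': 'mouse', 'geese': 'goose'}

def lemmatize(token, table):
    # fall back to the token itself when no lemma is listed
    return table.get(token, token)

print(lemmatize('mice', lemma_table))  # mouse
print(lemmatize('cat', lemma_table))   # cat
```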
So I expect that getting lemmas for outside tokens will be difficult, because the unaffixation information is dynamically generated inside pybo. On the other hand, the mapping-table part can easily be externalized.
edit:
At the moment, the lemma mapping table only contains the case particles, but everything is ready for other lists of word: lemma pairs. I am not aware of any existing lists of that kind, and creating them from scratch takes a lot of manual work...
Thanks for the information! I'll try it out.