This code builds a map from word forms to part-of-speech tags using Universal Dependencies (UD) data.
To use the ud2pos tagger, first install this repository and then instantiate the UdTagger class.
To install the code in this repository, run:
$ pip install -e .
To use the ud2pos tagger, add the following to your script:
import ud2pos
language = 'english'
tagger = ud2pos.UdTagger(language)
print('Pos tag is: %s' % tagger('Hi'))
The tagger returns 'UNK' for words that do not appear in the Universal Dependencies data.
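Since unseen words come back as 'UNK', callers may want a fallback. The following is a minimal sketch, not part of ud2pos: it retries a failed lookup with the lowercased word. The helper `tag_with_fallback` and the stand-in `toy_tagger` are hypothetical names used only so the example runs on its own; whether UdTagger already normalizes case is not stated here, so treat the retry as an assumption to verify.

```python
def tag_with_fallback(tagger, word, unknown='UNK'):
    """Tag `word`, retrying with its lowercased form if the first lookup fails."""
    tag = tagger(word)
    if tag == unknown and word != word.lower():
        tag = tagger(word.lower())
    return tag


# Hypothetical stand-in for ud2pos.UdTagger: any callable mapping a word to a tag.
toy_map = {'hi': 'INTJ', 'London': 'PROPN'}


def toy_tagger(word):
    return toy_map.get(word, 'UNK')


tags = [tag_with_fallback(toy_tagger, w) for w in ['Hi', 'London', 'zzz']]
print(tags)  # ['INTJ', 'PROPN', 'UNK']
```

In a real script, the `tagger` object created above would be passed in place of `toy_tagger`.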
To process UD data for new languages, run the following commands.
First, create a conda environment with:
$ conda env create -f environment.yml
You can download the UD data with the following command:
$ make get_ud
You can then process the data for a language with the command:
$ make process LANGUAGE=<language>
Any language available in UD should work. For instance: 'english'; 'czech'; 'basque'; 'finnish'; 'turkish'; 'arabic'; 'japanese'; 'tamil'; 'korean'; 'marathi'; 'urdu'; 'telugu'; 'indonesian'.
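To illustrate what a word-form-to-POS map looks like, here is a minimal sketch that builds one from CoNLL-U lines, the format in which UD treebanks are distributed. It is not necessarily how `make process` works: the function name `build_pos_map` and the choice of keeping the most frequent UPOS tag per form are assumptions made for this example.

```python
from collections import Counter, defaultdict


def build_pos_map(conllu_lines):
    """Map each word form to its most frequent UPOS tag."""
    counts = defaultdict(Counter)
    for line in conllu_lines:
        line = line.rstrip('\n')
        # Skip sentence-level comments and blank sentence separators.
        if not line or line.startswith('#'):
            continue
        fields = line.split('\t')
        # A CoNLL-U token line has exactly 10 tab-separated columns.
        if len(fields) != 10:
            continue
        token_id, form, upos = fields[0], fields[1], fields[3]
        # Skip multiword-token ranges (e.g. "1-2") and empty nodes (e.g. "1.1").
        if not token_id.isdigit():
            continue
        counts[form][upos] += 1
    return {form: tags.most_common(1)[0][0] for form, tags in counts.items()}


# Tiny hand-written sample in CoNLL-U layout (not real treebank data).
sample = """# text = Hi there
1\tHi\thi\tINTJ\t_\t_\t0\troot\t_\t_
2\tthere\tthere\tADV\t_\t_\t1\tadvmod\t_\t_

# text = I run
1\tI\tI\tPRON\t_\t_\t2\tnsubj\t_\t_
2\trun\trun\tVERB\t_\t_\t0\troot\t_\t_

# text = We run a run
1\tWe\twe\tPRON\t_\t_\t2\tnsubj\t_\t_
2\trun\trun\tVERB\t_\t_\t0\troot\t_\t_
3\ta\ta\tDET\t_\t_\t4\tdet\t_\t_
4\trun\trun\tNOUN\t_\t_\t2\tobj\t_\t_
""".splitlines()

pos_map = build_pos_map(sample)
print(pos_map.get('Hi', 'UNK'))       # INTJ
print(pos_map.get('run', 'UNK'))      # VERB
print(pos_map.get('unseen', 'UNK'))   # UNK
```

Keeping a single tag per form discards ambiguity ("run" as a noun is lost here), which is the trade-off of a context-free lookup.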
Upgrade the repository version in setup.py. Then run:
$ python setup.py sdist
$ twine upload dist/*
To ask questions or report problems, please open an issue.