
An NLP library for Uralic languages such as Finnish, Skolt Sami, Moksha and so on. Also supporting some non-Uralic languages such as Spanish, French, Arabic, Swedish, Norwegian, Russian and English

Home Page: http://uralicnlp.com/

License: Apache License 2.0

Python 96.18% TeX 3.82%
nlp-library uralic-languages finnish sami moksha lemmatizer morphological-analysis disambiguation constraint-grammar morphological-generation

uralicnlp's Introduction

UralicNLP

Natural language processing for many languages


UralicNLP can produce morphological analyses, generate morphological forms, lemmatize words and give lexical information about words in Uralic and other languages. The supported languages include Finnish, Russian, German, English, Norwegian, Swedish, Arabic, Ingrian, Meadow & Eastern Mari, Votic, Olonets-Karelian, Erzya, Moksha, Hill Mari, Udmurt, Tundra Nenets, Komi-Permyak, North Sami, South Sami and Skolt Sami. Currently, UralicNLP uses stable builds for the supported languages.

See the catalog of supported languages

Some of the supported languages: 🇸🇦 🇪🇸 🇮🇹 🇵🇹 🇩🇪 🇫🇷 🇳🇱 🇬🇧 🇷🇺 🇫🇮 🇸🇪 🇳🇴 🇩🇰 🇱🇻 🇪🇪

Check out UralicGUI - a graphical user interface for UralicNLP.

☕ Check out UralicNLP official Java version

♯ Check out UralicNLP official C# version

Installation

The library can be installed from PyPI.

pip install uralicNLP

If you want to use the Constraint Grammar features (from uralicNLP.cg3 import Cg3), you will also need to install VISL CG-3.

🆕 UralicNLP now uses Pyhfst, a pure Python implementation of HFST!

Faster analysis and generation

UralicNLP uses Pyhfst, which can also be installed with Cython support for faster processing times:

pip install cython
pip install --upgrade --force-reinstall pyhfst --no-cache-dir

Usage

List supported languages

The API is under constant development and new languages will be added to the nightly build system. For this reason, UralicNLP provides functionality for listing the currently supported languages. The method returns three-letter ISO codes for the languages.

from uralicNLP import uralicApi
uralicApi.supported_languages()
>>{'cg': ['vot', 'lav', 'izh', 'rus', 'lut', 'fao', 'est', 'nob', 'ron', 'olo', 'bxr', 'hun', 'crk', 'chr', 'vep', 'deu', 'mrj', 'gle', 'sjd', 'nio', 'myv', 'som', 'sma', 'sms', 'smn', 'kal', 'bak', 'kca', 'otw', 'ciw', 'fkv', 'nds', 'kpv', 'sme', 'sje', 'evn', 'oji', 'ipk', 'fit', 'fin', 'mns', 'rmf', 'liv', 'cor', 'mdf', 'yrk', 'tat', 'smj'], 'dictionary': ['vot', 'lav', 'rus', 'est', 'nob', 'ron', 'olo', 'hun', 'koi', 'chr', 'deu', 'mrj', 'sjd', 'myv', 'som', 'sma', 'sms', 'smn', 'kal', 'fkv', 'mhr', 'kpv', 'sme', 'sje', 'hdn', 'fin', 'mns', 'mdf', 'vro', 'udm', 'smj'], 'morph': ['vot', 'lav', 'izh', 'rus', 'lut', 'fao', 'est', 'nob', 'swe', 'ron', 'eng', 'olo', 'bxr', 'hun', 'koi', 'crk', 'chr', 'vep', 'deu', 'mrj', 'ara', 'gle', 'sjd', 'nio', 'myv', 'som', 'sma', 'sms', 'smn', 'kal', 'bak', 'kca', 'otw', 'ciw', 'fkv', 'nds', 'mhr', 'kpv', 'sme', 'sje', 'evn', 'oji', 'ipk', 'fit', 'fin', 'mns', 'rmf', 'liv', 'cor', 'mdf', 'yrk', 'vro', 'udm', 'tat', 'smj']}

The dictionary key lists the languages that are supported by the lexical lookup, whereas morph lists the languages that have morphological FSTs and cg lists the languages that have a CG.
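The returned dictionary can also be inspected programmatically. The sketch below operates on a small literal subset of the output shown above (rather than a live API call) and finds the languages that have all three resources:

```python
# Small literal sample modeled on the output of uralicApi.supported_languages();
# a real call would return the full dictionary shown above.
supported = {
    "cg": ["fin", "sms", "myv", "olo"],
    "dictionary": ["fin", "sms", "myv"],
    "morph": ["fin", "sms", "myv", "olo", "rus"],
}

# Languages that have a CG, a dictionary and a morphological FST
full_support = set(supported["cg"]) & set(supported["dictionary"]) & set(supported["morph"])
print(sorted(full_support))  # ['fin', 'myv', 'sms']
```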

Download models

If you have a lot of data to process, it is a good idea to download the morphological models for local use on your computer. This can be done easily. Alternatively, it is possible to use the transducers over the Akusanat API by passing force_local=False.

On the command line:

python -m uralicNLP.download --languages fin eng

From python code:

from uralicNLP import uralicApi
uralicApi.download("fin")

When models are installed, the generate(), analyze() and lemmatize() methods will automatically use them instead of the server-side API. More information about the models.

Use uralicApi.model_info(language) to see information about the FSTs and CGs such as license and authors. If you know how to make this information more accurate, please don't hesitate to open an issue on GitHub.

from uralicNLP import uralicApi
uralicApi.model_info("fin")

To remove the models of a language, run

from uralicNLP import uralicApi
uralicApi.uninstall("fin")

Lemmatize words

A word form can be lemmatized with UralicNLP. This does not do any disambiguation but rather returns a list of all the possible lemmas.

from uralicNLP import uralicApi
uralicApi.lemmatize("вирев", "myv")
>>['вирев', 'вирь']
uralicApi.lemmatize("luutapiiri", "fin", word_boundaries=True)
>>['luuta|piiri', 'luu|tapiiri']

An example of lemmatizing the word вирев in Erzya (myv). By default, a descriptive analyzer is used; use uralicApi.lemmatize("вирев", "myv", descriptive=False) for a non-descriptive analyzer. If word_boundaries is set to True, the lemmatizer marks word boundaries with a |. You can also use your own transducer.
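If needed, the boundary-marked lemmas can be split into their component words with plain string operations. A minimal sketch, operating on the literal output shown above rather than a live lemmatizer call:

```python
# Output of lemmatize("luutapiiri", "fin", word_boundaries=True), as shown above
lemmas = ['luuta|piiri', 'luu|tapiiri']

# Split each boundary-marked lemma into its component words
split_lemmas = [lemma.split("|") for lemma in lemmas]
print(split_lemmas)  # [['luuta', 'piiri'], ['luu', 'tapiiri']]
```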

Morphological analysis

Apart from just getting the lemmas, it's also possible to perform a complete morphological analysis.

from uralicNLP import uralicApi
uralicApi.analyze("voita", "fin")
>>[['voi+N+Sg+Par', 0.0], ['voi+N+Pl+Par', 0.0], ['voitaa+V+Act+Imprt+Prs+ConNeg+Sg2', 0.0], ['voitaa+V+Act+Imprt+Sg2', 0.0], ['voitaa+V+Act+Ind+Prs+ConNeg', 0.0], ['voittaa+V+Act+Imprt+Prs+ConNeg+Sg2', 0.0], ['voittaa+V+Act+Imprt+Sg2', 0.0], ['voittaa+V+Act+Ind+Prs+ConNeg', 0.0], ['vuo+N+Pl+Par', 0.0]]

An example of analyzing the word voita in Finnish (fin). The default analyzer is descriptive. To use a normative analyzer instead, use uralicApi.analyze("voita", "fin", descriptive=False). You can also use your own transducer.
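Each analysis string consists of a lemma followed by +-separated tags, so the parts can be unpacked with ordinary string operations. A sketch using a literal sample of the output shown above:

```python
# Two analyses copied from the analyze() output above
analyses = [['voi+N+Sg+Par', 0.0], ['voittaa+V+Act+Imprt+Sg2', 0.0]]

# Split each analysis into (lemma, tag list)
parsed = []
for analysis, weight in analyses:
    lemma, *tags = analysis.split("+")
    parsed.append((lemma, tags))
print(parsed)  # [('voi', ['N', 'Sg', 'Par']), ('voittaa', ['V', 'Act', 'Imprt', 'Sg2'])]
```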

Morphological generation

From a lemma and a morphological analysis, it's possible to generate the desired word form.

from uralicNLP import uralicApi
uralicApi.generate("käsi+N+Sg+Par", "fin")
>>[['kättä', 0.0]]

An example of generating the singular partitive form of the Finnish noun käsi. The result is kättä. The default generator is a regular normative generator. uralicApi.generate("käsi+N+Sg+Par", "fin", dictionary_forms=True) uses a normative dictionary generator and uralicApi.generate("käsi+N+Sg+Par", "fin", descriptive=True) a descriptive generator. You can also use your own transducer.
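Generation pairs naturally with a loop over tag combinations. The sketch below only builds the analysis strings for a few singular cases of käsi, following the tag format shown above; passing them to uralicApi.generate(query, "fin") would additionally require the fin models to be installed.

```python
lemma = "käsi"
cases = ["Nom", "Gen", "Par"]

# Build one generation query per case, in the lemma+POS+Number+Case format above
queries = [f"{lemma}+N+Sg+{case}" for case in cases]
print(queries)  # ['käsi+N+Sg+Nom', 'käsi+N+Sg+Gen', 'käsi+N+Sg+Par']

# Each query could then be passed to uralicApi.generate(query, "fin")
```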

Morphological segmentation

UralicNLP makes it possible to split a word form into morphemes. (Note: this does not work with all languages)

from uralicNLP import uralicApi
uralicApi.segment("luutapiirinikin", "fin")
>>[['luu', 'tapiiri', 'ni', 'kin'], ['luuta', 'piiri', 'ni', 'kin']]

In the example, the word luutapiirinikin has two possible interpretations, luu|tapiiri and luuta|piiri; the segmentation is done for both interpretations.

Access the HFST transducer

If you need lower-level access to the HFST transducer object, you can use the following code:

from uralicNLP import uralicApi
sms_generator = uralicApi.get_transducer("sms", analyzer=False) #generator
sms_analyzer = uralicApi.get_transducer("sms", analyzer=True) #analyzer

The same parameters can be used here as for generate() and analyze() to specify whether you want to use the normative or descriptive analyzers and so on. The defaults are get_transducer(language, cache=True, analyzer=True, descriptive=True, dictionary_forms=True).

Syntax - Constraint Grammar disambiguation

Note that this requires the models to be installed (see above) as well as VISL CG-3. The disambiguation process is simple:

from uralicNLP.cg3 import Cg3
from uralicNLP import tokenizer
sentence = "Kissa voi nauraa"
tokens = tokenizer.words(sentence)
cg = Cg3("fin")
print(cg.disambiguate(tokens))
>>[(u'Kissa', [<Kissa - N, Prop, Sg, Nom, <W:0.000000>>, <kissa - N, Sg, Nom, <W:0.000000>>]), (u'voi', [<voida - V, Act, Ind, Prs, Sg3, <W:0.000000>>]), (u'nauraa', [<nauraa - V, Act, InfA, Sg, Lat, <W:0.000000>>])]

The return value is a list of tuples. The first item in each tuple is the word form used in the sentence; the second item is a list of Cg3Word objects. In the case of a full disambiguation, these lists have only one Cg3Word object, but sometimes the result of the disambiguation still has some ambiguity. Each Cg3Word object has three attributes: lemma, form and morphology.

disambiguations = cg.disambiguate(tokens)
for disambiguation in disambiguations:
    possible_words = disambiguation[1]
    for possible_word in possible_words:
        print(possible_word.lemma, possible_word.morphology)
>>Kissa [u'N', u'Prop', u'Sg', u'Nom', u'<W:0.000000>']
>>kissa [u'N', u'Sg', u'Nom', u'<W:0.000000>']
>>voida [u'V', u'Act', u'Ind', u'Prs', u'Sg3', u'<W:0.000000>']
>>nauraa [u'V', u'Act', u'InfA', u'Sg', u'Lat', u'<W:0.000000>']

cg.disambiguate takes remove_symbols as an optional argument. Its default value is True, which means that the symbols (segments surrounded by @) are removed from the FST output before it is fed to the CG disambiguator. If the value is set to False, the FST morphology is fed into the CG unmodified.
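To illustrate what symbol removal means, the sketch below strips @-surrounded segments from a hypothetical FST reading. This is only an illustration of the described behaviour, not UralicNLP's internal code, and the reading string is made up for the example:

```python
import re

# Hypothetical FST reading containing an @...@ symbol segment
reading = "voi+N+@U.NeedNoun.ON@+Sg+Par"

# Drop every @...@ segment, along with the '+' that introduces it
cleaned = re.sub(r"\+?@[^@]*@", "", reading)
print(cleaned)  # voi+N+Sg+Par
```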

The default FST analyzer is a descriptive one; to use a normative analyzer, set the descriptive parameter to False: cg.disambiguate(tokens, descriptive=False).

Multilingual CG

It is possible to run one CG with tags produced by transducers of multiple languages.

from uralicNLP.cg3 import Cg3
cg = Cg3("fin", morphology_languages=["fin", "olo"])
print(cg.disambiguate(["Kissa","on","kotona", "."], language_flags=True))

The code above will use the Finnish (fin) CG rules to disambiguate the tags produced by Finnish (fin) and Olonets-Karelian (olo) transducers. The language_flags parameter can be used to append the language code at the end of each morphological reading to identify the transducer that produced the reading.

It is also possible to pipe multiple CG analyzers. This will run the initial morphological analysis in the first CG, disambiguate and pass the disambiguated results to the next CG analyzer.

from uralicNLP.cg3 import Cg3, Cg3Pipe

cg1 = Cg3("fin")
cg2 = Cg3("olo")

cg_pipe = Cg3Pipe(cg1, cg2)
print(cg_pipe.disambiguate(["Kissa","on","kotona", "."]))

The example above will create a CG analyzer for Finnish and Olonets-Karelian and pipe them into a Cg3Pipe object. The analyzer will first use Finnish CG with a Finnish FST to disambiguate the sentence, and then Olonets-Karelian CG to do a further disambiguation. Note that FST is only run in the first CG object of the pipe.

Dictionaries

UralicNLP makes it possible to obtain lexicographic information from the Giella dictionaries. This information can include translations, example sentences, semantic tags, morphological information and so on. You have to specify the language code of the dictionary.

For example, "sms" selects the Skolt Sami dictionary. The query word, however, can be in any language. If the word is a lemma in Skolt Sami, the result appears in "exact_match"; if it is a word form of a Skolt Sami word, the results appear in "lemmatized"; and if it is a word in some other language, the results appear in "other_languages". For instance, if you search for cat in the Skolt Sami dictionary, you will get a result of the form {"other_languages": [Skolt Sami lexical items that translate to cat]}.

An example of querying the Skolt Sami dictionary with car.

from uralicNLP import uralicApi
uralicApi.dictionary_search("car", "sms")
>>{'lemmatized': [], 'exact_match': [], 'other_languages': [{'lemma': 'autt', ...}, ...]}
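The three result keys can be checked in priority order. The sketch below operates on a literal sample shaped like the output above, with the elided "..." fields omitted, and assumes each entry carries a 'lemma' field as the sample suggests:

```python
# Sample shaped like the dictionary_search() result above (elided fields omitted)
result = {"lemmatized": [], "exact_match": [], "other_languages": [{"lemma": "autt"}]}

# Prefer an exact lemma match, then a lemmatized word-form match,
# then translations from other languages.
for key in ("exact_match", "lemmatized", "other_languages"):
    if result[key]:
        lemmas = [entry["lemma"] for entry in result[key]]
        break
else:
    lemmas = []
print(lemmas)  # ['autt']
```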

It is possible to list all lemmas in the dictionary:

from uralicNLP import uralicApi
uralicApi.dictionary_lemmas("sms")
>> ['autt', 'sokk' ...]

You can also group the lemmas by part of speech:

from uralicNLP import uralicApi
uralicApi.dictionary_lemmas("sms",group_by_pos=True)
>> {"N": ['autt', 'sokk' ...], "V":[...]}

Fast Dictionary Look-ups

By default, UralicNLP uses a TinyDB backend. This is convenient as it does not require an external database server, but it can be extremely slow. For this reason, UralicNLP also provides a MongoDB backend.

Make sure you have both MongoDB and pymongo installed.

First, you will need to download the dictionary and import it to MongoDB. The following example shows how to do it for Komi-Zyrian.

from uralicNLP import uralicApi

uralicApi.download("kpv") #Download the latest dictionary data
uralicApi.import_dictionary_to_db("kpv") #Update the MongoDB with the new data

After the initial setup, you can use the dictionary queries, but you will need to specify the backend.

from uralicNLP import uralicApi
from uralicNLP.dictionary_backends import MongoDictionary
uralicApi.dictionary_lemmas("sms",backend=MongoDictionary)
uralicApi.dictionary_search("car", "sms",backend=MongoDictionary)

Now you can query the dictionaries fast.

Parsing UD CoNLL-U annotated TreeBank data

UralicNLP comes with tools for parsing and searching CoNLL-U formatted data. Please refer to the Wiki for the UD parser documentation.

Semantics

UralicNLP provides semantic models for Finnish (SemFi) and for other Uralic languages (SemUr): Komi-Zyrian, Erzya, Moksha and Skolt Sami. Find out how to use the semantic models.

Other functionalities

Cite

If you use UralicNLP in an academic publication, please cite it as follows:

Hämäläinen, Mika. (2019). UralicNLP: An NLP Library for Uralic Languages. Journal of Open Source Software, 4(37), 1345. https://doi.org/10.21105/joss.01345

@article{uralicnlp_2019, 
    title={{UralicNLP}: An {NLP} Library for {U}ralic Languages},
    DOI={10.21105/joss.01345}, 
    journal={Journal of Open Source Software}, 
    author={Mika Hämäläinen}, 
    year={2019}, 
    volume={4},
    number={37},
    pages={1345}
}

For citing the FSTs and CGs, see uralicApi.model_info(language).

The FST and CG tools and dictionaries come mostly from the GiellaLT repositories and Apertium.

uralicnlp's People

Contributors

mikahama, mokha, rueter, snomos


uralicnlp's Issues

Installation on windows (hfst-dev)?

Despite trying for a few hours, I'm not able to make this work on Windows 10. Apparently binaries for "hfst" are available for Python 3.6 and 3.7 (win32), but there is nothing for "hfst-dev", which is required by uralicNLP.

Any ideas how to make this package work in Windows, e.g., minimal steps needed starting from a fresh environment?

Method name typo: "descrpitive"

A tiny detail.

uralicApi.lemmatize, uralicApi.analyze etc. seem to have a descrpitive parameter instead of descriptive. I guess it is a typo.

By the way, is there any documentation of the lemmatization behavior when descrpitive is True vs False? How about speed benchmarks?

Randomness in lemmatization

Lemmatization seems to return its results in a non-deterministic order for each session run.

To Reproduce

Create the following test.py:

from uralicNLP import uralicApi
if __name__ == "__main__":
    for i in range(5):
        print(uralicApi.lemmatize(
            "liikkumisesta", language="fin", word_boundaries=False, descrpitive=True
        ))

and run it several times with python3.7 test.py from the command line. The order of the results is the same for each call of the function within the same session run, but it varies between script runs.

Results look like this e.g. for 4 runs:

user: ~/my/local/path $ python3.7 test.py 
['liikkuminen', 'liikkua']
['liikkuminen', 'liikkua']
['liikkuminen', 'liikkua']
['liikkuminen', 'liikkua']
['liikkuminen', 'liikkua']
user: ~/my/local/path $ python3.7 test.py 
['liikkua', 'liikkuminen']
['liikkua', 'liikkuminen']
['liikkua', 'liikkuminen']
['liikkua', 'liikkuminen']
['liikkua', 'liikkuminen']
user: ~/my/local/path $ python3.7 test.py 
['liikkuminen', 'liikkua']
['liikkuminen', 'liikkua']
['liikkuminen', 'liikkua']
['liikkuminen', 'liikkua']
['liikkuminen', 'liikkua']
user: ~/my/local/path $ python3.7 test.py 
['liikkuminen', 'liikkua']
['liikkuminen', 'liikkua']
['liikkuminen', 'liikkua']
['liikkuminen', 'liikkua']
['liikkuminen', 'liikkua']

Expected behavior
Deterministic results.

Desktop (please complete the following information):

  • OS: Ubuntu 16
  • uralicNLP version: 1.2.3 (installed with pip)
  • HFST version: 3.15.0.0b0
  • Python version: 3.7

Language model downloaded with the command python3.7 -m uralicNLP.download --languages fin

Is there a way to fix the seed or make this work in a deterministic manner? Does the order of lemmas imply a probability ranking? If not, one could just sort the results to force determinism, I believe (assuming the lemmatization results themselves are deterministic).

In any case, I think this should be documented somewhere in the readme.

Morphology format in documentation/docstrings

I'm trying to use the constraint grammar disambiguation functionality. While provided explanation and example are helpful, the morphology format is not explained: what are the possible tags? What is their meaning?

The CG3 documentation lacks this information as well, although I assume that could be because the tags are language-specific (?): https://edu.visl.dk/cg3/single/#stream-vislcg

I've analyzed some sentences and found these tags:

'<W:0.000000>', '<W:281474976710655.000000>', '<fin>', '?', 
'@+FAUXV', '@+FMAINV', '@-FAUXV', '@-FMAINV', '@<ADVL', 
'@<OBJ', '@<P', '@<PN', '@<SPRED', '@<SUBJ', '@>N', '@>P', 
'@ADVL', '@CNP', '@CVP', '@HAB', '@LOC', '@N<', '@N>', '@NES', 
'@Num<', '@OBJ', '@OBJ>', '@P>', '@PC', '@PCLE', '@SUBJ', 
'@SUBJ>', '@X', 'A', 'ABBR', 'ACR', 'Abe', 'Abl', 'Acc', 'Act', 'Ade', 
'Adv', 'AgPrc', 'All', 'Arab', 'Attr', 'CC', 'CS', 'Card', 'Cmp', 'Cmp/Hyph', 
'Cmpnd', 'Coll', 'Comp', 'ConNeg', 'Cond', 'Dash', 'Dem', 'Der/inen', 
'Der/minen', 'Der/s', 'Der/ttain', 'Digit', 'Ela', 'Err/Orth', 'Ess', 
'Foc/han', 'Foc/ka', 'Foc/kaan', 'Foc/kin', 'Foc/pa', 'Gen', 'Gram/IAbbr', 
'Gram/TAbbr', 'Gram/TNumAbbr', 'Hom1', 'Ill', 'Imprt', 'Ind', 
'Indef', 'Ine', 'InfA', 'InfE', 'InfMa', 'Ins', 'Interj', 'Interr', 'Lat', 'N', 
'Neg', 'NegPrc', 'Nom', 'Num', 'Ord', 'Par', 'Pcle', 'Pe4', 'Pers', 'Pl', 
'Pl1', 'Pl2', 'Pl3', 'Po', 'Pot', 'Pr', 'Pref', 'Pref-', 'PrfPrc', 'Pron', 
'Prop', 'Propn', 'Prs', 'PrsPrc', 'Prt', 'Pss', 'Punct', 'Px3', 'PxPl1', 
'PxPl2', 'PxPl3', 'PxSg1', 'PxSg2', 'PxSg3', 'Qnt', 'Qst', 'Qu', 
'Refl', 'Rel', 'Sem/Curr', 'Sem/Fem', 'Sem/Geo', 'Sem/Geon', 
'Sem/Human', 'Sem/Humann', 'Sem/Org', 'Sg', 'Sg1', 'Sg2', 
'Sg3', 'Superl', 'Tra', 'V', 'VPcle', 'ela', 'gen', 'ill', 'n', 'par', 'sg'

While I can guess the meaning of some of them, I'm not exactly sure. I also don't know if this list is complete.

Some morphological tags are missing

I noticed some unexpected behaviour. When I do:

from uralicNLP.cg3 import Cg3
cg = Cg3("kpv")
cg.disambiguate(["Ми", "уджалам", "вӧр", "керкаын", "."])

I get the result:

[('Ми', [<ми - Pron, Pers, Pl1, Nom, <W:0.000000>>]),
 ('уджалам', [<уджавны - V, IV, <W:0.000000>>]),
 ('вӧр',
  [<вӧр - N, Sg, Acc, <W:0.000000>>,
   <вӧр - N, Sg, Nom, <W:0.000000>>,
   <вӧрны - V, IV, <W:0.000000>>]),
 ('керкаын', [<керка - N, Sg, Ine, <W:0.000000>>]),
 ('.', [<. - CLB, <W:0.000000>>])]

The problem is that the verb уджавны 'to work' has only the tags V and IV, missing the others for person and tense etc. It seems that all verbs behave this way at the moment.

echo "Ми уджалам вӧр керкаын." | hfst-tokenise --giella-cg -W $GTHOME/langs/kpv/tools/tokenisers/tokeniser-disamb-gt-desc.pmhfst | vislcg3 -g $GTHOME/langs/kpv/src/syntax/disambiguator.cg3

"<Ми>"
	"ми" Pron Pers Pl1 Nom
: 
"<уджалам>"
	"уджавны" V IV Ind Prs Pl1
: 
"<вӧр>"
	"вӧр" N Sg Acc
	"вӧр" N Sg Nom
: 
"<керкаын>"
	"керка" N Sg Ine
"<.>"
	"." CLB

Otherwise everything is working very nicely!

[BUG] Estonian lemmatization never finishes

Describe the bug
Estonian lemmatization does not return any results, the call gets stuck forever.

To Reproduce
Steps to reproduce the behavior:

  1. Build this dockerfile
FROM python:3.6.13-slim-buster
RUN pip install uralicNLP
RUN python -m uralicNLP.download --languages est
RUN python -c 'from uralicNLP import uralicApi; uralicApi.lemmatize("suvesse", "est")'
  2. Build will fail because the last command never finishes

Expected behavior
A lemmatization result is returned shortly after entering the command.

Desktop (please complete the following information):

  • OS: Debian
  • Version: Buster
  • Python version 3.6.13

Komi-Zyrian missing from list

I don't see Komi-Zyrian now among the supported languages. However, it seems to be working already:

uralicApi.lemmatize("пукалыштам", "kpv")
{'results': ['пукавны', 'пукны', 'пукалыштны']}

Thanks for putting this together, it looks very useful!

Refactor to make a more generalized cg3 library?

I am working on a similar project for Russian, and I just got around to trying to implement vislcg3 in python, and I found your repo.

The same way that both of our modules depend on the hfst python module, I was planning on using/making a separate module for cg3 that my project would depend on.

Would you be interested in refactoring your code to split it into two different modules? One would be a more generalizable Python implementation of cg3 subprocessing, and then your uralicNLP and my udar could both simply depend on that module for the cg3 parts of our projects. (Whereas currently, if I understand correctly, your cg3.py combines downloading models, checking your online service, and the actual work of calling the subprocess to process input with a grammar.)

Let me know if you're interested.

nds is not available [BUG] 🐛 🕷 🐞

Describe the bug
nds model is missing or not available for download

To Reproduce
Install package

pip install uralicNLP

Download nds model

python -m uralicNLP.download -l nds
# or in python
# from uralicNLP import uralicApi
# uralicApi.model_info("nds")

The error message:

Downloading analyser for nds
 99% 48/48.640625 [00:00<00:00, 292.24it/s]
Model analyser for nds was downloaded
Downloading analyzer.pt for nds
No content-length, cannot show progress for download
Couldn't download analyzer.pt for nds. It might be that the model for the language is not supported yet.
Downloading generator.pt for nds
No content-length, cannot show progress for download
Couldn't download generator.pt for nds. It might be that the model for the language is not supported yet.
Downloading lemmatizer.pt for nds
No content-length, cannot show progress for download
Couldn't download lemmatizer.pt for nds. It might be that the model for the language is not supported yet.
Downloading analyser-norm for nds
 98% 44/44.9248046875 [00:00<00:00, 260.70it/s]
Model analyser-norm for nds was downloaded
Downloading analyser-dict for nds
No content-length, cannot show progress for download
Couldn't download analyser-dict for nds. It might be that the model for the language is not supported yet.
Downloading generator-desc for nds
 98% 43/43.9716796875 [00:00<00:00, 254.54it/s]
Model generator-desc for nds was downloaded
Downloading generator-norm for nds
100% 43/43.1123046875 [00:00<00:00, 254.70it/s]
Model generator-norm for nds was downloaded
Downloading generator for nds
No content-length, cannot show progress for download
Couldn't download generator for nds. It might be that the model for the language is not supported yet.
Downloading cg for nds
 99% 68/68.82421875 [00:00<00:00, 396.78it/s]
Model cg for nds was downloaded
Downloading metadata.json for nds
 75% 1/1.33203125 [00:00<00:00, 553.56it/s]
Model metadata.json for nds was downloaded
Downloading dictionary.json for nds
 98% 1/1.015625 [00:00<00:00, 577.17it/s]
Model dictionary.json for nds was downloaded

Calling uralicApi.dictionary_lemmas("nds") would result in

EmptyDatabaseException: The dictionary is empty for nds in path /usr/local/lib/python3.7/dist-packages/uralicNLP/models/nds/dictionary.json


Desktop (please complete the following information):

  • OS: Google Colab
  • Python version (2 is not supported at all): 3.7


how to handle multiple generated forms

I am trying to generate particular form for a set of words. Mostly everything produces good results, but sometimes there are two results, such as here:

uralicApi.generate("rakennusteline+N+Pl+Ela", "fin")
[('rakennustelinehistä', 0.0), ('rakennustelineistä', 0.0)]

It appears the first result is from some local dialect, and the second form is the more correct literary form. Why is this? Is there a way to limit the output to just the literary form?

Initial Update

The bot created this issue to inform you that pyup.io has been set up on this repo.
Once you have closed it, the bot will open pull requests for updates as soon as they are available.

[BUG] 🐛 🕷 🐞 wrong superlative for Finnish 'halpa'

uralicNLP.generate(query="halpa+A+Superl+Sg+Nom", language="fin")

expected: [('halvin', 0.0)]
actual: [('halvoin', 0.0)]

Seems like probably a giellalt issue, but do they even have an issue tracker?

Also, what is the meaning of 0.0 as the second element of like every single tuple? What is that for?

[BUG] Finnish morphological generation produces empty output in v1.5.1 (and v1.5.0)

Describe the bug
Finnish morphological generation produces empty output in v1.5.1 (and v1.5.0).

To Reproduce
Steps to reproduce the behavior:

  1. Run
docker run -it --rm python:3.8-slim-buster bash
pip install uralicNLP==1.5.1
python -m uralicNLP.download --languages fin
python
>>> from uralicNLP import uralicApi
>>> uralicApi.generate("käsi+N+Sg+Par", "fin")
  2. Output:
    []

Expected behavior
Expected output:
[['kättä', 0.0]]

Desktop (please complete the following information):

  • OS: Debian 10 (in Docker) / Ubuntu 18.04 (host)
  • Python version: 3.8.17

Adding Constraint Grammar disambiguation

For several of these languages there are already disambiguation rules which remove most of the readings. Would it be possible somehow to add this to uralicNLP? I have no idea about the technical problems that may be related to this, but it would make my life a million times easier. This would require sending more than one token at a time, but I assume that would be practical anyway? Thank you for the fantastic work!
