
morphodita's People

Contributors

foxik, strakova


morphodita's Issues

Installation of Python bindings fails on Python 3.8

The compilation error is:

morphodita/morphodita_python.cpp:3321:3: error: use of undeclared identifier '_PyObject_GC_UNTRACK'
      _PyObject_GC_UNTRACK(descr);
      ^

I guess this is caused by the fact that versions of SWIG prior to August 2019 did not support Python 3.8 (cf. http://www.swig.org/). If so, would you please consider doing a new release of the Python bindings on PyPI, generated with a newer version of SWIG?

MorphoDiTa returns empty lemma

I've found out that morphological analysis may return an empty lemma for some forms. Is this an obscure feature, or is it a bug? It was completely unexpected for me, and I could not find any note on this behavior even after the fact.

An example form set which, using Czech Morfflex PDT from 15th Nov 2016, demonstrates this is ['Řekni', 'I.', 'ovi', '.']. The resulting lemmata are ['říci', 'i.', '', '.']. All tokens have a reasonable tag, though.

I use the Python bindings with MorphoDiTa version 1.9.2.1 installed via pip. When using MorphoDiTa as a LINDAT service, it gets more interesting. Given 'Řekni I. ovi.' as plain text, it splits 'I.' into two tokens, but 'ovi' again ends up lemma-less. Providing the input in vertical format yields a lemma even for the 'ovi' part, though. However, if the sentence is changed to 'Daří se I. ovi.' and provided in vertical format, the empty-lemma problem arises again.

(Actually, I would agree that the text does not seem to be tokenized correctly and that the middle two forms should form a single token. However, I have hundreds of similar sentences in my corpus (ČNK syn v4, if you're interested), and since the corpus comes pre-tokenized, it seems better to go with the existing tokenization, no matter how imperfect it might be. Anyway, as I've already said, this behaviour was completely unexpected for me and introduced some bugs into my data.)
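
For reference, a minimal reproduction sketch using the Python bindings (assuming the czech-morfflex-pdt-161115 tagger model file is present in the working directory):

from ufal.morphodita import Tagger, Forms, TaggedLemmas

tagger = Tagger.load("czech-morfflex-pdt-161115.tagger")
forms, lemmas = Forms(), TaggedLemmas()
for form in ["Řekni", "I.", "ovi", "."]:
    forms.push_back(form)
tagger.tag(forms, lemmas)
for form, analysis in zip(forms, lemmas):
    # 'ovi' comes back with an empty lemma but a reasonable tag
    print(form, repr(analysis.lemma), analysis.tag)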

Add Python type stubs

The Python SWIG invocation uses -builtin, presumably for performance reasons? That means that no wrapper classes are generated in morphodita.py, which in turn means that static analysis tools can't introspect MorphoDiTa's public API without importing the native library, which they often don't do by default (Pylint) or at all (Pyright is a notable and popular example).

Would you consider adding a type stub file to remedy this? I took a stab at creating one (because I finally got fed up with Pyright complaining that Tagger is not a known member of module, or missing out on other diagnostics when silencing the complaint with # type: ignore :) ) which you can freely use to get started if you'd like: https://github.com/dlukes/dotfiles/blob/master/python/typings/ufal/morphodita.pyi
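
For illustration, a fragment of such a stub might look like this (the signatures below are assumptions inferred from the documented bindings API, not generated from the actual module):

# ufal/morphodita.pyi -- illustrative fragment only
class Tagger:
    @staticmethod
    def load(fname: str) -> "Tagger | None": ...
    def newTokenizer(self) -> "Tokenizer | None": ...
    def tag(self, forms: "Forms", tags: "TaggedLemmas") -> None: ...

class Tokenizer:
    def setText(self, text: str) -> None: ...
    def nextSentence(self, forms: "Forms", tokens: "TokenRanges") -> bool: ...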

Can the performance of the guesser be improved?

I noticed that the tagger with the guesser enabled is sometimes very slow. For example, the 100-token sentence below took the Czech tagger about 3 s on my laptop, but only 55 ms without the guesser.

Could the performance of the guesser be improved? See the timing sketch after the sentence below.


(Sorry, I know it's actually Slovak, but sometimes the data is not clean...)

Imunogény pozostávajú z obalových polypeptidov E vírusov s mol. hmot. cca 57 000 hm. j., s nasledujúcim sledom aminokyselín (KE): SRCTHLENRD FVTGTQGTTR VTL VLELGGC VTITAEGKPS MDVWLDATYQ ENPAKTREYC LHAKLSDTKV AARCPT MGPA TLAEEHQGGT VKVEPHTGDY VAANETHSGR KTASFTISSE KTTLTMGEYG DVSL LCRVAS GVDLAQTVIL ELDKTVEHLP TAWQVHRDWF NDLALPWHKE GAQNWNNA ER LVEFGAPHAV KMDVYNLGDQ TGVLLKALAG VPVAHIEGTK YHLKSGHVTC EVGLEKLKMK GLTYTMCDKT KFTWKRAPTD SGHDTVVMEV TFSGTKPCRI PVRA VAHGSP DVNVAMLITP NPTIENNGGG FIEMQLPPGD NIIYVGELSH QWFOKGSS TG RVFQKTKKGI ERLTVIGEHA WDFGSAGGFL SSIGKAVHTV LGGAFNSIFG GVGFLPKLLL GVALAWLGLN MRNPTMSMSF LLAGGLVLAM GLGVGA, ktoré sú komplexne viazané svojími hydrofóbnymi C-koncami s bakteriálnymi proteozónami.
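
A timing sketch along these lines could quantify the difference. Here long_sentence is a hypothetical variable holding the text above, and passing the guesser mode as a third argument to tag (mirroring the C++ morpho::guesser_mode values 0/1) is an assumption about the bindings:

import time
from ufal.morphodita import Tagger, Forms, TaggedLemmas

tagger = Tagger.load("czech-morfflex-pdt-161115.tagger")
forms, lemmas = Forms(), TaggedLemmas()
for form in long_sentence.split():  # long_sentence: the text above (hypothetical)
    forms.push_back(form)

for label, guesser in (("no guesser", 0), ("guesser", 1)):
    start = time.perf_counter()
    tagger.tag(forms, lemmas, guesser)
    print(label, round(time.perf_counter() - start, 3), "s")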

Tokens lost when splitting long sentences

Sorry for being so hung up on tokenization, but I think I've come across another minor bug. Unfortunately, it seems to be related to the sentence-splitting of extremely long sentences, so I haven't found a nice and clean way to reproduce it other than using this monster as input. Without further ado:

$ wget https://gist.githubusercontent.com/dlukes/ad3f99a079641855f028fa8ba1ea68e2/raw/63b77c35010e860474dc277ed9f90a59381ca90a/sample.txt

Now look for the following pattern, which occurs only once in the file:

$ grep -o '(21:47 DEedhCjlyt :0 YQTxWzFwPQ  DEedhCjlyt 8) YQTxWzFwPQ  49)' sample.txt
(21:47 DEedhCjlyt :0 YQTxWzFwPQ  DEedhCjlyt 8) YQTxWzFwPQ  49)

Notice that there is :0 roughly at the end of the first quarter of the substring. Now if I tokenize the file using MorphoDiTa d9e7b78 and paste the tokens back together, the 0 is gone:

$ /path/to/md/master/run_tokenizer --tokenizer=czech --output=vertical sample.txt | tail -n+2837 | head -n14 | paste -sd" "
( 21 : 47 DEedhCjlyt :  YQTxWzFwPQ DEedhCjlyt 8 ) YQTxWzFwPQ 49 )

Whereas if I do the same thing using MorphoDiTa stable (1.3), it is preserved:

$ /path/to/md/stable/run_tokenizer --tokenizer=czech --output=vertical sample.txt | tail -n+2838 | head -n14 | paste -sd" "
( 21 :  47 DEedhCjlyt : 0 YQTxWzFwPQ DEedhCjlyt 8 ) YQTxWzFwPQ 49

Admittedly, this is very unusual input, but more "conventional" long sentences will probably be affected by this as well, won't they?
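
As a side note, one way to scan any input for lost characters is to compare the reported token ranges against the original text. A sketch using the Python bindings, assuming TokenRange offsets index the text in the same character units Python uses for slicing (as in the official binding examples):

from ufal.morphodita import Tokenizer, Forms, TokenRanges

text = open("sample.txt", encoding="utf-8").read()
tokenizer = Tokenizer.newCzechTokenizer()
forms, ranges = Forms(), TokenRanges()
tokenizer.setText(text)
pos, lost = 0, []
while tokenizer.nextSentence(forms, ranges):
    for r in ranges:
        # Anything non-whitespace between consecutive tokens was dropped.
        lost.extend(ch for ch in text[pos:r.start] if not ch.isspace())
        pos = r.start + r.length
print("lost characters:", "".join(lost) or "(none)")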

Using morphodita with nametag raises exception

I tried to use both morphodita and nametag at the same time, but a strange exception is raised when I import nametag into the project:

from ufal.morphodita import Tagger, Forms, TokenRanges
import ufal.nametag  # this line causes the problem

forms = Forms()
tokens = TokenRanges()
tagger = Tagger.load(path_to_model)  # path_to_model: path to a tagger model
tokenizer = tagger.newTokenizer()
tokenizer.setText(text)  # text: any input string
while tokenizer.nextSentence(forms, tokens):
    pass

Result:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: in method 'Tokenizer_nextSentence', argument 2 of type 'std::vector< std::string,std::allocator< std::string > > *'

If I switch the order of the imports, the exception is not raised in the example above, but then a different exception is raised when I try to use nametag.

Misleading docs on determining lemma structure

Me again. Some time ago, I got quite confused by the API reference at ÚFAL, which, regarding parts of lemma structure, states:

These parts are stored in one string and the boundaries between them can be determined by morpho::raw_lemma_len and morpho::lemma_id_len methods.

Given that on PyPI, it says

The bindings is a straightforward conversion of the C++ bindings API.

I had a hard time finding out why there is no lemma_id_len method on Morpho.

I have to admit that the PyPI page in particular shows the C++ bindings API as providing the rawLemma and lemmaId functions, and the same can be seen in the 'C++ Bindings API' section of the ÚFAL API reference, but I still find this quite misleading.

Perhaps there would be a way to clarify this?
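
In the meantime, for anyone else who lands here: the bindings return the lemma parts directly as strings, so the *_len methods aren't needed. A sketch (the model path and example lemma are placeholders; the actual lemma structure is model-specific):

from ufal.morphodita import Morpho

morpho = Morpho.load("czech-morfflex-161115.dict")  # placeholder model path
lemma = "pes-1"  # placeholder structured lemma
print(morpho.rawLemma(lemma))  # the raw lemma part
print(morpho.lemmaId(lemma))   # the raw lemma plus the lemma id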

UnicodeDecodeError in run_tagger.py with English tagger.

I have encountered the following issue when I tested the example code for python bindings:

echo "manifestation of the people’s ‘mental enslavement’" | python run_tagger.py english-morphium-wsj-140407.tagger

The following error pops up:

Traceback (most recent call last):
  File "run_tagger.py", line 81, in <module>
    encode_entities(lemma.lemma),
UnicodeDecodeError: 'utf8' codec can't decode bytes in position 6-7: unexpected end of data

The exception is raised on hitting the word people followed by the ’ character (right single quotation mark). It seems that the string people’s is truncated in the middle of the multi-byte UTF-8 sequence for the quote, which is \xe2\x80\x99.
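
The failure mode can be reproduced in isolation by cutting the encoded string inside that three-byte sequence, independently of MorphoDiTa:

text = u"people’s"
data = text.encode("utf-8")  # b'people' + b'\xe2\x80\x99' + b's'
data[:7].decode("utf-8")     # UnicodeDecodeError: unexpected end of data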

The taggers for Czech seem to work fine, at least they do not fail on the quotes.

I am using Python 2.7 and built the code and bindings from source on Ubuntu 12.04 with appropriate versions of g++/SWIG.

Dictionary fails to load with perl bindings

When trying to load a custom dictionary using the Perl bindings, ...

my $morpho = Morpho::load($path);

... the attempt fails with the following error:

perl: morpho/morpho_dictionary.h:109: void ufal::morphodita::morpho_dictionary<LemmaAddinfo>::load(ufal::morphodita::binary_decoder&) [with LemmaAddinfo = ufal::morphodita::czech_lemma_addinfo]: Assertion `lemma_offset < (1<<24) && lemma_len < (1<<8)' failed.
[1]    179285 abort (core dumped)

The dictionary was encoded using MorphoDiTa v1.3.0; the bindings are installed from CPAN, and I've also tried compiling them manually, but the result is the same. run_morpho_analyze loads the dictionary without problems.

I'd be grateful for any possible pointers as to what might be the problem! :)

Tokenizing URLs redux

URLs now allow non-ASCII characters (as discussed in #4 and fixed in f4d691a, thanks!), but a different problem has appeared -- the http:// prefix is split into separate tokens (as of current master, 5bb38a9):

$ echo 'Na adrese http://www.karaoketexty.cz/plíhal je dostupný...' | ./run_tokenizer --tokenizer czech --output vertical
Na
adrese
http
:
/
/
www.karaoketexty.cz/plíhal
je
dostupný
.
.
.

Perhaps this is in the process of being addressed, in which case don't mind me :)

Request - error reporting on model load failure

Me one more time; if nothing else, this exhausts the backlog of issues I already have :)
Also, this one is not a bug but rather a feature/improvement request.

I understand that loading a model may fail (for example because the user provides a path which does not lead to a valid model file) and that it is the user's responsibility to check whether this happened.
However, I wonder whether an exception could be raised in such a case. To me, this feels less error-prone and would make it easier to locate the error (especially if the exception gave some details, such as 'file not found' or 'file is not a valid model').
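
In the meantime, a thin wrapper is an easy workaround; a minimal sketch, assuming load returns None on failure (which is how the SWIG bindings surface a C++ null pointer):

from ufal.morphodita import Tagger

def load_tagger(path):
    """Load a tagger model, raising instead of silently returning None."""
    tagger = Tagger.load(path)
    if tagger is None:
        raise RuntimeError("Cannot load tagger model from %r" % path)
    return tagger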

Derivation formatter is broken (segfault / no-op)

When I try to retrieve derivations using the run_morpho_analyze CLI tool, I get a segfault. E.g. with this command:

echo přijít | ./run_morpho_analyze --from_tagger --derivation=root czech-morfflex-pdt-161115.tagger 1

When I try to do so through the Python bindings (assuming the script below uses them correctly...), there's no segfault, but there's no derivation either; it's basically a no-op:

from ufal.morphodita import Tagger, DerivationFormatter

# Should rewrite each lemma to the root of its derivation tree
df = DerivationFormatter.newRootDerivationFormatter(
    Tagger.load("czech-morfflex-pdt-161115.tagger").getMorpho().getDerivator()
)
print(df.formatDerivation("přijít"))

This just prints přijít. (If that's the expected output, then derivation probably means something different from what I think it does ;) )

Endpoint `tag` returns `spaces` in result_object

The tag endpoint returns a result_object with a spaces key instead of space. This is not mentioned in the documentation: https://lindat.mff.cuni.cz/services/morphodita/api-reference.php#tag

URL to reproduce: http://lindat.mff.cuni.cz/services/morphodita/api/tag?output=json&model=czech-morfflex&data=.4.+&guesser=no

Response

{
  "model": "czech-morfflex-pdt-161115",
  "acknowledgements": [
    "http://ufal.mff.cuni.cz/morphodita#morphodita_acknowledgements",
    "http://ufal.mff.cuni.cz/morphodita/users-manual#czech-morfflex-pdt_acknowledgements"
  ],
  "result": [
    [
      { "token": ".", "lemma": ".", "tag": "Z:-------------" },
      { "token": "4", "lemma": "4", "tag": "C=-------------" },
      { "token": ".", "lemma": ".", "tag": "Z:-------------", "spaces": " " }
    ]
  ]
}
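
For the record, the same can be reproduced from a script; a sketch using the third-party requests library:

import requests

resp = requests.get(
    "http://lindat.mff.cuni.cz/services/morphodita/api/tag",
    params={"output": "json", "model": "czech-morfflex", "data": ".4. ", "guesser": "no"},
)
for sentence in resp.json()["result"]:
    for token in sentence:
        print(sorted(token))  # the final token carries the undocumented 'spaces' key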

Feature request

Dear Colleague,

I would quite appreciate also having a slightly modified MorphoDiTa input mode: vertical with XML tags for documents, paragraphs, sentences, etc. For example:

<doc title="Hamlet">
<p>
<s>
To
be
or 
not
to
be
<g/>
,
that
is
a
question
<g/>
.
</s>
</p>
</doc>

Within this mode, the tagger would simply ignore all XML markup; the <s> tags could also be used as end-of-sentence markers to improve the accuracy around punctuation.

I believe that this would not be difficult to implement :-)
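
Until such a mode exists, stripping the markup before feeding the vertical input to the tagger is a workable approximation. A sketch, where the blank-line sentence boundary mirrors MorphoDiTa's vertical format and treating every lone tag line as markup is an assumption:

import re

TAG_LINE = re.compile(r"^<[^<>]+>$")

def strip_xml_vertical(lines):
    """Drop XML tag lines from vertical input; end a sentence at </s>."""
    for line in lines:
        token = line.strip()
        if TAG_LINE.match(token):
            if token == "</s>":
                yield ""  # blank line = sentence boundary in vertical format
        else:
            yield token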

Best,

Vlado B.

Tagset converter + derivator interference

Running with the raw_lemma tagset converter and a non-trivial derivator first strips the lemma and only then runs the derivator -- so the derivator gets the wrong input, and its output is not a raw lemma.

Tokenizing URLs

All tokenizers bundled with MorphoDiTa split URLs at the first occurrence of a character outside the ASCII range:

$ for tok in czech english generic; do echo 'http://cs.wikipedia.org/wiki/Islámské_bankovnictví' | run_tokenizer --tokenizer=$tok --output=vertical; done
http://cs.wikipedia.org/wiki/Isl
ámské
_
bankovnictví

http://cs.wikipedia.org/wiki/Isl
ámské
_
bankovnictví

http://cs.wikipedia.org/wiki/Isl
ámské
_
bankovnictví

Nowadays though, URLs often contain a wide range of Unicode characters (as shown above). Would you please consider altering the behavior of the tokenizers in this respect? :)

P.S. I've checked to see whether this is somehow affected by locale settings and it doesn't seem to be.

Segfault when used from Python alongside another SWIG-generated lib

I've stumbled upon a segfault when using MorphoDiTa from Python alongside Manatee (another SWIG-generated library). Specifically, if I import Manatee after MorphoDiTa, then MorphoDiTa segfaults when calling tokenizer.nextSentence. Here's a reproducible example with more details (using Docker): https://github.com/dlukes/morphodita-manatee-python-segfault.

I have also reported this issue to the developers of Manatee on their mailing list as it's unclear to me at this point what the root cause is -- it may be that MorphoDiTa/NameTag doesn't isolate itself well enough from other extension modules, or it may be that Manatee oversteps its boundaries. Given that loading e.g. both MorphoDiTa and NameTag into the same Python interpreter works fine, I'm somewhat leaning towards the latter though.

I realize you may not have bandwidth to look at this since it's a fairly niche issue, and it can be worked around by not having all of these at once in the same Python process. However, if the root cause is a bug lurking somewhere in MorphoDiTa's SWIG interface definition, it may have other unwanted consequences, so I deemed it worth reporting anyway.

Namespace issue when importing Perl 5 MorphoDiTa bindings

I've noticed strange namespace behavior in Perl 5 when importing the SWIG-generated MorphoDiTa bindings into a source file which has a package declaration of its own. I did a write-up of the issue on StackOverflow; please see there for details.

I realize this may be due to SWIG/Perl and have reported this on SWIG's GitHub repo, but I thought I'd let you know as well :)

Behavior confirmed using the current release MorphoDiTa 1.3.0 on both Linux and OS X.
