ufal / morphodita
MorphoDiTa: Morphological Dictionary and Tagger
License: Mozilla Public License 2.0
Dear Colleague,
I would quite appreciate an additional, slightly modified MorphoDiTa input mode: vertical input with XML tags for documents, paragraphs, sentences, etc. For example:
<doc title="Hamlet">
<p>
<s>
To
be
or
not
to
be
<g/>
,
that
is
a
question
<g/>
.
</s>
</p>
</doc>
In this mode, the tagger would simply ignore all XML markup; it could also use the <s>
tags as end-of-sentence markers to improve accuracy around punctuation.
I believe that this would not be difficult to implement :-)
Best,
Vlado B, 19:45
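The requested mode could also be approximated in client code before handing tokens to the tagger. A minimal sketch (my own illustration, not part of MorphoDiTa) that ignores XML markup in vertical input and treats </s> as the sentence boundary:

```python
import re

XML_TAG = re.compile(r"^<[^>]+>$")  # a line consisting solely of an XML tag

def sentences(lines):
    """Yield sentences (lists of tokens) from vertical input,
    skipping XML markup and using </s> as the end-of-sentence marker."""
    sentence = []
    for line in lines:
        line = line.strip()
        if XML_TAG.match(line):
            if line == "</s>" and sentence:
                yield sentence
                sentence = []
        elif line:
            sentence.append(line)
    if sentence:  # input ended without a closing </s>
        yield sentence
```

Each yielded sentence could then be passed to the tagger while the markup is echoed through to the output unchanged.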
I've noticed strange namespace behavior in Perl 5, when importing the SWIG-generated MorphoDiTa bindings into a source file which has a package declaration of its own. I did a write-up of the issue on StackOverflow, please see there for details.
I realize this may be due to SWIG/Perl and have reported this on SWIG's GitHub repo, but I thought I'd let you know as well :)
Behavior confirmed using the current release MorphoDiTa 1.3.0 on both Linux and OS X.
Running with the tagset converter raw_lemma
and a non-trivial derivator first strips the lemma and only then runs the derivator, so the derivator gets the wrong input and its output is not a raw lemma.
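The ordering problem can be illustrated with toy stand-ins (the table, suffix, and functions below are hypothetical, not the real API): deriving on an already-stripped lemma misses table entries keyed by full lemmas.

```python
# Hypothetical derivation table keyed by full (unstripped) lemmas.
DERIVATIONS = {"přijít_^(example)": "jít"}

def strip_lemma(lemma):
    """Drop the technical suffix, as the raw_lemma converter does (simplified)."""
    return lemma.split("_")[0]

def strip_then_derive(lemma):
    """The reported (buggy) order: strip first, then derive -- the lookup misses."""
    raw = strip_lemma(lemma)
    return DERIVATIONS.get(raw, raw)

def derive_then_strip(lemma):
    """The expected order: derive on the full lemma, then strip."""
    return strip_lemma(DERIVATIONS.get(lemma, lemma))
```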
I tried to use both morphodita and nametag at the same time, but a strange exception is raised when I import nametag into the project:
from ufal.morphodita import Tagger, Forms, TokenRanges
import ufal.nametag # this line causes the problem
forms = Forms()
tokens = TokenRanges()
tagger = Tagger.load(path_to_model)
tokenizer = tagger.newTokenizer()
tokenizer.setText(text)
while tokenizer.nextSentence(forms, tokens):
    pass
Result:
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
TypeError: in method 'Tokenizer_nextSentence', argument 2 of type 'std::vector< std::string,std::allocator< std::string > > *'
If I switch the order of the imports, the exception is not raised in the above example, but if I try to use nametag a different exception is raised.
Currently, MorphoDiTa tokenizers split e.g. 🇵🇱 into 🇵 and 🇱. It would be nice if they didn't :)
(Cf. e.g. here for some background.)
I have encountered the following issue when I tested the example code for python bindings:
echo "manifestation of the people’s ‘mental enslavement’" | python run_tagger.py english-morphium-wsj-140407.tagger
The following error pops up:
Traceback (most recent call last):
File "run_tagger.py", line 81, in <module>
encode_entities(lemma.lemma),
UnicodeDecodeError: 'utf8' codec can't decode bytes in position 6-7: unexpected end of data
The exception is raised on hitting the word people’s,
which ends with the ’ (right single quotation mark, U+2019). It seems the string people’s
is truncated in the middle of the multibyte UTF-8 sequence for the quote, which is \xe2\x80\x99.
The taggers for Czech seem to work fine, at least they do not fail on the quotes.
I am using Python 2.7 and built the code and bindings from source on Ubuntu 12.04 with the proper versions of g++/SWIG.
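The failure mode can be reproduced without MorphoDiTa at all (shown in Python 3 for brevity; Python 2's str/unicode split behaves analogously): cutting the UTF-8 bytes of people’s inside the three-byte sequence for ’ raises exactly this error.

```python
data = "people’s".encode("utf-8")
# the quote occupies bytes 6-8: \xe2\x80\x99
assert data[6:9] == b"\xe2\x80\x99"

try:
    data[:8].decode("utf-8")  # truncated after \xe2\x80, mid-sequence
except UnicodeDecodeError as exc:
    print(exc.reason)  # unexpected end of data
```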
I've stumbled upon a segfault when using MorphoDiTa from Python alongside Manatee (another SWIG-generated library). Specifically, if I import Manatee after MorphoDiTa, then MorphoDiTa segfaults when calling tokenizer.nextSentence
. Here's a reproducible example with more details (using Docker): https://github.com/dlukes/morphodita-manatee-python-segfault.
I have also reported this issue to the developers of Manatee on their mailing list as it's unclear to me at this point what the root cause is -- it may be that MorphoDiTa/NameTag doesn't isolate itself well enough from other extension modules, or it may be that Manatee oversteps its boundaries. Given that loading e.g. both MorphoDiTa and NameTag into the same Python interpreter works fine, I'm somewhat leaning towards the latter though.
I realize you may not have bandwidth to look at this since it's a fairly niche issue, and it can be worked around by not having all of these at once in the same Python process. However, if the root cause is a bug lurking somewhere in MorphoDiTa's SWIG interface definition, it may have other unwanted consequences, so I deemed it worth reporting anyway.
When trying to load a custom dictionary using the Perl bindings, ...
my $morpho = Morpho::load($path);
... the attempt fails with the following error:
perl: morpho/morpho_dictionary.h:109: void ufal::morphodita::morpho_dictionary<LemmaAddinfo>::load(ufal::morphodita::binary_decoder&) [with LemmaAddinfo = ufal::morphodita::czech_lemma_addinfo]: Assertion `lemma_offset < (1<<24) && lemma_len < (1<<8)' failed.
[1] 179285 abort (core dumped)
The dictionary was encoded using MorphoDiTa v1.3.0. The bindings are installed from CPAN, and I've also tried compiling them manually, but the result is the same. run_morpho_analyze
loads the dictionary without problems.
I'd be grateful for any possible pointers as to what might be the problem! :)
The tag endpoint returns a result_object with spaces
instead of space.
This is not mentioned in the docs: https://lindat.mff.cuni.cz/services/morphodita/api-reference.php#tag
URL to reproduce: http://lindat.mff.cuni.cz/services/morphodita/api/tag?output=json&model=czech-morfflex&data=.4.+&guesser=no
Response:
{
  "model": "czech-morfflex-pdt-161115",
  "acknowledgements": [
    "http://ufal.mff.cuni.cz/morphodita#morphodita_acknowledgements",
    "http://ufal.mff.cuni.cz/morphodita/users-manual#czech-morfflex-pdt_acknowledgements"
  ],
  "result": [
    [
      { "token": ".", "lemma": ".", "tag": "Z:-------------" },
      { "token": "4", "lemma": "4", "tag": "C=-------------" },
      { "token": ".", "lemma": ".", "tag": "Z:-------------", "spaces": " " }
    ]
  ]
}
I noticed that the tagger with the guesser enabled is sometimes very slow. For example, the 100-token-long sentence below took the Czech tagger about 3 s on my laptop, but only 55 ms without the guesser.
Could the performance of the guesser be improved?
(Sorry, I know it's actually Slovak, but sometimes the data is not clean...)
Imunogény pozostávajú z obalových polypeptidov E vírusov s mol. hmot. cca 57 000 hm. j., s nasledujúcim sledom aminokyselín (KE): SRCTHLENRD FVTGTQGTTR VTL VLELGGC VTITAEGKPS MDVWLDATYQ ENPAKTREYC LHAKLSDTKV AARCPT MGPA TLAEEHQGGT VKVEPHTGDY VAANETHSGR KTASFTISSE KTTLTMGEYG DVSL LCRVAS GVDLAQTVIL ELDKTVEHLP TAWQVHRDWF NDLALPWHKE GAQNWNNA ER LVEFGAPHAV KMDVYNLGDQ TGVLLKALAG VPVAHIEGTK YHLKSGHVTC EVGLEKLKMK GLTYTMCDKT KFTWKRAPTD SGHDTVVMEV TFSGTKPCRI PVRA VAHGSP DVNVAMLITP NPTIENNGGG FIEMQLPPGD NIIYVGELSH QWFOKGSS TG RVFQKTKKGI ERLTVIGEHA WDFGSAGGFL SSIGKAVHTV LGGAFNSIFG GVGFLPKLLL GVALAWLGLN MRNPTMSMSF LLAGGLVLAM GLGVGA, ktoré sú komplexne viazané svojími hydrofóbnymi C-koncami s bakteriálnymi proteozónami.
The compilation error is:
morphodita/morphodita_python.cpp:3321:3: error: use of undeclared identifier '_PyObject_GC_UNTRACK'
_PyObject_GC_UNTRACK(descr);
^
I guess this is caused by the fact that versions of SWIG prior to August 2019 did not support Python 3.8 (cf. http://www.swig.org/)? If so, would you please consider doing a new release of the Python bindings on PyPI using this new version of SWIG?
Hi,
is it possible to create a more detailed example of how to use this model/module with Python? I am not able to fully understand how to work with tagger = Tagger.load(sys.argv[1])
. Can you provide a more detailed example? Thanks.
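In the meantime, a slightly fuller sketch may help; it is condensed from the run_tagger.py example shipped with the bindings (the model path and input text are placeholders, and the model file must be downloaded separately):

```python
import sys

try:
    from ufal.morphodita import Tagger, Forms, TaggedLemmas, TokenRanges
except ImportError:
    sys.exit("install the bindings first: pip install ufal.morphodita")

# Path to a downloaded model, e.g. english-morphium-wsj-140407.tagger.
tagger = Tagger.load(sys.argv[1])
if not tagger:
    sys.exit("cannot load tagger from %r" % sys.argv[1])

forms, lemmas, tokens = Forms(), TaggedLemmas(), TokenRanges()
tokenizer = tagger.newTokenizer()

tokenizer.setText("To be or not to be, that is the question.")
while tokenizer.nextSentence(forms, tokens):
    tagger.tag(forms, lemmas)          # fills lemmas, one per form
    for i in range(len(lemmas)):
        print(forms[i], lemmas[i].lemma, lemmas[i].tag, sep="\t")
```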
All tokenizers bundled with MorphoDiTa split URLs at the first occurrence of a character outside the ASCII range:
$ for tok in czech english generic; do echo 'http://cs.wikipedia.org/wiki/Islámské_bankovnictví' | run_tokenizer --tokenizer=$tok --output=vertical; done
http://cs.wikipedia.org/wiki/Isl
ámské
_
bankovnictví
http://cs.wikipedia.org/wiki/Isl
ámské
_
bankovnictví
http://cs.wikipedia.org/wiki/Isl
ámské
_
bankovnictví
Nowadays though, URLs often contain a wide range of Unicode characters (as shown above). Would you please consider altering the behavior of the tokenizers in this respect? :)
P.S. I've checked to see whether this is somehow affected by locale settings and it doesn't seem to be.
URLs now allow non-ASCII characters (as discussed in #4 and fixed in f4d691a, thanks!), but a different problem has appeared -- the http://
prefix is split into separate tokens (as of current master, 5bb38a9):
$ echo 'Na adrese http://www.karaoketexty.cz/plíhal je dostupný...' | ./run_tokenizer --tokenizer czech --output vertical
Na
adrese
http
:
/
/
www.karaoketexty.cz/plíhal
je
dostupný
.
.
.
Perhaps this is in the process of being addressed, in which case don't mind me :)
I've found out that morphological analysis might return an empty lemma for some forms. Is this a weird feature, or is it a bug? For me, it was completely unexpected and I could not find any note on it even a posteriori.
An example form set which, using Czech Morfflex PDT from 15th Nov 2016, demonstrates this is ['Řekni', 'I.', 'ovi', '.']. The resulting lemmata are ['říci', 'i.', '', '.']. All tokens have a reasonable tag, though.
I use the Python bindings with MorphoDiTa installed via pip, version 1.9.2.1. When using MorphoDiTa as a Lindat service, it gets more interesting. When given 'Řekni I. ovi.' as plain text, it splits 'I.' into two tokens, but 'ovi' again gets no lemma. Giving it in vertical format leads to getting a lemma even for the 'ovi' part, though. However, if the sentence turns into 'Daří se I. ovi.' and is provided in vertical format, the empty-lemma problem arises again.
(Actually, I would agree that the text does not seem to be tokenized correctly and that the middle two forms should come together; however, I've got hundreds of similar sentences in my corpus (ČNK syn v4, if interested), and as the corpus comes pre-tokenized, it seems better to go with the existing tokenization, no matter how imperfect it might be. Anyway, as I've already said, the behaviour was completely unexpected for me and introduced some bugs into my data.)
Following #20:
http://lindat.mff.cuni.cz/services/morphodita/api/tag?output=json&model=czech-morfflex&data=.4.+&guesser=no returns a spaces
parameter, while
https://lindat.mff.cuni.cz/services/morphodita/api/tag?output=json&model=czech-morfflex&data=4+5&guesser=no returns a space
parameter.
I see the issue in the implementation, not in the docs.
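Until the inconsistency is resolved on the server, it is easy to paper over client-side. A sketch of a helper (my own workaround, not part of the API) that renames the stray spaces key:

```python
def normalize_token(token):
    """Return a copy of a result token dict with 'spaces' renamed to 'space'."""
    token = dict(token)
    if "spaces" in token and "space" not in token:
        token["space"] = token.pop("spaces")
    return token
```

Applying it to every token dict in the JSON result yields a uniform key regardless of which variant the endpoint produced.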
The Python SWIG invocation uses -builtin
, presumably for performance reasons? That means that no wrapper classes are generated in morphodita.py
, which in turn means that static analysis tools can't introspect MorphoDiTa's public API without importing the native library, which they often don't do by default (Pylint) or at all (Pyright is a notable and popular example).
Would you consider adding a type stub file to remedy this? I took a stab at creating one (because I finally got fed up with Pyright complaining that Tagger is not a known member of module
, or missing out on other diagnostics when silencing the complaint with # type: ignore
:) ) which you can freely use to get started if you'd like: https://github.com/dlukes/dotfiles/blob/master/python/typings/ufal/morphodita.pyi
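For reference, a fragment of such a .pyi stub might look like the following (the signatures are my guesses based on the C++ API and should be checked against the actual bindings):

```python
# ufal/morphodita.pyi -- fragment, hand-written guesses, not generated
from typing import Optional

class Forms: ...
class TokenRanges: ...

class Tokenizer:
    def setText(self, text: str) -> None: ...
    def nextSentence(self, forms: Forms, tokens: TokenRanges) -> bool: ...

class Tagger:
    @staticmethod
    def load(fname: str) -> Optional["Tagger"]: ...
    def newTokenizer(self) -> Optional[Tokenizer]: ...
```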
Me again. Some time ago, I got quite confused by the API reference at ÚFAL, which, regarding parts of lemma structure, states:
These parts are stored in one string and the boundaries between them can be determined by morpho::raw_lemma_len and morpho::lemma_id_len methods.
Given that on PyPI, it says
The bindings is a straightforward conversion of the C++ bindings API.
I had a hard time finding out why there's no lemma_id_len function on Morpho.
I have to admit that, especially on PyPI, the C++ API is shown as providing the rawLemma and lemmaId functions, and the same can be seen in the 'C++ Bindings API' section of the ÚFAL API reference, but I still find this quite misleading.
Perhaps there would be a way to clarify this?
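To illustrate what the C++ raw_lemma_len / lemma_id_len methods compute (exposed in the bindings as rawLemma and lemmaId), here is a rough pure-Python approximation of the PDT lemma conventions. This is my own simplified reading of the format (real lemmas can contain hyphens and other complications), so the actual methods should always be preferred:

```python
def lemma_id(lemma):
    """Lemma id = raw lemma plus a possible numeric suffix;
    the technical comments start at the first '_' and are cut off."""
    i = lemma.find("_")
    return lemma if i < 0 else lemma[:i]

def raw_lemma(lemma):
    """Raw lemma = lemma id without the disambiguating '-N' number."""
    lid = lemma_id(lemma)
    i = lid.find("-", 1)
    return lid if i < 0 else lid[:i]
```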
Sorry for being so hung up on tokenization, but I think I've come across another minor bug. Unfortunately, it seems to be related to the sentence-splitting of extremely long sentences, so I haven't found a nice and clean way to reproduce it other than using this monster as input. Without further ado:
$ wget https://gist.githubusercontent.com/dlukes/ad3f99a079641855f028fa8ba1ea68e2/raw/63b77c35010e860474dc277ed9f90a59381ca90a/sample.txt
Now look for the following pattern, which occurs only once in the file:
$ grep -o '(21:47 DEedhCjlyt :0 YQTxWzFwPQ DEedhCjlyt 8) YQTxWzFwPQ 49)' sample.txt
(21:47 DEedhCjlyt :0 YQTxWzFwPQ DEedhCjlyt 8) YQTxWzFwPQ 49)
Notice that there is :0
roughly at the end of the first quarter of the substring. Now if I tokenize the file using MorphoDiTa d9e7b78 and paste the tokens back together, the 0
is gone:
$ /path/to/md/master/run_tokenizer --tokenizer=czech --output=vertical sample.txt | tail -n+2837 | head -n14 | paste -sd" "
( 21 : 47 DEedhCjlyt : YQTxWzFwPQ DEedhCjlyt 8 ) YQTxWzFwPQ 49 )
Whereas if I do the same thing using MorphoDiTa stable (1.3), it is preserved:
$ /path/to/md/stable/run_tokenizer --tokenizer=czech --output=vertical sample.txt | tail -n+2838 | head -n14 | paste -sd" "
( 21 : 47 DEedhCjlyt : 0 YQTxWzFwPQ DEedhCjlyt 8 ) YQTxWzFwPQ 49
Admittedly, this is very unusual input, but more "conventional" long sentences will probably be affected by this as well, won't they?
When I try to retrieve derivations using the run_morpho_analyze
CLI tool, I get a segfault. E.g. with this command:
echo přijít | ./run_morpho_analyze --from_tagger --derivation=root czech-morfflex-pdt-161115.tagger 1
When I try to do so through the Python bindings (assuming the script below uses them correctly...), there's no segfault, but there's no derivation either, it's basically a no-op:
from ufal.morphodita import *
df = DerivationFormatter.newRootDerivationFormatter(
    Tagger.load("czech-morfflex-pdt-161115.tagger").getMorpho().getDerivator()
)
print(df.formatDerivation("přijít"))
This just prints přijít
. (If that's the expected output, then derivation probably means something else than I think it does ;) )
Me one more time, but if nothing else, with this I've run out of issues I already have :)
Also, this one is not a bug, but rather a feature/improvement request.
I understand that model loading may fail (for example because the user provides a path which does not lead to a valid model file) and that it is the user's responsibility to check that this didn't happen.
However, I wonder whether an exception could be raised in such a case. To me, this feels less error-prone and would make it easier to locate the error (especially if the exception gave some details, like 'file not found' or 'file is not a valid model').
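In the meantime, the exception can be layered on in user code. A small generic wrapper (my own sketch, usable with any of the bindings' load functions, which return None on failure):

```python
import os

def load_or_raise(loader, path):
    """Call a MorphoDiTa load function and raise instead of returning None."""
    if not os.path.exists(path):
        raise FileNotFoundError("model file not found: %r" % path)
    model = loader(path)
    if model is None:
        raise ValueError("file is not a valid model: %r" % path)
    return model

# usage sketch: tagger = load_or_raise(Tagger.load, "czech-morfflex-pdt.tagger")
```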