
nametag's Issues

Python bindings don't work with Python 3.8+

When I tried to install ufal.nametag I got the following error with Python 3.8 (with Python 3.7 it works fine).

    Running setup.py install for ufal.nametag: started
    Running setup.py install for ufal.nametag: finished with status 'error'
    ERROR: Command errored out with exit status 1:
     command: /srv/venv/bin/python3 -u -c 'import sys, setuptools, tokenize; sys.argv[0] = '"'"'/tmp/pip-install-afntpbwc/ufal-nametag/setup.py'"'"'; __file__='"'"'/tmp/pip-install-afntpbwc/ufal-nametag/setup.py'"'"';f=getattr(tokenize, '"'"'open'"'"', open)(__file__);code=f.read().replace('"'"'\r\n'"'"', '"'"'\n'"'"');f.close();exec(compile(code, __file__, '"'"'exec'"'"'))' install --record /tmp/pip-record-bzanm4ur/install-record.txt --single-version-externally-managed --compile --install-headers /srv/venv/include/site/python3.8/ufal.nametag
         cwd: /tmp/pip-install-afntpbwc/ufal-nametag/
    Complete output (23 lines):
    running install
    running build
    running build_py
    creating build
    creating build/lib.linux-x86_64-3.8
    creating build/lib.linux-x86_64-3.8/ufal
    copying ufal/__init__.py -> build/lib.linux-x86_64-3.8/ufal
    copying ufal/nametag.py -> build/lib.linux-x86_64-3.8/ufal
    running build_ext
    building 'ufal_nametag' extension
    creating build/temp.linux-x86_64-3.8
    creating build/temp.linux-x86_64-3.8/nametag
    gcc -pthread -Wno-unused-result -Wsign-compare -DNDEBUG -g -fwrapv -O3 -Wall -fPIC -Inametag/include -I/srv/venv/include -I/usr/local/include/python3.8 -c nametag/nametag.cpp -o build/temp.linux-x86_64-3.8/nametag/nametag.o -std=c++11 -fvisibility=hidden -w
    gcc -pthread -Wno-unused-result -Wsign-compare -DNDEBUG -g -fwrapv -O3 -Wall -fPIC -Inametag/include -I/srv/venv/include -I/usr/local/include/python3.8 -c nametag/nametag_python.cpp -o build/temp.linux-x86_64-3.8/nametag/nametag_python.o -std=c++11 -fvisibility=hidden -w
    nametag/nametag_python.cpp: In function ‘void SwigPyStaticVar_dealloc(PyDescrObject*)’:
    nametag/nametag_python.cpp:3321:3: error: ‘_PyObject_GC_UNTRACK’ was not declared in this scope
       _PyObject_GC_UNTRACK(descr);
       ^~~~~~~~~~~~~~~~~~~~
    nametag/nametag_python.cpp:3321:3: note: suggested alternative: ‘PyObject_GC_UnTrack’
       _PyObject_GC_UNTRACK(descr);
       ^~~~~~~~~~~~~~~~~~~~
       PyObject_GC_UnTrack
    error: command 'gcc' failed with exit status 1
    ----------------------------------------

I found a similar issue in UDPipe that is probably solved by commit 86f1c8c171. Maybe it could help ;-)

Integrate with CLARIN LR Switchboard

Integrate NameTag in https://switchboard.clarin.eu similarly to how UDPipe has been integrated.

-----Original message-----
From: Ondrej Kosarko

I have one more question about NameTag. We would need to add it to the Switchboard.
The Switchboard does not really use the REST API; it does a redirect with pre-filled parameters. The end user is a human, not a machine, so they get some visualization.
In UDPipe this works, e.g. https://lindat.mff.cuni.cz/services/udpipe/?data=Test+sentence&model=eng, and I think data can also be a URL of a file with the text.

Is there something like that in NameTag as well? How long would it take to port it over from the UDPipe code?
By the way, is it even possible to select a model in NameTag by an ISO language code?

There is nothing like that in NameTag yet. That support was actually written by Josef;
it is in https://github.com/ufal/udpipe/blob/master/web/lindat-service/fill-using-params.js
and https://github.com/ufal/udpipe/blob/master/web/lindat-service/run.php#L262-L281 .
Porting it to NameTag should be a relatively straightforward matter
(copy-paste, try it, and possibly nudge it a little if it did not work for some reason).

And yes, a model can (fortunately) be selected by an ISO language code.
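
For illustration, a minimal sketch of how such a pre-filled service URL could be built, mirroring the UDPipe example above; the /services/nametag path and the parameter names are assumptions, since this support does not exist in NameTag yet:

import urllib.parse

# Hypothetical pre-filled NameTag URL, mirroring the UDPipe example above;
# the path and parameter names are assumptions, not a documented API.
base = "https://lindat.mff.cuni.cz/services/nametag/"
params = {"data": "Test sentence", "model": "eng"}  # model selected by ISO language code

print(base + "?" + urllib.parse.urlencode(params))
# https://lindat.mff.cuni.cz/services/nametag/?data=Test+sentence&model=eng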

Wrong token ranges when sentences are in vertical input

NameTag gives a wrong result for vertical input and output when sentences are split with a new line.

left: No double endlines at input
right: double endlines split sentences
[screenshot]

It seems that an overflow happens every thousand tokens. With a 10-times bigger input:
[screenshot]

Enhancement: Accept data from request body

Currently, data has to be sent as part of the URL.
This is not ideal: NameTag comfortably handles large requests (thousands of characters), but it is not standard to use URLs longer than a few thousand characters.

I propose accepting data from the request body instead of, or in addition to, accepting data from the URL.
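
For illustration, a minimal client-side sketch of the proposal, assuming the server accepted the same data field as a form-encoded POST body (that is the proposal here, not the current API):

import urllib.parse
import urllib.request

text = "Some long input text " * 1000  # thousands of characters, too long for a URL

body = urllib.parse.urlencode({"data": text, "output": "vertical"}).encode("utf-8")
request = urllib.request.Request("http://localhost:8000/recognize", data=body)  # data= makes it a POST

with urllib.request.urlopen(request) as response:
    print(response.read().decode("utf-8"))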

Unexpected category in czech-cnec2.0-200831 model

For this sentence:

Dobrý den, dámy a pánové, já bych si dovolil ještě navrhnout jednu změnu v pevném zařazení, a to konkrétně v bodu 68, sněmovní tisk 51, Výroční zprávy a účetní závěrky zdravotních pojišťoven za rok 2012, a to na pátek 14. 2. po bloku třetích čtení.

NameTag returns the unexpected category C (Bibliography container); this category is not defined in https://ufal.mff.cuni.cz/~strakova/cnec2.0/ne-type-hierarchy.pdf

[screenshot]

Why can't two words have the same Brown cluster representation?

When I run train_ner with the BrownClusters feature enabled, I get the following output:

Loading train data: done, 8158 sentences
Loading heldout data: done, 899 sentences
Parsing feature templates: Form '0000000' is present twice in Brown cluster file 'clusters/cs_brown_1000'!
Cannot initialize feature template sentence processor 'BrownClusters' from line 'BrownClusters/2 clusters/cs_brown_1000' of feature templates file!

Why exactly can't the form '0000000' be present twice?
It seems like an unnecessary limitation. As far as I know, all words with the same prefix belong to one cluster, therefore any additional bits after the chosen prefix are irrelevant.
(E.g. with a prefix of length 20, any bits after the 20th bit are irrelevant.)
Am I missing something?
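
To illustrate the reasoning, a minimal sketch of collapsing a cluster file by prefix, assuming the usual wcluster-style layout bit-string<TAB>form<TAB>count (the format NameTag actually expects may differ):

from collections import defaultdict

prefix_length = 20

clusters = defaultdict(set)
with open("clusters/cs_brown_1000", encoding="utf-8") as f:
    for line in f:
        bits, form = line.rstrip("\n").split("\t")[:2]
        # Only the first prefix_length bits identify a cluster; forms whose full
        # paths differ only after the prefix fall into the same cluster anyway.
        clusters[bits[:prefix_length]].add(form)

print(len(clusters), "distinct clusters at prefix length", prefix_length)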

Best regards.
Simon Let

Nametag REST server fails when compiled in debug mode

Reproduce the bug:

  1. Change compilation mode to debug: Edit line 81 in Makefile.builtem from: MODE=normal to MODE=debug

  2. Recompile the server: run make server

  3. Run the server

  4. Send any recognize request to server URL:PORT/recognize?data=whatever

Output:

Successfully started nametag_server on port 8080.
/usr/include/c++/4.9/debug/vector:357:error: attempt to subscript container 
    with out-of-bounds index 0, but container only holds 0 elements.

Objects involved in the operation:
sequence "this" @ 0x0x7f76200091b0 {
  type = NSt7__debug6vectorINS0_IjSaIjEEESaIS2_EEE;
}
Aborted

Best regards,

Simon Let

Duplicate rows in NameTag output

There are duplicated rows for this input:

Směrnice Evropského parlamentu a Rady číslo 98/70 ES o jakosti benzinu a motorové nafty stanoví závazný cíl dosáhnout do roku 2020 6 % snížení emisí skleníkových plynů z pohonných hmot v porovnání s rokem 2010.

[screenshots of the output with duplicated rows]

Memory Leak in Java Binding

When I run the following code in Java (with the -Xmx100m option to limit the heap memory), the process quite quickly consumes more and more RAM: it starts at about 300 MB, reaches 2 GB in less than a minute, and keeps growing...

I'm quite sure that the Java code is OK, so the memory leak has to be in the native C++ code. Note also that the leaked memory does not belong to the Java heap because of the -Xmx100m option.

Tested on Windows, but similar behavior was observed on CentOS Linux as well.

NameTag version 1.1.1

I don't have a C++ toolchain ready, so I didn't test it directly (without Java), but I can add more details if needed.

import cz.cuni.mff.ufal.nametag.*;

public class RunNer {
	public static void main(String[] args) {
		// Load the NER model once.
		Ner ner = Ner.load("target/models/czech-cnec2.0-140304.ner");

		Forms forms = new Forms();
		TokenRanges tokens = new TokenRanges();
		NamedEntities entities = new NamedEntities();
		Tokenizer tokenizer = ner.newTokenizer();

		// Repeatedly tokenize and recognize the same short sentence; native memory
		// keeps growing even though the Java heap is capped by -Xmx100m.
		for (int r = 0; r < 10000000; r++) {
			String text = "Václav Havel byl prezident České Republiky";
			tokenizer.setText(text);
			while (tokenizer.nextSentence(forms, tokens)) {
				ner.recognize(forms, entities);
			}
			if (r % 10000 == 0)
				System.err.println(r);
		}
	}
}

Server returns invalid JSON when there is no data.

This is a different bug than #6.

These two calls give me Invalid JSON:
>>> curl "localhost:8000/recognize?data="
>>> curl "localhost:8000/recognize?output=vertical&data="

{
 "model": "cnec2.ner",
 "acknowledgements": [
  "http://ufal.mff.cuni.cz/nametag#nametag_acknowledgements",
  "http://ufal.mff.cuni.cz/morphodita#morphodita_acknowledgements"
 ],
 "result": }

Feel free to contact me to get more information.

NameTag2 returns code 400 + internal error for specific sentences [nametag2]

For the sentence Kdy slaví svátek Oto, NameTag2 returns status code 400 and the response: An internal error occurred during processing.
This happens on localhost for all output types. Based on the stack trace, the error occurs because wembeddings also throws an error (see below).

Curl to test it: curl --location --request GET 'localhost:8001/recognize?data=Kdy slaví svátek Oto&output=vertical'

On http://lindat.mff.cuni.cz/services/nametag/ it behaves a bit differently: the sentence Kdy slaví svátek Oto works (i.e. returns some result) for all output modes except vertical.
[screenshot]

NameTag2 log:

2022-03-01T14:08:53Z Traceback (most recent call last):
2022-03-01T14:08:53Z   File "nametag2_server.py", line 521, in do_GET
2022-03-01T14:08:53Z     output = model.predict(output)
2022-03-01T14:08:53Z   File "nametag2_server.py", line 174, in predict
2022-03-01T14:08:53Z     self.network.predict("test", dataset, self.args, output, evaluating=False)
2022-03-01T14:08:53Z   File "/srv/nametag/nametag2_network.py", line 387, in predict
2022-03-01T14:08:53Z     batch_dict = dataset.next_batch(args.batch_size, including_charseqs=args.including_charseqs, seq2seq=seq2seq)
2022-03-01T14:08:53Z   File "/srv/nametag/nametag2_dataset.py", line 223, in next_batch
2022-03-01T14:08:53Z     return self._next_batch(batch_perm, including_charseqs, seq2seq)
2022-03-01T14:08:53Z   File "/srv/nametag/nametag2_dataset.py", line 304, in _next_batch
2022-03-01T14:08:53Z     for i, embeddings in enumerate(self._bert.compute_embeddings("bert-base-multilingual-uncased-last4", batch_sentences)):
2022-03-01T14:08:53Z   File "/srv/nametag/wembedding_service/wembeddings/wembeddings.py", line 168, in compute_embeddings
2022-03-01T14:08:53Z     data=json.dumps({"model": model, "sentences": sentences}, ensure_ascii=True).encode("ascii"),
2022-03-01T14:08:53Z   File "/usr/lib/python3.5/urllib/request.py", line 163, in urlopen
2022-03-01T14:08:53Z     return opener.open(url, data, timeout)
2022-03-01T14:08:53Z   File "/usr/lib/python3.5/urllib/request.py", line 472, in open
2022-03-01T14:08:53Z     response = meth(req, response)
2022-03-01T14:08:53Z   File "/usr/lib/python3.5/urllib/request.py", line 582, in http_response
2022-03-01T14:08:53Z     'http', request, response, code, msg, hdrs)
2022-03-01T14:08:53Z   File "/usr/lib/python3.5/urllib/request.py", line 510, in error
2022-03-01T14:08:53Z     return self._call_chain(*args)
2022-03-01T14:08:53Z   File "/usr/lib/python3.5/urllib/request.py", line 444, in _call_chain
2022-03-01T14:08:53Z     result = func(*args)
2022-03-01T14:08:53Z   File "/usr/lib/python3.5/urllib/request.py", line 590, in http_error_default
2022-03-01T14:08:53Z     raise HTTPError(req.full_url, code, msg, hdrs, fp)
2022-03-01T14:08:53Z urllib.error.HTTPError: HTTP Error 400: Bad Request
2022-03-01T14:08:53Z 10.88.0.9 - - [01/Mar/2022 14:08:53] "GET /recognize?data=Kdy%20slav%C3%AD%20sv%C3%A1tek%20Oto&output=vertical HTTP/1.1" 400 -

Wembeddings error:

2022-03-01T14:08:53Z Traceback (most recent call last):
2022-03-01T14:08:53Z   File "/srv/wembeddings/wembeddings/wembeddings_server.py", line 67, in do_POST
2022-03-01T14:08:53Z     sentences_embeddings = request.server._wembeddings.compute_embeddings(model, sentences)
2022-03-01T14:08:53Z   File "/srv/wembeddings/wembeddings/wembeddings.py", line 143, in compute_embeddings
2022-03-01T14:08:53Z     embeddings_with_parts = model.compute_embeddings(np_subwords, np_segments).numpy()
2022-03-01T14:08:53Z   File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/eager/function.py", line 1655, in __call__
2022-03-01T14:08:53Z     return self._call_impl(args, kwargs)
2022-03-01T14:08:53Z   File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/eager/function.py", line 1665, in _call_impl
2022-03-01T14:08:53Z     cancellation_manager)
2022-03-01T14:08:53Z   File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/eager/function.py", line 1745, in _call_with_structured_signature
2022-03-01T14:08:53Z     return self._filtered_call(args, kwargs, cancellation_manager)
2022-03-01T14:08:53Z   File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/eager/function.py", line 1848, in _filtered_call
2022-03-01T14:08:53Z     cancellation_manager=cancellation_manager)
2022-03-01T14:08:53Z   File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/eager/function.py", line 1924, in _call_flat
2022-03-01T14:08:53Z     ctx, args, cancellation_manager=cancellation_manager))
2022-03-01T14:08:53Z   File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/eager/function.py", line 550, in call
2022-03-01T14:08:53Z     ctx=ctx)
2022-03-01T14:08:53Z   File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/eager/execute.py", line 60, in quick_execute
2022-03-01T14:08:53Z     inputs, attrs, num_outputs)
2022-03-01T14:08:53Z tensorflow.python.framework.errors_impl.InvalidArgumentError:  indices[1,3] = -1 is not in [0, 105879)
2022-03-01T14:08:53Z 	 [[node tf_bert_model/bert/embeddings/Gather (defined at usr/local/lib/python3.6/dist-packages/transformers/models/bert/modeling_tf_bert.py:190) ]] [Op:__inference_compute_embeddings_7904]
2022-03-01T14:08:53Z 
2022-03-01T14:08:53Z Errors may have originated from an input operation.
2022-03-01T14:08:53Z Input Source operations connected to node tf_bert_model/bert/embeddings/Gather:
2022-03-01T14:08:53Z  subwords (defined at srv/wembeddings/wembeddings/wembeddings.py:68)
2022-03-01T14:08:53Z 
2022-03-01T14:08:53Z Function call stack:
2022-03-01T14:08:53Z compute_embeddings
2022-03-01T14:08:53Z 

Wembeddings full log (with errors from TensorFlow at the beginning):

2022-03-01T14:07:49Z OpenBLAS WARNING - could not determine the L2 cache size on this system, assuming 256k
2022-03-01T14:07:50Z 2022-03-01 14:07:50.930771: W tensorflow/stream_executor/platform/default/dso_loader.cc:59] Could not load dynamic library 'libcudart.so.10.1'; dlerror: libcudart.so.10.1: cannot open shared object file: No such file or directory
2022-03-01T14:07:50Z 2022-03-01 14:07:50.930932: I tensorflow/stream_executor/cuda/cudart_stub.cc:29] Ignore above cudart dlerror if you do not have a GPU set up on your machine.
2022-03-01T14:07:53Z Starting WEmbeddings server on port 8000.
2022-03-01T14:07:53Z To stop it gracefully, either send SIGINT (Ctrl+C) or SIGUSR1.
2022-03-01T14:08:29Z 2022-03-01 14:08:29.779715: W tensorflow/stream_executor/platform/default/dso_loader.cc:59] Could not load dynamic library 'libcuda.so.1'; dlerror: libcuda.so.1: cannot open shared object file: No such file or directory
2022-03-01T14:08:29Z 2022-03-01 14:08:29.779901: W tensorflow/stream_executor/cuda/cuda_driver.cc:312] failed call to cuInit: UNKNOWN ERROR (303)
2022-03-01T14:08:29Z 2022-03-01 14:08:29.780000: I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:156] kernel driver does not appear to be running on this host (4e281229591c): /proc/driver/nvidia/version does not exist
2022-03-01T14:08:29Z 2022-03-01 14:08:29.780756: I tensorflow/core/platform/cpu_feature_guard.cc:142] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN)to use the following CPU instructions in performance-critical operations:  AVX2 FMA
2022-03-01T14:08:29Z To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
2022-03-01T14:08:29Z 2022-03-01 14:08:29.794271: I tensorflow/core/platform/profile_utils/cpu_utils.cc:104] CPU Frequency: 2591965000 Hz
2022-03-01T14:08:29Z 2022-03-01 14:08:29.795785: I tensorflow/compiler/xla/service/service.cc:168] XLA service 0x7fa6bc155490 initialized for platform Host (this does not guarantee that XLA will be used). Devices:
2022-03-01T14:08:29Z 2022-03-01 14:08:29.796253: I tensorflow/compiler/xla/service/service.cc:176]   StreamExecutor device (0): Host, Default Version
2022-03-01T14:08:42Z Some layers from the model checkpoint at bert-base-multilingual-uncased were not used when initializing TFBertModel: ['mlm___cls', 'nsp___cls']
2022-03-01T14:08:42Z - This IS expected if you are initializing TFBertModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
2022-03-01T14:08:42Z - This IS NOT expected if you are initializing TFBertModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
2022-03-01T14:08:42Z All the layers of TFBertModel were initialized from the model checkpoint at bert-base-multilingual-uncased.
2022-03-01T14:08:42Z If your task is similar to the task the model of the checkpoint was trained on, you can already use TFBertModel for predictions without further training.
2022-03-01T14:08:52Z WARNING:tensorflow:From /usr/local/lib/python3.6/dist-packages/tensorflow/python/util/deprecation.py:574: calling map_fn_v2 (from tensorflow.python.ops.map_fn) with dtype is deprecated and will be removed in a future version.
2022-03-01T14:08:52Z Instructions for updating:
2022-03-01T14:08:52Z Use fn_output_signature instead
2022-03-01T14:08:53Z Traceback (most recent call last):
2022-03-01T14:08:53Z   File "/srv/wembeddings/wembeddings/wembeddings_server.py", line 67, in do_POST
2022-03-01T14:08:53Z     sentences_embeddings = request.server._wembeddings.compute_embeddings(model, sentences)
2022-03-01T14:08:53Z   File "/srv/wembeddings/wembeddings/wembeddings.py", line 143, in compute_embeddings
2022-03-01T14:08:53Z     embeddings_with_parts = model.compute_embeddings(np_subwords, np_segments).numpy()
2022-03-01T14:08:53Z   File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/eager/function.py", line 1655, in __call__
2022-03-01T14:08:53Z     return self._call_impl(args, kwargs)
2022-03-01T14:08:53Z   File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/eager/function.py", line 1665, in _call_impl
2022-03-01T14:08:53Z     cancellation_manager)
2022-03-01T14:08:53Z   File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/eager/function.py", line 1745, in _call_with_structured_signature
2022-03-01T14:08:53Z     return self._filtered_call(args, kwargs, cancellation_manager)
2022-03-01T14:08:53Z   File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/eager/function.py", line 1848, in _filtered_call
2022-03-01T14:08:53Z     cancellation_manager=cancellation_manager)
2022-03-01T14:08:53Z   File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/eager/function.py", line 1924, in _call_flat
2022-03-01T14:08:53Z     ctx, args, cancellation_manager=cancellation_manager))
2022-03-01T14:08:53Z   File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/eager/function.py", line 550, in call
2022-03-01T14:08:53Z     ctx=ctx)
2022-03-01T14:08:53Z   File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/eager/execute.py", line 60, in quick_execute
2022-03-01T14:08:53Z     inputs, attrs, num_outputs)
2022-03-01T14:08:53Z tensorflow.python.framework.errors_impl.InvalidArgumentError:  indices[1,3] = -1 is not in [0, 105879)
2022-03-01T14:08:53Z 	 [[node tf_bert_model/bert/embeddings/Gather (defined at usr/local/lib/python3.6/dist-packages/transformers/models/bert/modeling_tf_bert.py:190) ]] [Op:__inference_compute_embeddings_7904]
2022-03-01T14:08:53Z 
2022-03-01T14:08:53Z Errors may have originated from an input operation.
2022-03-01T14:08:53Z Input Source operations connected to node tf_bert_model/bert/embeddings/Gather:
2022-03-01T14:08:53Z  subwords (defined at srv/wembeddings/wembeddings/wembeddings.py:68)
2022-03-01T14:08:53Z 
2022-03-01T14:08:53Z Function call stack:
2022-03-01T14:08:53Z compute_embeddings
2022-03-01T14:08:53Z 
2022-03-01T14:08:53Z 10.88.0.10 - - [01/Mar/2022 14:08:53] "POST /wembeddings HTTP/1.1" 400 -

My captured payload sent to wembeddings (before encoding to ASCII bytes):
{"model": "bert-base-multilingual-uncased-last4", "sentences": [["Kdy", "slav\u00ed", "sv\u00e1tek"], ["Oto"]]}

My environment for NameTag2:
OS: MacOS 11.6.4
Container engine: Podman 3.4.4
using Dockerfile in branch nametag2 https://github.com/ufal/nametag/blob/nametag2/Dockerfile
Python version in the image: Python 3.5.2 (affected probably 3.5 and lower)

My environment for Wembeddings:
OS: MacOS 11.6.4
Container engine: Podman 3.4.4
using Dockerfile on the master branch https://github.com/ufal/wembedding_service/blob/master/Dockerfile

I discovered two more sentences which fail in the same way with the same error, but on http://lindat.mff.cuni.cz/services/nametag/ they work normally; I'm not sure why.
The sentences are:

  • Chci najít transakci 1133.40 USD
  • Kdy ma svatek Vanda? Co Vratislav

Note: I'm a Czech and we can continue in Czech if you would prefer it that way :-)

Server returns invalid JSON when output is set to "vertical" and no entities were found.

This gives me expected output:
>>> curl "localhost:8000/recognize?output=vertical&data=Vaclav"

{
 "model": "cnec2.ner",
 "acknowledgements": [
  "http://ufal.mff.cuni.cz/nametag#nametag_acknowledgements",
  "http://ufal.mff.cuni.cz/morphodita#morphodita_acknowledgements"
 ],
 "result": "8029473260223733771\t?\tVaclav\n"
}

This gives me Invalid JSON:
>>> curl "localhost:8000/recognize?output=vertical&data=xxxx"

{
 "model": "cnec2.ner",
 "acknowledgements": [
  "http://ufal.mff.cuni.cz/nametag#nametag_acknowledgements",
  "http://ufal.mff.cuni.cz/morphodita#morphodita_acknowledgements"
 ],
 "result": }

Feel free to contact me to get more information.

R wrapper / MorphoDiTa

FYI.
I've built an R wrapper around NameTag, https://github.com/bnosac/nametagger, so that I can easily use it to construct a baseline NER model and compare it to a baseline CRF or other deep-learning approaches which require more computing resources.

While I was doing this, I was wondering: is there an easy way to extract a MorphoDiTa model from a .udpipe file, such that I can use it with the tagger morphodita:model?

Invalid and incorrect JSON responses for some Python runs for Py 3.5 and lower [nametag2]

If using Python 3.5 (e.g. using the Dockerfile), some runs of the NameTag2 REST server return the response JSON with attributes in a different order than "model, acknowledgements, result". When this happens, the response is incorrect and sometimes even invalid JSON (see the examples below).
When I restart the container with the NameTag server (so the NameTag2 server stops and starts), the order of the attributes in the JSON response changes.
If the order is truly random, it means that in 2/3 of cases the response has the wrong order and is incorrect, and in 1/3 of cases it is incorrect AND invalid JSON.

Curl to test it:
curl --location --request GET 'localhost:8001/recognize?data=Ondra&output=vertical'

Result from one run of the NameTag2 server (the correct one):

{
 "acknowledgements": [
  "http://ufal.mff.cuni.cz/nametag/2#acknowledgements",
  "ack-text"
 ],
 "result": "",
 "model": "czech-cnec2.0-2008311\tgu\tOndra\n"
}

Result from another run of the NameTag2 server (the incorrect one):

{
 "acknowledgements": [
  "http://ufal.mff.cuni.cz/nametag/2#acknowledgements",
  "ack-text"
 ],
 "model": "czech-cnec2.0-200831",
 "result": "1\tgu\tOndra\n"
}

Result from yet another run of the NameTag2 server (the incorrect and invalid one):

{
 "result": "",
 "model": "czech-cnec2.0-200831",
 "acknowledgements": [
  "http://ufal.mff.cuni.cz/nametag/2#acknowledgements",
  "ack-text"
 1\tgu\tOndra\n"
}

My environment:
OS: MacOS 11.6.4
Container engine: Podman 3.4
using Dockerfile in branch nametag2 https://github.com/ufal/nametag/blob/nametag2/Dockerfile
Python version in the image: Python 3.5.2 (affected probably 3.5 and lower)

I have also tested this outside of Podman/containers with Python 3.7.9, and I cannot replicate this bug there (further supporting my explanation below).

My understanding of the issue:
Based on https://stackoverflow.com/questions/39980323/are-dictionaries-ordered-in-python-3-6, before Python 3.6 (i.e. 3.5 and lower) dict did not guarantee iteration order. Since 3.6, the Python dict is insertion-ordered, so it keeps the order in which the attributes were added to it.

My non-exhaustive list of possible solutions:

  • So one solution is to use at least Python 3.6 in the Dockerfile and depend on the fact that the insertion order is always kept (note: I have tried using the docker image tensorflow/tensorflow:1.15.0, but it doesn't have curl preinstalled, so other tweaks would have to be made; still, I believe it's a feasible solution).
  • Another is to use collections.OrderedDict so it works even in Python 3.5 or lower: https://docs.python.org/3/library/collections.html#collections.OrderedDict
  • But the most interesting (IMHO) solution is to stop removing the last 3 chars of the resulting JSON (at https://github.com/ufal/nametag/blob/nametag2/nametag2_server.py#L540 ) and just build it whole in one step, which has the benefit of not being confusing (see the sketch after this list).
  • And there are probably other solutions.
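
A minimal sketch of that last option, assembling the whole response with one json.dumps call over an OrderedDict; the field values here are just the ones from the examples above:

import json
from collections import OrderedDict

# OrderedDict keeps the key order even on Python 3.5; on 3.7+ a plain dict would do.
response = OrderedDict([
    ("model", "czech-cnec2.0-200831"),
    ("acknowledgements", [
        "http://ufal.mff.cuni.cz/nametag/2#acknowledgements",
        "ack-text",
    ]),
    ("result", "1\tgu\tOndra\n"),
])

print(json.dumps(response, indent=1, ensure_ascii=False))  # always valid JSON, keys in the intended order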

Note: I'm a Czech and we can continue in Czech if you would prefer it that way :-)
