
nametag's Issues

Python bindings don't work with Python 3.8+

When I tried to install ufal.nametag I got the following error with Python 3.8 (with Python 3.7 it works fine).

    Running setup.py install for ufal.nametag: started
    Running setup.py install for ufal.nametag: finished with status 'error'
    ERROR: Command errored out with exit status 1:
     command: /srv/venv/bin/python3 -u -c 'import sys, setuptools, tokenize; sys.argv[0] = '"'"'/tmp/pip-install-afntpbwc/ufal-nametag/setup.py'"'"'; __file__='"'"'/tmp/pip-install-afntpbwc/ufal-nametag/setup.py'"'"';f=getattr(tokenize, '"'"'open'"'"', open)(__file__);code=f.read().replace('"'"'\r\n'"'"', '"'"'\n'"'"');f.close();exec(compile(code, __file__, '"'"'exec'"'"'))' install --record /tmp/pip-record-bzanm4ur/install-record.txt --single-version-externally-managed --compile --install-headers /srv/venv/include/site/python3.8/ufal.nametag
         cwd: /tmp/pip-install-afntpbwc/ufal-nametag/
    Complete output (23 lines):
    running install
    running build
    running build_py
    creating build
    creating build/lib.linux-x86_64-3.8
    creating build/lib.linux-x86_64-3.8/ufal
    copying ufal/__init__.py -> build/lib.linux-x86_64-3.8/ufal
    copying ufal/nametag.py -> build/lib.linux-x86_64-3.8/ufal
    running build_ext
    building 'ufal_nametag' extension
    creating build/temp.linux-x86_64-3.8
    creating build/temp.linux-x86_64-3.8/nametag
    gcc -pthread -Wno-unused-result -Wsign-compare -DNDEBUG -g -fwrapv -O3 -Wall -fPIC -Inametag/include -I/srv/venv/include -I/usr/local/include/python3.8 -c nametag/nametag.cpp -o build/temp.linux-x86_64-3.8/nametag/nametag.o -std=c++11 -fvisibility=hidden -w
    gcc -pthread -Wno-unused-result -Wsign-compare -DNDEBUG -g -fwrapv -O3 -Wall -fPIC -Inametag/include -I/srv/venv/include -I/usr/local/include/python3.8 -c nametag/nametag_python.cpp -o build/temp.linux-x86_64-3.8/nametag/nametag_python.o -std=c++11 -fvisibility=hidden -w
    nametag/nametag_python.cpp: In function ‘void SwigPyStaticVar_dealloc(PyDescrObject*)’:
    nametag/nametag_python.cpp:3321:3: error: ‘_PyObject_GC_UNTRACK’ was not declared in this scope
       _PyObject_GC_UNTRACK(descr);
       ^~~~~~~~~~~~~~~~~~~~
    nametag/nametag_python.cpp:3321:3: note: suggested alternative: ‘PyObject_GC_UnTrack’
       _PyObject_GC_UNTRACK(descr);
       ^~~~~~~~~~~~~~~~~~~~
       PyObject_GC_UnTrack
    error: command 'gcc' failed with exit status 1
    ----------------------------------------

I found a similar issue in UDPipe that is probably solved by commit 86f1c8c171. Maybe it could help ;-)

Integrate with CLARIN LR Switchboard

Integrate NameTag in https://switchboard.clarin.eu similarly to how UDPipe has been integrated.

-----Original message-----
From: Ondrej Kosarko

I have one more question about NameTag. We would need to add it to the Switchboard.
The Switchboard does not really use the REST API; it does a redirect with pre-filled parameters. The end user is a human, not a machine, so they get some visualization.
In UDPipe this works, e.g. https://lindat.mff.cuni.cz/services/udpipe/?data=Test+sentence&model=eng, and I think data can also be a URL of a file with the text.

Is there something like that in NameTag as well? How long would it take to port it over from the UDPipe code?
By the way, is it even possible to select a model in NameTag by an ISO language code?

There is nothing like that in NameTag yet. That support was actually written by Josef;
it is in https://github.com/ufal/udpipe/blob/master/web/lindat-service/fill-using-params.js
and https://github.com/ufal/udpipe/blob/master/web/lindat-service/run.php#L262-L281 .
Porting it to NameTag should be a relatively straightforward matter
(copy-paste, try it, and possibly nudge it a little if it did not work for some reason).

And yes, a model can (fortunately) be selected by an ISO language code.
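
For illustration, a minimal sketch of how such a pre-filled service URL could be built, mirroring the UDPipe example above; the /services/nametag path and the parameter names are assumptions, since this support does not exist in NameTag yet:

import urllib.parse

# Hypothetical pre-filled NameTag URL, mirroring the UDPipe example above;
# the path and parameter names are assumptions, not a documented API.
base = "https://lindat.mff.cuni.cz/services/nametag/"
params = {"data": "Test sentence", "model": "eng"}  # model selected by ISO language code

print(base + "?" + urllib.parse.urlencode(params))
# https://lindat.mff.cuni.cz/services/nametag/?data=Test+sentence&model=eng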

Wrong token ranges when sentences are in vertical input

NameTag gives a wrong result for vertical input and output when sentences are split with a new line.

left: No double endlines at input
right: double endlines split sentences
[screenshot]

It seems that an overflow happens every thousand tokens. With a 10-times bigger input:
[screenshot]

Enhancement: Accept data from request body

Currently, data has to be sent as part of the URL.
This is not ideal: NameTag comfortably handles large requests (thousands of characters), but it is not standard to use URLs longer than a few thousand characters.

I propose accepting data from the request body instead of, or in addition to, accepting data from the URL.
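
For illustration, a minimal client-side sketch of the proposal, assuming the server accepted the same data field as a form-encoded POST body (that is the proposal here, not the current API):

import urllib.parse
import urllib.request

text = "Some long input text " * 1000  # thousands of characters, too long for a URL

body = urllib.parse.urlencode({"data": text, "output": "vertical"}).encode("utf-8")
request = urllib.request.Request("http://localhost:8000/recognize", data=body)  # data= makes it a POST

with urllib.request.urlopen(request) as response:
    print(response.read().decode("utf-8"))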

Unexpected category in czech-cnec2.0-200831 model

For this sentence:

Dobrý den, dámy a pánové, já bych si dovolil ještě navrhnout jednu změnu v pevném zařazení, a to konkrétně v bodu 68, sněmovní tisk 51, Výroční zprávy a účetní závěrky zdravotních pojišťoven za rok 2012, a to na pátek 14. 2. po bloku třetích čtení.

NameTag returns the unexpected category C (Bibliography container); this category is not defined in https://ufal.mff.cuni.cz/~strakova/cnec2.0/ne-type-hierarchy.pdf

[screenshot]

Why can't two words have the same Brown cluster representation?

When I run train_ner with the BrownClusters feature enabled, I get the following output:

Loading train data: done, 8158 sentences
Loading heldout data: done, 899 sentences
Parsing feature templates: Form '0000000' is present twice in Brown cluster file 'clusters/cs_brown_1000'!
Cannot initialize feature template sentence processor 'BrownClusters' from line 'BrownClusters/2 clusters/cs_brown_1000' of feature templates file!

Why exactly can't the form '0000000' be present twice?
It seems like an unnecessary limitation. As far as I know, all words with the same prefix belong to one cluster, therefore any additional bits after the chosen prefix are irrelevant.
(E.g. with a prefix of length 20, any bits after the 20th bit are irrelevant.)
Am I missing something?
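
To illustrate the reasoning, a minimal sketch of collapsing a cluster file by prefix, assuming the usual wcluster-style layout bit-string<TAB>form<TAB>count (the format NameTag actually expects may differ):

from collections import defaultdict

prefix_length = 20

clusters = defaultdict(set)
with open("clusters/cs_brown_1000", encoding="utf-8") as f:
    for line in f:
        bits, form = line.rstrip("\n").split("\t")[:2]
        # Only the first prefix_length bits identify a cluster; forms whose full
        # paths differ only after the prefix fall into the same cluster anyway.
        clusters[bits[:prefix_length]].add(form)

print(len(clusters), "distinct clusters at prefix length", prefix_length)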

Best regards.
Simon Let

Nametag REST server fails when compiled in debug mode

Reproduce the bug:

  1. Change compilation mode to debug: Edit line 81 in Makefile.builtem from: MODE=normal to MODE=debug

  2. Recompile the server: run make server

  3. Run the server

  4. Send any recognize request to server URL:PORT/recognize?data=whatever

Output:

Successfully started nametag_server on port 8080.
/usr/include/c++/4.9/debug/vector:357:error: attempt to subscript container 
    with out-of-bounds index 0, but container only holds 0 elements.

Objects involved in the operation:
sequence "this" @ 0x0x7f76200091b0 {
  type = NSt7__debug6vectorINS0_IjSaIjEEESaIS2_EEE;
}
Aborted

Best regards,

Simon Let

Duplicate rows in NameTag output

There are duplicated rows for this input:

Směrnice Evropského parlamentu a Rady číslo 98/70 ES o jakosti benzinu a motorové nafty stanoví závazný cíl dosáhnout do roku 2020 6 % snížení emisí skleníkových plynů z pohonných hmot v porovnání s rokem 2010.

[screenshots of the output with duplicated rows]

Memory Leak in Java Binding

When I run the following code in Java (with the -Xmx100m option to limit the heap memory), the process quite quickly consumes more and more RAM: it starts at about 300 MB, reaches 2 GB in less than a minute, and keeps growing...

I'm quite sure that the Java code is OK, so the memory leak has to be in the native C++ code. Note also that the leaked memory does not belong to the Java heap because of the -Xmx100m option.

Tested on Windows, but similar behavior was observed on CentOS Linux as well.

NameTag version 1.1.1

I don't have a C++ toolchain ready, so I didn't test it directly (without Java), but I can add more details if needed.

import cz.cuni.mff.ufal.nametag.*;

public class RunNer {
	public static void main(String[] args) {
		// Load the NER model once.
		Ner ner = Ner.load("target/models/czech-cnec2.0-140304.ner");

		Forms forms = new Forms();
		TokenRanges tokens = new TokenRanges();
		NamedEntities entities = new NamedEntities();
		Tokenizer tokenizer = ner.newTokenizer();

		// Repeatedly tokenize and recognize the same short sentence; native memory
		// keeps growing even though the Java heap is capped by -Xmx100m.
		for (int r = 0; r < 10000000; r++) {
			String text = "Václav Havel byl prezident České Republiky";
			tokenizer.setText(text);
			while (tokenizer.nextSentence(forms, tokens)) {
				ner.recognize(forms, entities);
			}
			if (r % 10000 == 0)
				System.err.println(r);
		}
	}
}

Server returns invalid JSON when there is no data.

This is a different bug than #6.

These two calls give me Invalid JSON:
>>> curl "localhost:8000/recognize?data="
>>> curl "localhost:8000/recognize?output=vertical&data="

{
 "model": "cnec2.ner",
 "acknowledgements": [
  "http://ufal.mff.cuni.cz/nametag#nametag_acknowledgements",
  "http://ufal.mff.cuni.cz/morphodita#morphodita_acknowledgements"
 ],
 "result": }

Feel free to contact me to get more information.

NameTag2 returns code 400 + internal error for specific sentences [nametag2]

For the sentence Kdy slaví svátek Oto, NameTag2 returns status code 400 and the response: An internal error occurred during processing.
This happens on localhost for all output types. Based on the stack trace, the error occurs because wembeddings also throws an error (see below).

Curl to test it: curl --location --request GET 'localhost:8001/recognize?data=Kdy slaví svátek Oto&output=vertical'

On http://lindat.mff.cuni.cz/services/nametag/ it behaves a bit differently: the sentence Kdy slaví svátek Oto works (i.e. returns some result) for all output modes except vertical.
[screenshot]

NameTag2 log:

2022-03-01T14:08:53Z Traceback (most recent call last):
2022-03-01T14:08:53Z   File "nametag2_server.py", line 521, in do_GET
2022-03-01T14:08:53Z     output = model.predict(output)
2022-03-01T14:08:53Z   File "nametag2_server.py", line 174, in predict
2022-03-01T14:08:53Z     self.network.predict("test", dataset, self.args, output, evaluating=False)
2022-03-01T14:08:53Z   File "/srv/nametag/nametag2_network.py", line 387, in predict
2022-03-01T14:08:53Z     batch_dict = dataset.next_batch(args.batch_size, including_charseqs=args.including_charseqs, seq2seq=seq2seq)
2022-03-01T14:08:53Z   File "/srv/nametag/nametag2_dataset.py", line 223, in next_batch
2022-03-01T14:08:53Z     return self._next_batch(batch_perm, including_charseqs, seq2seq)
2022-03-01T14:08:53Z   File "/srv/nametag/nametag2_dataset.py", line 304, in _next_batch
2022-03-01T14:08:53Z     for i, embeddings in enumerate(self._bert.compute_embeddings("bert-base-multilingual-uncased-last4", batch_sentences)):
2022-03-01T14:08:53Z   File "/srv/nametag/wembedding_service/wembeddings/wembeddings.py", line 168, in compute_embeddings
2022-03-01T14:08:53Z     data=json.dumps({"model": model, "sentences": sentences}, ensure_ascii=True).encode("ascii"),
2022-03-01T14:08:53Z   File "/usr/lib/python3.5/urllib/request.py", line 163, in urlopen
2022-03-01T14:08:53Z     return opener.open(url, data, timeout)
2022-03-01T14:08:53Z   File "/usr/lib/python3.5/urllib/request.py", line 472, in open
2022-03-01T14:08:53Z     response = meth(req, response)
2022-03-01T14:08:53Z   File "/usr/lib/python3.5/urllib/request.py", line 582, in http_response
2022-03-01T14:08:53Z     'http', request, response, code, msg, hdrs)
2022-03-01T14:08:53Z   File "/usr/lib/python3.5/urllib/request.py", line 510, in error
2022-03-01T14:08:53Z     return self._call_chain(*args)
2022-03-01T14:08:53Z   File "/usr/lib/python3.5/urllib/request.py", line 444, in _call_chain
2022-03-01T14:08:53Z     result = func(*args)
2022-03-01T14:08:53Z   File "/usr/lib/python3.5/urllib/request.py", line 590, in http_error_default
2022-03-01T14:08:53Z     raise HTTPError(req.full_url, code, msg, hdrs, fp)
2022-03-01T14:08:53Z urllib.error.HTTPError: HTTP Error 400: Bad Request
2022-03-01T14:08:53Z 10.88.0.9 - - [01/Mar/2022 14:08:53] "GET /recognize?data=Kdy%20slav%C3%AD%20sv%C3%A1tek%20Oto&output=vertical HTTP/1.1" 400 -

Wembeddings error:

2022-03-01T14:08:53Z Traceback (most recent call last):
2022-03-01T14:08:53Z   File "/srv/wembeddings/wembeddings/wembeddings_server.py", line 67, in do_POST
2022-03-01T14:08:53Z     sentences_embeddings = request.server._wembeddings.compute_embeddings(model, sentences)
2022-03-01T14:08:53Z   File "/srv/wembeddings/wembeddings/wembeddings.py", line 143, in compute_embeddings
2022-03-01T14:08:53Z     embeddings_with_parts = model.compute_embeddings(np_subwords, np_segments).numpy()
2022-03-01T14:08:53Z   File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/eager/function.py", line 1655, in __call__
2022-03-01T14:08:53Z     return self._call_impl(args, kwargs)
2022-03-01T14:08:53Z   File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/eager/function.py", line 1665, in _call_impl
2022-03-01T14:08:53Z     cancellation_manager)
2022-03-01T14:08:53Z   File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/eager/function.py", line 1745, in _call_with_structured_signature
2022-03-01T14:08:53Z     return self._filtered_call(args, kwargs, cancellation_manager)
2022-03-01T14:08:53Z   File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/eager/function.py", line 1848, in _filtered_call
2022-03-01T14:08:53Z     cancellation_manager=cancellation_manager)
2022-03-01T14:08:53Z   File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/eager/function.py", line 1924, in _call_flat
2022-03-01T14:08:53Z     ctx, args, cancellation_manager=cancellation_manager))
2022-03-01T14:08:53Z   File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/eager/function.py", line 550, in call
2022-03-01T14:08:53Z     ctx=ctx)
2022-03-01T14:08:53Z   File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/eager/execute.py", line 60, in quick_execute
2022-03-01T14:08:53Z     inputs, attrs, num_outputs)
2022-03-01T14:08:53Z tensorflow.python.framework.errors_impl.InvalidArgumentError:  indices[1,3] = -1 is not in [0, 105879)
2022-03-01T14:08:53Z 	 [[node tf_bert_model/bert/embeddings/Gather (defined at usr/local/lib/python3.6/dist-packages/transformers/models/bert/modeling_tf_bert.py:190) ]] [Op:__inference_compute_embeddings_7904]
2022-03-01T14:08:53Z 
2022-03-01T14:08:53Z Errors may have originated from an input operation.
2022-03-01T14:08:53Z Input Source operations connected to node tf_bert_model/bert/embeddings/Gather:
2022-03-01T14:08:53Z  subwords (defined at srv/wembeddings/wembeddings/wembeddings.py:68)
2022-03-01T14:08:53Z 
2022-03-01T14:08:53Z Function call stack:
2022-03-01T14:08:53Z compute_embeddings
2022-03-01T14:08:53Z 

Wembeddings full log (with errors from TensorFlow at the beginning):

2022-03-01T14:07:49Z OpenBLAS WARNING - could not determine the L2 cache size on this system, assuming 256k
2022-03-01T14:07:50Z 2022-03-01 14:07:50.930771: W tensorflow/stream_executor/platform/default/dso_loader.cc:59] Could not load dynamic library 'libcudart.so.10.1'; dlerror: libcudart.so.10.1: cannot open shared object file: No such file or directory
2022-03-01T14:07:50Z 2022-03-01 14:07:50.930932: I tensorflow/stream_executor/cuda/cudart_stub.cc:29] Ignore above cudart dlerror if you do not have a GPU set up on your machine.
2022-03-01T14:07:53Z Starting WEmbeddings server on port 8000.
2022-03-01T14:07:53Z To stop it gracefully, either send SIGINT (Ctrl+C) or SIGUSR1.
2022-03-01T14:08:29Z 2022-03-01 14:08:29.779715: W tensorflow/stream_executor/platform/default/dso_loader.cc:59] Could not load dynamic library 'libcuda.so.1'; dlerror: libcuda.so.1: cannot open shared object file: No such file or directory
2022-03-01T14:08:29Z 2022-03-01 14:08:29.779901: W tensorflow/stream_executor/cuda/cuda_driver.cc:312] failed call to cuInit: UNKNOWN ERROR (303)
2022-03-01T14:08:29Z 2022-03-01 14:08:29.780000: I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:156] kernel driver does not appear to be running on this host (4e281229591c): /proc/driver/nvidia/version does not exist
2022-03-01T14:08:29Z 2022-03-01 14:08:29.780756: I tensorflow/core/platform/cpu_feature_guard.cc:142] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN)to use the following CPU instructions in performance-critical operations:  AVX2 FMA
2022-03-01T14:08:29Z To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
2022-03-01T14:08:29Z 2022-03-01 14:08:29.794271: I tensorflow/core/platform/profile_utils/cpu_utils.cc:104] CPU Frequency: 2591965000 Hz
2022-03-01T14:08:29Z 2022-03-01 14:08:29.795785: I tensorflow/compiler/xla/service/service.cc:168] XLA service 0x7fa6bc155490 initialized for platform Host (this does not guarantee that XLA will be used). Devices:
2022-03-01T14:08:29Z 2022-03-01 14:08:29.796253: I tensorflow/compiler/xla/service/service.cc:176]   StreamExecutor device (0): Host, Default Version
2022-03-01T14:08:42Z Some layers from the model checkpoint at bert-base-multilingual-uncased were not used when initializing TFBertModel: ['mlm___cls', 'nsp___cls']
2022-03-01T14:08:42Z - This IS expected if you are initializing TFBertModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
2022-03-01T14:08:42Z - This IS NOT expected if you are initializing TFBertModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
2022-03-01T14:08:42Z All the layers of TFBertModel were initialized from the model checkpoint at bert-base-multilingual-uncased.
2022-03-01T14:08:42Z If your task is similar to the task the model of the checkpoint was trained on, you can already use TFBertModel for predictions without further training.
2022-03-01T14:08:52Z WARNING:tensorflow:From /usr/local/lib/python3.6/dist-packages/tensorflow/python/util/deprecation.py:574: calling map_fn_v2 (from tensorflow.python.ops.map_fn) with dtype is deprecated and will be removed in a future version.
2022-03-01T14:08:52Z Instructions for updating:
2022-03-01T14:08:52Z Use fn_output_signature instead
2022-03-01T14:08:53Z Traceback (most recent call last):
2022-03-01T14:08:53Z   File "/srv/wembeddings/wembeddings/wembeddings_server.py", line 67, in do_POST
2022-03-01T14:08:53Z     sentences_embeddings = request.server._wembeddings.compute_embeddings(model, sentences)
2022-03-01T14:08:53Z   File "/srv/wembeddings/wembeddings/wembeddings.py", line 143, in compute_embeddings
2022-03-01T14:08:53Z     embeddings_with_parts = model.compute_embeddings(np_subwords, np_segments).numpy()
2022-03-01T14:08:53Z   File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/eager/function.py", line 1655, in __call__
2022-03-01T14:08:53Z     return self._call_impl(args, kwargs)
2022-03-01T14:08:53Z   File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/eager/function.py", line 1665, in _call_impl
2022-03-01T14:08:53Z     cancellation_manager)
2022-03-01T14:08:53Z   File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/eager/function.py", line 1745, in _call_with_structured_signature
2022-03-01T14:08:53Z     return self._filtered_call(args, kwargs, cancellation_manager)
2022-03-01T14:08:53Z   File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/eager/function.py", line 1848, in _filtered_call
2022-03-01T14:08:53Z     cancellation_manager=cancellation_manager)
2022-03-01T14:08:53Z   File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/eager/function.py", line 1924, in _call_flat
2022-03-01T14:08:53Z     ctx, args, cancellation_manager=cancellation_manager))
2022-03-01T14:08:53Z   File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/eager/function.py", line 550, in call
2022-03-01T14:08:53Z     ctx=ctx)
2022-03-01T14:08:53Z   File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/eager/execute.py", line 60, in quick_execute
2022-03-01T14:08:53Z     inputs, attrs, num_outputs)
2022-03-01T14:08:53Z tensorflow.python.framework.errors_impl.InvalidArgumentError:  indices[1,3] = -1 is not in [0, 105879)
2022-03-01T14:08:53Z 	 [[node tf_bert_model/bert/embeddings/Gather (defined at usr/local/lib/python3.6/dist-packages/transformers/models/bert/modeling_tf_bert.py:190) ]] [Op:__inference_compute_embeddings_7904]
2022-03-01T14:08:53Z 
2022-03-01T14:08:53Z Errors may have originated from an input operation.
2022-03-01T14:08:53Z Input Source operations connected to node tf_bert_model/bert/embeddings/Gather:
2022-03-01T14:08:53Z  subwords (defined at srv/wembeddings/wembeddings/wembeddings.py:68)
2022-03-01T14:08:53Z 
2022-03-01T14:08:53Z Function call stack:
2022-03-01T14:08:53Z compute_embeddings
2022-03-01T14:08:53Z 
2022-03-01T14:08:53Z 10.88.0.10 - - [01/Mar/2022 14:08:53] "POST /wembeddings HTTP/1.1" 400 -

My captured payload sent to wembeddings (before encoding to ASCII bytes):
{"model": "bert-base-multilingual-uncased-last4", "sentences": [["Kdy", "slav\u00ed", "sv\u00e1tek"], ["Oto"]]}

My environment for NameTag2:
OS: MacOS 11.6.4
Container engine: Podman 3.4.4
using Dockerfile in branch nametag2 https://github.com/ufal/nametag/blob/nametag2/Dockerfile
Python version in the image: Python 3.5.2 (affected probably 3.5 and lower)

My environment for Wembeddings:
OS: MacOS 11.6.4
Container engine: Podman 3.4.4
using Dockerfile on the master branch https://github.com/ufal/wembedding_service/blob/master/Dockerfile

I discovered two more sentences which fail in the same way with the same error, but on http://lindat.mff.cuni.cz/services/nametag/ they work normally; I'm not sure why.
The sentences are:

  • Chci najít transakci 1133.40 USD
  • Kdy ma svatek Vanda? Co Vratislav

Note: I'm a Czech and we can continue in Czech if you would prefer it that way :-)

Server returns invalid JSON when output is set to "vertical" and no entities were found.

This gives me expected output:
>>> curl "localhost:8000/recognize?output=vertical&data=Vaclav"

{
 "model": "cnec2.ner",
 "acknowledgements": [
  "http://ufal.mff.cuni.cz/nametag#nametag_acknowledgements",
  "http://ufal.mff.cuni.cz/morphodita#morphodita_acknowledgements"
 ],
 "result": "8029473260223733771\t?\tVaclav\n"
}

This gives me Invalid JSON:
>>> curl "localhost:8000/recognize?output=vertical&data=xxxx"

{
 "model": "cnec2.ner",
 "acknowledgements": [
  "http://ufal.mff.cuni.cz/nametag#nametag_acknowledgements",
  "http://ufal.mff.cuni.cz/morphodita#morphodita_acknowledgements"
 ],
 "result": }

Feel free to contact me to get more information.

R wrapper / MorphoDiTa

FYI.
I've built an R wrapper around NameTag, https://github.com/bnosac/nametagger, so that I can easily use it to construct a baseline NER model and compare it to a baseline CRF or other deep-learning approaches which require more computing resources.

While I was doing this, I was wondering: is there an easy way to extract a MorphoDiTa model from a .udpipe file, such that I can use it with the tagger morphodita:model?

Invalid and incorrect JSON responses for some Python runs for Py 3.5 and lower [nametag2]

If using Python 3.5 (e.g. using the Dockerfile), some runs of the NameTag2 REST server return the response JSON with attributes in a different order than "model, acknowledgements, result". When this happens, the response is incorrect and sometimes even invalid JSON (see the examples below).
When I restart the container with the NameTag server (so the NameTag2 server stops and starts), the order of the attributes in the JSON response changes.
If the order is truly random, it means that in 2/3 of cases the response has the wrong order and is incorrect, and in 1/3 of cases it is incorrect AND invalid JSON.

Curl to test it:
curl --location --request GET 'localhost:8001/recognize?data=Ondra&output=vertical'

Result from one run of the NameTag2 server (the correct one):

{
 "acknowledgements": [
  "http://ufal.mff.cuni.cz/nametag/2#acknowledgements",
  "ack-text"
 ],
 "result": "",
 "model": "czech-cnec2.0-2008311\tgu\tOndra\n"
}

Result from another run of the NameTag2 server (the incorrect one):

{
 "acknowledgements": [
  "http://ufal.mff.cuni.cz/nametag/2#acknowledgements",
  "ack-text"
 ],
 "model": "czech-cnec2.0-200831",
 "result": "1\tgu\tOndra\n"
}

Result from yet another run of the NameTag2 server (the incorrect and invalid one):

{
 "result": "",
 "model": "czech-cnec2.0-200831",
 "acknowledgements": [
  "http://ufal.mff.cuni.cz/nametag/2#acknowledgements",
  "ack-text"
 1\tgu\tOndra\n"
}

My environment:
OS: MacOS 11.6.4
Container engine: Podman 3.4
using Dockerfile in branch nametag2 https://github.com/ufal/nametag/blob/nametag2/Dockerfile
Python version in the image: Python 3.5.2 (affected probably 3.5 and lower)

I have also tested this outside of Podman/containers with Python 3.7.9, and I cannot replicate this bug there (further supporting my explanation below).

My understanding of the issue:
Based on https://stackoverflow.com/questions/39980323/are-dictionaries-ordered-in-python-3-6, before Python 3.6 (i.e. 3.5 and lower) dict did not guarantee iteration order. Since 3.6, the Python dict is insertion-ordered, so it keeps the order in which the attributes were added to it.

My non-exhaustive list of possible solutions:

  • So one solution is to use at least Python 3.6 in the Dockerfile and depend on the fact that the insertion order is always kept (note: I have tried using the docker image tensorflow/tensorflow:1.15.0, but it doesn't have curl preinstalled, so other tweaks would have to be made; still, I believe it's a feasible solution).
  • Another is to use collections.OrderedDict so it works even in Python 3.5 or lower: https://docs.python.org/3/library/collections.html#collections.OrderedDict
  • But the most interesting (IMHO) solution is to stop removing the last 3 chars of the resulting JSON (at https://github.com/ufal/nametag/blob/nametag2/nametag2_server.py#L540 ) and just build it whole in one step, which has the benefit of not being confusing (see the sketch after this list).
  • And there are probably other solutions.
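
A minimal sketch of that last option, assembling the whole response with one json.dumps call over an OrderedDict; the field values here are just the ones from the examples above:

import json
from collections import OrderedDict

# OrderedDict keeps the key order even on Python 3.5; on 3.7+ a plain dict would do.
response = OrderedDict([
    ("model", "czech-cnec2.0-200831"),
    ("acknowledgements", [
        "http://ufal.mff.cuni.cz/nametag/2#acknowledgements",
        "ack-text",
    ]),
    ("result", "1\tgu\tOndra\n"),
])

print(json.dumps(response, indent=1, ensure_ascii=False))  # always valid JSON, keys in the intended order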

Note: I'm a Czech and we can continue in Czech if you would prefer it that way :-)
