dwadden / dygiepp


Span-based system for named entity, relation, and event extraction.

License: MIT License

Python 82.99% Jsonnet 1.77% Shell 2.26% Perl 1.13% Dockerfile 0.58% Jupyter Notebook 11.27%


dygiepp's Issues

_normalize_word method in IEJsonReader class

Hi dwadden,

Thank you very much for uploading your code to GitHub.
I have succeeded in running and training your model in my environment by following your instructions.

I have just noticed that the _normalize_word method does NOT take a "self" argument.
Since the model seems to work, it is not a big deal, but I am reporting it for your reference.
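
A minimal sketch of why a method defined without "self" can go unnoticed: it still works when called through the class, and only fails when called on an instance. The class name `ReaderSketch` and the method body are hypothetical and are not taken from the dygiepp source.

```python
class ReaderSketch:
    """Hypothetical stand-in for a dataset reader; not the real IEJsonReader."""

    def _normalize_word(word):  # note: no `self` parameter
        # Illustrative normalization only: strip a leading slash from "/." or "/?".
        return word[1:] if word in ("/.", "/?") else word

    @staticmethod
    def _normalize_word_static(word):
        # Same logic, but explicitly declared static, so both call styles work.
        return word[1:] if word in ("/.", "/?") else word


def call_on_instance_fails() -> bool:
    """Return True if calling the self-less method on an instance raises TypeError."""
    try:
        ReaderSketch()._normalize_word("/.")
    except TypeError:
        return True
    return False


# Called through the class, the function receives exactly one argument.
print(ReaderSketch._normalize_word("/."))
# Called on an instance, Python passes the instance as an extra first argument.
print(call_on_instance_fails())
# The @staticmethod version works either way.
print(ReaderSketch()._normalize_word_static("/?"))
```

Declaring the method with `@staticmethod`, or adding `self`, would make the intent explicit in either case.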

How to use a locally downloaded "roberta-base" model?

When I use ace05-relation.tar.gz to predict on my own dataset, the following error message occurs:
Downloading: 4%|▍ | 20.7M/501M [30:35<108:46:45, 1.23kB/s]
Downloading: 4%|▍ | 20.7M/501M [30:45<101:02:20, 1.32kB/s]
Downloading: 4%|▍ | 20.7M/501M [31:42<211:21:11, 632B/s]
Downloading: 4%|▍ | 20.7M/501M [33:21<388:42:18, 343B/s]
Downloading: 4%|▍ | 20.7M/501M [34:21<418:01:17, 319B/s]
Downloading: 4%|▍ | 20.8M/501M [34:46<355:17:32, 376B/s]
Downloading: 4%|▍ | 20.8M/501M [34:49<255:20:31, 523B/s]
Downloading: 4%|▍ | 20.8M/501M [34:52<185:06:53, 721B/s]
Downloading: 4%|▍ | 20.8M/501M [34:55<138:31:05, 963B/s]
Downloading: 4%|▍ | 20.8M/501M [34:59<106:28:06, 1.25kB/s]("Connection broken: ConnectionResetError(104, 'Connection reset by peer')", ConnectionResetError(104, 'Connection reset by peer'))
2020-12-26 21:32:33,238 - INFO - allennlp.models.archival - removing temporary unarchived model dir at /tmp/tmporiove4f
Traceback (most recent call last):
File "/home/wangpancheng/anaconda3/envs/dygie_ace/lib/python3.7/site-packages/urllib3/response.py", line 438, in _error_catcher
yield
File "/home/wangpancheng/anaconda3/envs/dygie_ace/lib/python3.7/site-packages/urllib3/response.py", line 519, in read
data = self._fp.read(amt) if not fp_closed else b""
File "/home/wangpancheng/anaconda3/envs/dygie_ace/lib/python3.7/http/client.py", line 461, in read
n = self.readinto(b)
File "/home/wangpancheng/anaconda3/envs/dygie_ace/lib/python3.7/http/client.py", line 505, in readinto
n = self.fp.readinto(b)
File "/home/wangpancheng/anaconda3/envs/dygie_ace/lib/python3.7/socket.py", line 589, in readinto
return self._sock.recv_into(b)
File "/home/wangpancheng/anaconda3/envs/dygie_ace/lib/python3.7/ssl.py", line 1071, in recv_into
return self.read(nbytes, buffer)
File "/home/wangpancheng/anaconda3/envs/dygie_ace/lib/python3.7/ssl.py", line 929, in read
return self._sslobj.read(len, buffer)
ConnectionResetError: [Errno 104] Connection reset by peer

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
File "/home/wangpancheng/anaconda3/envs/dygie_ace/lib/python3.7/site-packages/requests/models.py", line 753, in generate
for chunk in self.raw.stream(chunk_size, decode_content=True):
File "/home/wangpancheng/anaconda3/envs/dygie_ace/lib/python3.7/site-packages/urllib3/response.py", line 576, in stream
data = self.read(amt=amt, decode_content=decode_content)
File "/home/wangpancheng/anaconda3/envs/dygie_ace/lib/python3.7/site-packages/urllib3/response.py", line 541, in read
raise IncompleteRead(self._fp_bytes_read, self.length_remaining)
File "/home/wangpancheng/anaconda3/envs/dygie_ace/lib/python3.7/contextlib.py", line 130, in __exit__
self.gen.throw(type, value, traceback)
File "/home/wangpancheng/anaconda3/envs/dygie_ace/lib/python3.7/site-packages/urllib3/response.py", line 455, in _error_catcher
raise ProtocolError("Connection broken: %r" % e, e)
urllib3.exceptions.ProtocolError: ("Connection broken: ConnectionResetError(104, 'Connection reset by peer')", ConnectionResetError(104, 'Connection reset by peer'))

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
File "/home/wangpancheng/anaconda3/envs/dygie_ace/lib/python3.7/site-packages/transformers/modeling_utils.py", line 926, in from_pretrained
local_files_only=local_files_only,
File "/home/wangpancheng/anaconda3/envs/dygie_ace/lib/python3.7/site-packages/transformers/file_utils.py", line 1007, in cached_path
local_files_only=local_files_only,
File "/home/wangpancheng/anaconda3/envs/dygie_ace/lib/python3.7/site-packages/transformers/file_utils.py", line 1216, in get_from_cache
http_get(url_to_download, temp_file, proxies=proxies, resume_size=resume_size, user_agent=user_agent)
File "/home/wangpancheng/anaconda3/envs/dygie_ace/lib/python3.7/site-packages/transformers/file_utils.py", line 1088, in http_get
for chunk in r.iter_content(chunk_size=1024):
File "/home/wangpancheng/anaconda3/envs/dygie_ace/lib/python3.7/site-packages/requests/models.py", line 756, in generate
raise ChunkedEncodingError(e)
requests.exceptions.ChunkedEncodingError: ("Connection broken: ConnectionResetError(104, 'Connection reset by peer')", ConnectionResetError(104, 'Connection reset by peer'))

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
File "/home/wangpancheng/anaconda3/envs/dygie_ace/bin/allennlp", line 8, in <module>
sys.exit(run())
File "/home/wangpancheng/anaconda3/envs/dygie_ace/lib/python3.7/site-packages/allennlp/__main__.py", line 34, in run
main(prog="allennlp")
File "/home/wangpancheng/anaconda3/envs/dygie_ace/lib/python3.7/site-packages/allennlp/commands/__init__.py", line 118, in main
args.func(args)
File "/home/wangpancheng/anaconda3/envs/dygie_ace/lib/python3.7/site-packages/allennlp/commands/predict.py", line 205, in _predict
predictor = _get_predictor(args)
File "/home/wangpancheng/anaconda3/envs/dygie_ace/lib/python3.7/site-packages/allennlp/commands/predict.py", line 110, in _get_predictor
overrides=args.overrides,
File "/home/wangpancheng/anaconda3/envs/dygie_ace/lib/python3.7/site-packages/allennlp/models/archival.py", line 208, in load_archive
model = _load_model(config.duplicate(), weights_path, serialization_dir, cuda_device)
File "/home/wangpancheng/anaconda3/envs/dygie_ace/lib/python3.7/site-packages/allennlp/models/archival.py", line 246, in _load_model
cuda_device=cuda_device,
File "/home/wangpancheng/anaconda3/envs/dygie_ace/lib/python3.7/site-packages/allennlp/models/model.py", line 406, in load
return model_class._load(config, serialization_dir, weights_file, cuda_device)
File "/home/wangpancheng/anaconda3/envs/dygie_ace/lib/python3.7/site-packages/allennlp/models/model.py", line 305, in _load
vocab=vocab, params=model_params, serialization_dir=serialization_dir
File "/home/wangpancheng/anaconda3/envs/dygie_ace/lib/python3.7/site-packages/allennlp/common/from_params.py", line 601, in from_params
**extras,
File "/home/wangpancheng/anaconda3/envs/dygie_ace/lib/python3.7/site-packages/allennlp/common/from_params.py", line 629, in from_params
kwargs = create_kwargs(constructor_to_inspect, cls, params, **extras)
File "/home/wangpancheng/anaconda3/envs/dygie_ace/lib/python3.7/site-packages/allennlp/common/from_params.py", line 200, in create_kwargs
cls.__name__, param_name, annotation, param.default, params, **extras
File "/home/wangpancheng/anaconda3/envs/dygie_ace/lib/python3.7/site-packages/allennlp/common/from_params.py", line 307, in pop_and_construct_arg
return construct_arg(class_name, name, popped_params, annotation, default, **extras)
File "/home/wangpancheng/anaconda3/envs/dygie_ace/lib/python3.7/site-packages/allennlp/common/from_params.py", line 341, in construct_arg
return annotation.from_params(params=popped_params, **subextras)
File "/home/wangpancheng/anaconda3/envs/dygie_ace/lib/python3.7/site-packages/allennlp/common/from_params.py", line 601, in from_params
**extras,
File "/home/wangpancheng/anaconda3/envs/dygie_ace/lib/python3.7/site-packages/allennlp/common/from_params.py", line 629, in from_params
kwargs = create_kwargs(constructor_to_inspect, cls, params, **extras)
File "/home/wangpancheng/anaconda3/envs/dygie_ace/lib/python3.7/site-packages/allennlp/common/from_params.py", line 200, in create_kwargs
cls.__name__, param_name, annotation, param.default, params, **extras
File "/home/wangpancheng/anaconda3/envs/dygie_ace/lib/python3.7/site-packages/allennlp/common/from_params.py", line 307, in pop_and_construct_arg
return construct_arg(class_name, name, popped_params, annotation, default, **extras)
File "/home/wangpancheng/anaconda3/envs/dygie_ace/lib/python3.7/site-packages/allennlp/common/from_params.py", line 388, in construct_arg
**extras,
File "/home/wangpancheng/anaconda3/envs/dygie_ace/lib/python3.7/site-packages/allennlp/common/from_params.py", line 341, in construct_arg
return annotation.from_params(params=popped_params, **subextras)
File "/home/wangpancheng/anaconda3/envs/dygie_ace/lib/python3.7/site-packages/allennlp/common/from_params.py", line 601, in from_params
**extras,
File "/home/wangpancheng/anaconda3/envs/dygie_ace/lib/python3.7/site-packages/allennlp/common/from_params.py", line 631, in from_params
return constructor_to_call(**kwargs) # type: ignore
File "/home/wangpancheng/anaconda3/envs/dygie_ace/lib/python3.7/site-packages/allennlp/modules/token_embedders/pretrained_transformer_mismatched_embedder.py", line 65, in __init__
transformer_kwargs=transformer_kwargs,
File "/home/wangpancheng/anaconda3/envs/dygie_ace/lib/python3.7/site-packages/allennlp/modules/token_embedders/pretrained_transformer_embedder.py", line 79, in __init__
**(transformer_kwargs or {}),
File "/home/wangpancheng/anaconda3/envs/dygie_ace/lib/python3.7/site-packages/allennlp/common/cached_transformers.py", line 86, in get
**kwargs,
File "/home/wangpancheng/anaconda3/envs/dygie_ace/lib/python3.7/site-packages/transformers/models/auto/modeling_auto.py", line 656, in from_pretrained
pretrained_model_name_or_path, *model_args, config=config, **kwargs
File "/home/wangpancheng/anaconda3/envs/dygie_ace/lib/python3.7/site-packages/transformers/modeling_utils.py", line 935, in from_pretrained
raise EnvironmentError(msg)
OSError: Can't load weights for 'roberta-base'. Make sure that:

'roberta-base' is a correct model identifier listed on 'https://huggingface.co/models'
or 'roberta-base' is the correct path to a directory containing a file named one of pytorch_model.bin, tf_model.h5, model.ckpt.

It seems that loading the archive triggers an online download of the pretrained roberta-base weights, and because my connection is slow, the download keeps failing.
How can I use a locally downloaded roberta-base model instead? @dwadden Thanks.
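
One possible approach, sketched under assumptions: the error message above says 'roberta-base' may also be a path to a directory containing pytorch_model.bin, and the traceback shows that `allennlp predict` accepts overrides. The sketch below builds an overrides JSON string that points the model name at a local directory. The nested config keys (`dataset_reader.token_indexers.bert` and `model.embedder.token_embedders.bert`) and the local path are assumptions; check them against the config.json inside ace05-relation.tar.gz before use.

```python
import json


def build_overrides(local_model_dir: str) -> str:
    """Build a JSON overrides string that replaces the transformer model name.

    The key layout below is an assumption about the archived config and
    should be verified against config.json inside the model archive.
    """
    overrides = {
        "dataset_reader": {
            "token_indexers": {"bert": {"model_name": local_model_dir}}
        },
        "model": {
            "embedder": {
                "token_embedders": {"bert": {"model_name": local_model_dir}}
            }
        },
    }
    return json.dumps(overrides)


# Hypothetical local directory holding pytorch_model.bin plus the
# config and tokenizer files downloaded separately.
local_dir = "/path/to/roberta-base"
print("--overrides '" + build_overrides(local_dir) + "'")
```

The printed fragment would be appended to the usual `allennlp predict` command. Whether every component of the archive honors the override is not confirmed here.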

Training for ACE Dataset

Hi,
When are you going to release the code for training on the ACE dataset?
Thanks

Training on WLP dataset?

Hi,
Are you planning to release the code for training a model on the WLP corpus as well?

Thanks!

Apply on Roles & Triggers across sentences.

Hi, I'd like to apply DyGIE++ on the Roles Across Multiple Sentences (RAMS) dataset.

In the RAMS dataset, the event triggers and arguments may be in separate sentences. For example, the trigger could be in sentence 3, while the victim and killer are in sentence 4.

But looking at data.md, it seems that the data format requires the trigger and its arguments to be in the same sentence. Is DyGIE++ capable of processing event extraction across sentences?
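
To see how many events in a dataset are affected, a small check along these lines may help. It is only a sketch: it assumes document-level token indices and a list of sentence lengths, and the function names are hypothetical rather than part of DyGIE++.

```python
from bisect import bisect_right
from itertools import accumulate
from typing import List, Tuple


def sentence_index(token_idx: int, sentence_lengths: List[int]) -> int:
    """Map a document-level token index to the index of its sentence."""
    # Exclusive end offset of each sentence, e.g. [5, 4, 6] -> [5, 9, 15].
    ends = list(accumulate(sentence_lengths))
    return bisect_right(ends, token_idx)


def is_cross_sentence(
    trigger_idx: int,
    arg_spans: List[Tuple[int, int]],
    sentence_lengths: List[int],
) -> bool:
    """Return True if any argument span lies outside the trigger's sentence."""
    trigger_sentence = sentence_index(trigger_idx, sentence_lengths)
    return any(
        sentence_index(start, sentence_lengths) != trigger_sentence
        or sentence_index(end, sentence_lengths) != trigger_sentence
        for start, end in arg_spans
    )


# Three sentences of 5, 4 and 6 tokens; the trigger is token 6 (second sentence).
lengths = [5, 4, 6]
print(is_cross_sentence(6, [(10, 11)], lengths))  # argument in the third sentence
print(is_cross_sentence(6, [(5, 7)], lengths))    # argument in the same sentence
```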

Docker Build Fail

I cloned this repository from the master (main) branch, but the Docker build fails with the following error message.

=> [internal] load build definition from Dockerfile 0.2s
=> => transferring dockerfile: 3.33kB 0.1s
=> [internal] load .dockerignore 0.2s
=> => transferring context: 443B 0.1s
=> [internal] load metadata for docker.io/pytorch/pytorch:1.6.0-cuda10.1-cudnn7-devel 6.2s
=> [auth] pytorch/pytorch:pull token for registry-1.docker.io 0.0s
=> CANCELED [ 1/29] FROM docker.io/pytorch/pytorch:1.6.0-cuda10.1-cudnn7-devel@sha256:ccebb46f954b1d32a4700aaeae0e24bd68653f92c6f276a608bf592b660b63d7 122.3s
=> => resolve docker.io/pytorch/pytorch:1.6.0-cuda10.1-cudnn7-devel@sha256:ccebb46f954b1d32a4700aaeae0e24bd68653f92c6f276a608bf592b660b63d7 0.0s
=> => sha256:bb833e4d631feff31ab57559d64617ad895d3ae7f45fdb651f9ba2df50b183b7 10.06kB / 10.06kB 0.0s
=> => sha256:8c3b70e3904492c753652606df4726430426f42ea56e06ea924d6fea7ae162a1 845B / 845B 0.0s
=> => sha256:ee9b457b77d047ff322858e2de025e266ff5908aec569560e77e2e4451fc23f4 184B / 184B 0.0s
=> => sha256:ccebb46f954b1d32a4700aaeae0e24bd68653f92c6f276a608bf592b660b63d7 3.05kB / 3.05kB 0.0s
=> => sha256:c1bbdc448b7263673926b8fe2e88491e5083a8b4b06ddfabf311f2fc5f27e2ff 35.36kB / 35.36kB 0.0s
=> => sha256:7ddbc47eeb70dc7f08e410a6667948b87ff3883024eb41478b44ef9a81bf400c 26.69MB / 26.69MB 0.0s
=> => sha256:45d437916d5781043432f2d72608049dcf74ddbd27daa01a25fa63c8f1b9adc4 162B / 162B 0.0s
=> => sha256:d8f1569ddae616589c5a2dabf668fadd250ee9d89253ef16f0cb0c8a9459b322 7.22MB / 7.22MB 0.0s
=> => sha256:85386706b02069c58ffaea9de66c360f9d59890e56f58485d05c1a532ca30db1 8.45MB / 8.45MB 0.0s
=> => sha256:be4f3343ecd31ebf7ec8809f61b1d36c2c2f98fc4e63582401d9108575bc443a 77.59MB / 688.74MB 123.2s
=> => sha256:30b4effda4fdab95ec4eba8873f86e7574c2edddf4dc5df8212e3eda1545aafa 90.18MB / 820.84MB 123.2s
=> => sha256:b398e882f4149bf61faa8f2c1d47a4fe98b8fe1b2c9379da1d58ddc54fe67cf0 110.10MB / 532.41MB 123.2s
=> => extracting sha256:7ddbc47eeb70dc7f08e410a6667948b87ff3883024eb41478b44ef9a81bf400c 15.6s
=> => extracting sha256:c1bbdc448b7263673926b8fe2e88491e5083a8b4b06ddfabf311f2fc5f27e2ff 0.0s
=> => extracting sha256:8c3b70e3904492c753652606df4726430426f42ea56e06ea924d6fea7ae162a1 0.0s
=> => extracting sha256:45d437916d5781043432f2d72608049dcf74ddbd27daa01a25fa63c8f1b9adc4 0.0s
=> => extracting sha256:d8f1569ddae616589c5a2dabf668fadd250ee9d89253ef16f0cb0c8a9459b322 5.3s
=> => extracting sha256:85386706b02069c58ffaea9de66c360f9d59890e56f58485d05c1a532ca30db1 2.7s
=> => extracting sha256:ee9b457b77d047ff322858e2de025e266ff5908aec569560e77e2e4451fc23f4 0.0s
=> [internal] load build context 122.3s
=> => transferring context: 903.71MB 122.1s
=> CACHED [ 2/29] RUN mkdir /dygiepp 0.0s
=> CACHED [ 3/29] RUN apt-get update && apt-get -y install gcc make sqlite3 0.0s
=> CACHED [ 4/29] RUN conda create --name dygiepp python=3.7 -y 0.0s
=> CACHED [ 5/29] RUN conda install -c conda-forge jsonnet -y 0.0s
=> CACHED [ 6/29] COPY requirements.txt /tmp/requirements.txt 0.0s
=> CACHED [ 7/29] RUN pip install -r /tmp/requirements.txt 0.0s
=> CACHED [ 8/29] RUN conda create --name ace-event-preprocess python=3.7 -y 0.0s
=> CACHED [ 9/29] COPY scripts/data/ace-event/requirements.txt /tmp/ace-prep-requirements.txt 0.0s
=> CACHED [10/29] RUN pip install -r /tmp/ace-prep-requirements.txt 0.0s
=> CACHED [11/29] RUN python -m spacy download en 0.0s
=> CACHED [12/29] RUN apt-get install openjdk-8-jdk openjdk-8-jre wget unzip -y 0.0s
=> CACHED [13/29] COPY scripts/data/ace05/get_corenlp.sh /tmp/get_corenlp.sh 0.0s
=> CACHED [14/29] RUN cd /dygiepp/ && bash /tmp/get_corenlp.sh 0.0s
=> CACHED [15/29] RUN conda install -c conda-forge zsh -y 0.0s
=> CACHED [16/29] RUN apt-get install unzip wget -y 0.0s
=> CACHED [17/29] COPY scripts/data/shared /dygiepp/scripts/data/shared 0.0s
=> CACHED [18/29] COPY scripts/data/get_scierc.sh /tmp/get_scierc.sh 0.0s
=> CACHED [19/29] COPY dygie /dygiepp/dygie 0.0s
=> CACHED [20/29] RUN cd /dygiepp && bash /tmp/get_scierc.sh 0.0s
=> CACHED [21/29] RUN apt-get install wget -y 0.0s
=> CACHED [22/29] COPY scripts/pretrained/get_dygiepp_pretrained.sh /tmp/get_dygiepp_pretrained.sh 0.0s
=> CACHED [23/29] RUN cd /dygiepp && bash /tmp/get_dygiepp_pretrained.sh 0.0s
=> ERROR [24/29] COPY scripts/pretrained/get_scibert.py /tmp/get_scibert.py 0.0s

[24/29] COPY scripts/pretrained/get_scibert.py /tmp/get_scibert.py:

failed to compute cache key: "/scripts/pretrained/get_scibert.py" not found: not found


Looking forward to hearing from you soon!
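
The "not found" message on the COPY step suggests that the source file is missing from the build context, either because it is absent from the checkout or because .dockerignore excludes it. A rough way to list such files before building is sketched below. It only handles the simple `COPY src... dest` form (no JSON-array syntax, wildcards, or line continuations), does not consult .dockerignore, and uses hypothetical function names.

```python
import tempfile
from pathlib import Path
from typing import List


def missing_copy_sources(dockerfile_text: str, context_dir: str) -> List[str]:
    """List COPY sources that do not exist under the build context directory."""
    missing = []
    for line in dockerfile_text.splitlines():
        parts = line.strip().split()
        if not parts or parts[0].upper() != "COPY":
            continue
        flags = [p for p in parts[1:] if p.startswith("--")]
        if any(f.startswith("--from=") for f in flags):
            continue  # copied from another build stage, not from the context
        args = [p for p in parts[1:] if not p.startswith("--")]
        for src in args[:-1]:  # the last argument is the destination
            if not (Path(context_dir) / src.lstrip("/")).exists():
                missing.append(src)
    return missing


def demo() -> List[str]:
    """Build a throwaway context with one of two COPY sources present."""
    dockerfile = (
        "FROM scratch\n"
        "COPY scripts/data/get_scierc.sh /tmp/get_scierc.sh\n"
        "COPY scripts/pretrained/get_scibert.py /tmp/get_scibert.py\n"
    )
    with tempfile.TemporaryDirectory() as ctx:
        present = Path(ctx) / "scripts" / "data" / "get_scierc.sh"
        present.parent.mkdir(parents=True)
        present.write_text("#!/bin/bash\n")
        return missing_copy_sources(dockerfile, ctx)


print(demo())
```

Running `missing_copy_sources(Path("Dockerfile").read_text(), ".")` from the repository root would apply the same check to a real checkout.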

No module named 'torch._C'

Hi
Thank you for your amazing work and for publishing the code!

While trying to reproduce your predictions on the existing dataset, I ran into the error in the title when running the command below. Can you please help me out?

allennlp predict ./scripts/pretrained/genia-lightweight.tar.gz \
    ./scripts/processed_data/json-coref-ident-only/test.json \
    --predictor dygie \
    --include-package dygie \
    --use-dataset-reader \
    --output-file predictions/genia-test.jsonl \
    --cuda-device 0

Thank you!
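
A missing `torch._C` is often reported when Python resolves `torch` to something other than a complete installed package, for example a local directory named torch or a partial install. Whether that applies to this issue is not known. The diagnostic sketch below uses only the standard library, and its helper names are hypothetical.

```python
import importlib.util
from pathlib import Path
from typing import Optional


def package_location(name: str) -> Optional[str]:
    """Return the file a top-level package would be loaded from, without importing it."""
    spec = importlib.util.find_spec(name)
    if spec is None:
        return None  # the package is not installed or not on sys.path
    return spec.origin


def shadowed_by_cwd(name: str, cwd: Optional[str] = None) -> bool:
    """Return True if the working directory holds a folder or file with that name."""
    base = Path(cwd) if cwd is not None else Path.cwd()
    return (base / name).is_dir() or (base / f"{name}.py").is_file()


# Compare the reported location with the active environment's site-packages.
print("torch resolves to:", package_location("torch"))
print("shadowed by current directory:", shadowed_by_cwd("torch"))
```

If the reported location is outside the conda environment, or the second line prints True, reinstalling PyTorch or changing the working directory may be worth trying.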

Questions about dataset preprocessing

The documentation describes two dataset preprocessing steps: one for entities and relations, and one for events. The first uses Stanford CoreNLP, while the second uses spaCy. Can you please explain the difference? I see that the relation labels differ between the two steps (for example, "ORG-AFF.Membership" versus "GEN-AFF"), and their offset values differ too. There are other differences as well. It would be helpful if you could provide some details.

Since ACE05 is a benchmark dataset, I assume the token, entity, relation, and event annotations are already there. Why, then, are the CoreNLP and spaCy libraries needed?
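
One general reason a tokenizer can be needed even with gold annotations: if annotations are stored as character offsets, they have to be aligned to token indices, and the resulting indices depend on the tokenizer. The sketch below illustrates that alignment with a simple regex tokenizer. It assumes character-offset annotations and is not the preprocessing code used in this repository.

```python
import re
from typing import List, Optional, Tuple

Token = Tuple[str, int, int]  # (text, char_start, char_end), end exclusive


def tokenize_with_offsets(text: str) -> List[Token]:
    """Split text into word and punctuation tokens, keeping character offsets."""
    return [(m.group(), m.start(), m.end()) for m in re.finditer(r"\w+|[^\w\s]", text)]


def char_span_to_token_span(
    tokens: List[Token], char_start: int, char_end: int
) -> Optional[Tuple[int, int]]:
    """Map a character span to inclusive token indices, or None if misaligned."""
    start_tok = end_tok = None
    for i, (_, start, end) in enumerate(tokens):
        if start == char_start:
            start_tok = i
        if end == char_end:
            end_tok = i
    if start_tok is None or end_tok is None or start_tok > end_tok:
        return None
    return start_tok, end_tok


text = "Barack Obama visited Seattle."
tokens = tokenize_with_offsets(text)
print(char_span_to_token_span(tokens, 0, 12))   # "Barack Obama"
print(char_span_to_token_span(tokens, 21, 28))  # "Seattle"
print(char_span_to_token_span(tokens, 1, 12))   # does not start on a token boundary
```

Two tokenizers that split the same text differently would give different token offsets for the same entity, which is one way offsets can differ between preprocessing pipelines.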

Cannot reproduce SciERC results reported in paper

Hello,

Thanks for the great work and paper!

I'm trying to reproduce the SciERC results, but I'm only getting:

"best_validation__rel_f1": 0.33248730964467005,
"best_validation_ner_f1": 0.6711409395972655,

The paper reports a 48.4 F1 score for relation extraction on SciERC.

My questions are:

Are there any additional steps, besides downloading the SciERC dataset, needed to train the model and reproduce the F1 scores?
I have noticed that the validation loss increases almost every epoch, while the training loss drops. Is this expected behavior?
Why did the trainer stop the training script early with the message "INFO - allennlp.training.trainer - Ran out of patience. Stopping training."?

Below are the full step-by-step installation, the log tail, system info, and pip list:

Installation:

conda create --name dygie python=3.7
conda activate dygie
conda install pytorch==1.2.0 torchvision==0.4.0 cudatoolkit=10.0 -c pytorch
pip install allennlp
pip install botocore
git clone https://github.com/dwadden/dygiepp.git
cd dygiepp/
pip install -r requirements.txt
bash ./scripts/data/get_scierc.sh 
bash ./scripts/train/train_scierc.sh 0

Log tail:

[...]
2019-11-13 16:08:07,639 - WARNING - root - NaN or Inf found in input tensor.                                                                                                                                                                                                       
2019-11-13 16:08:07,640 - WARNING - root - NaN or Inf found in input tensor.                                                                                                                                                                                                       
ner_precision: 0.9754, ner_recall: 0.9725, ner_f1: 0.9740, loss: 4.5208 ||: 100%|#########9| 499/500 [01:05<00:00,  8.05it/s]
2019-11-13 16:08:20,732 - WARNING - root - NaN or Inf found in input tensor.
2019-11-13 16:08:20,732 - WARNING - root - NaN or Inf found in input tensor.                                                                                                                                                                                                       
2019-11-13 16:08:20,734 - WARNING - root - NaN or Inf found in input tensor.                                                                                                                                                                                                       
2019-11-13 16:08:20,734 - WARNING - root - NaN or Inf found in input tensor.                                                                                                                                                                                                       
2019-11-13 16:08:20,740 - WARNING - root - NaN or Inf found in input tensor.                                                                                                                                                                                                       
2019-11-13 16:08:20,740 - WARNING - root - NaN or Inf found in input tensor.                                                                                                                                                                                                       
2019-11-13 16:08:20,742 - WARNING - root - NaN or Inf found in input tensor.                                                                                                                                                                                                       
2019-11-13 16:08:20,742 - WARNING - root - NaN or Inf found in input tensor.                                                                                                                                                                                                       
ner_precision: 0.9755, ner_recall: 0.9726, ner_f1: 0.9740, loss: 4.5118 ||: 100%|##########| 500/500 [01:05<00:00,  7.62it/s]                                                                                                                                                      
2019-11-13 16:08:20,746 - INFO - allennlp.training.trainer - Validating                                                                                                                                                                                                            
ner_precision: 0.6449, ner_recall: 0.6873, ner_f1: 0.6655, loss: 232.1718 ||: 100%|##########| 50/50 [00:02<00:00, 21.36it/s]                                                                                                                                                      
2019-11-13 16:08:23,090 - INFO - allennlp.training.trainer - Ran out of patience.  Stopping training.                                                                                                                                                                              
2019-11-13 16:08:23,090 - INFO - allennlp.training.checkpointer - loading best weights                                                                                                                                                                                             
2019-11-13 16:08:23,219 - INFO - allennlp.commands.train - To evaluate on the test set after training, pass the 'evaluate_on_test' flag, or use the 'allennlp evaluate' command.                                                                                                   
2019-11-13 16:08:23,219 - INFO - allennlp.models.archival - archiving weights and vocabulary to ./models/scierc/model.tar.gz                                                                                                                                                       
2019-11-13 16:08:40,237 - INFO - allennlp.common.util - Metrics: {
  "best_epoch": 23,                                                                                                                                                                                                                                                                
  "peak_cpu_memory_MB": 3783.14,                                                                                                                                                                                                                                                   
  "peak_gpu_0_memory_MB": 8654,                                                                                                                                                                                                                                                    
  "training_duration": "0:43:55.006665",                                                                                                                                                                                                                                           
  "training_start_epoch": 0,                                                                                                                                                                                                                                                       
  "training_epochs": 37,                                                                                                                                                                                                                                                           
  "epoch": 37,                                                                                                                                                                                                                                                                     
  "training__coref_precision": 0.8032079153669152,                                                                                                                                                                                                                                 
  "training__coref_recall": 0.52068901932887,                                                                                                                                                                                                                                      
  "training__coref_f1": 0.6030049906804981,                                                                                                                                                                                                                                        
  "training__coref_mention_recall": 0.9909547738693467,                                                                                                                                                                                                                            
  "training_ner_precision": 0.9815991970558715,                                                                                                                                                                                                                                    
  "training_ner_recall": 0.9786524349566378,                                                                                                                                                                                                                                       
  "training_ner_f1": 0.9801236011357441,
  "training__rel_precision": 0.8399269628727937,
  "training__rel_recall": 0.8051341890315052,
  "training__rel_f1": 0.8221626452189454,
  "training__rel_span_recall": 0.837222870478413,
  "training__trig_id_precision": 0,
  "training__trig_id_recall": 0,
  "training__trig_id_f1": 0,
  "training__trig_class_precision": 0,
  "training__trig_class_recall": 0,
  "training__trig_class_f1": 0,
  "training__arg_id_precision": 0,
  "training__arg_id_recall": 0,
  "training__arg_id_f1": 0,
  "training__arg_class_precision": 0,
  "training__arg_class_recall": 0,
  "training__arg_class_f1": 0,
  "training__args_multiple": 0,
  "training_loss": 4.156547112312106,
  "training_cpu_memory_MB": 3783.14,
  "training_gpu_0_memory_MB": 8654,
  "validation__coref_precision": 0.5785864145810381,
  "validation__coref_recall": 0.40000249926139114,
  "validation__coref_f1": 0.47185963848741963,
  "validation__coref_mention_recall": 0.9338235294117647,
  "validation_ner_precision": 0.6489988221436984,
  "validation_ner_recall": 0.6836228287841191,
  "validation_ner_f1": 0.6658610271902824,
  "validation__rel_precision": 0.3944954128440367,
  "validation__rel_recall": 0.378021978021978,
  "validation__rel_f1": 0.3860830527497194,
  "validation__rel_span_recall": 0.4175824175824176,
  "validation__trig_id_precision": 0,
  "validation__trig_id_recall": 0,
  "validation__trig_id_f1": 0,
  "validation__trig_class_precision": 0,
  "validation__trig_class_recall": 0,
  "validation__trig_class_f1": 0,
  "validation__arg_id_precision": 0,
  "validation__arg_id_recall": 0,
  "validation__arg_id_f1": 0,
  "validation__arg_class_precision": 0,
  "validation__arg_class_recall": 0,
  "validation__arg_class_f1": 0,
  "validation__args_multiple": 0,
  "validation_loss": 223.13119384765625,
  "best_validation__coref_precision": 0.5206815500675314,
  "best_validation__coref_recall": 0.39612618358360496,
  "best_validation__coref_f1": 0.44944354440112394,
  "best_validation__coref_mention_recall": 0.9375,
  "best_validation_ner_precision": 0.6602641056422568,
  "best_validation_ner_recall": 0.6823821339950371,
  "best_validation_ner_f1": 0.6711409395972655,
  "best_validation__rel_precision": 0.3933933933933934,
  "best_validation__rel_recall": 0.2879120879120879,
  "best_validation__rel_f1": 0.33248730964467005,
  "best_validation__rel_span_recall": 0.3208791208791209,
  "best_validation__trig_id_precision": 0,
  "best_validation__trig_id_recall": 0,
  "best_validation__trig_id_f1": 0,
  "best_validation__trig_class_precision": 0,
  "best_validation__trig_class_recall": 0,
  "best_validation__trig_class_f1": 0,
  "best_validation__arg_id_precision": 0,
  "best_validation__arg_id_recall": 0,
  "best_validation__arg_id_f1": 0,
  "best_validation__arg_class_precision": 0,
  "best_validation__arg_class_recall": 0,
  "best_validation__arg_class_f1": 0,
  "best_validation__args_multiple": 0,
  "best_validation_loss": 142.15479759216308
}
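
For context on the "Ran out of patience" message and the gap between "best_epoch" and "epoch" in the metrics above: patience-based early stopping halts training once the monitored validation metric has not improved for a set number of epochs, then restores the best checkpoint. The sketch below is a generic illustration, not AllenNLP's implementation. The patience value and the metric being monitored are assumptions that should be read from the training config.

```python
from typing import List, Optional, Tuple


def early_stop(
    metric_per_epoch: List[float],
    patience: int,
    higher_is_better: bool = True,
) -> Tuple[int, Optional[int]]:
    """Return (best_epoch, stop_epoch); stop_epoch is None if training never stops early."""
    best_epoch = 0
    best_value = metric_per_epoch[0]
    for epoch, value in enumerate(metric_per_epoch[1:], start=1):
        improved = value > best_value if higher_is_better else value < best_value
        if improved:
            best_epoch, best_value = epoch, value
        elif epoch - best_epoch >= patience:
            # No improvement for `patience` consecutive epochs: stop here.
            return best_epoch, epoch
    return best_epoch, None


# Hypothetical validation F1 per epoch; the best value is at epoch 2.
scores = [0.10, 0.20, 0.30, 0.25, 0.28, 0.29]
print(early_stop(scores, patience=3))
```

Under this scheme, a rising validation loss does not by itself trigger the stop if the trainer monitors a different metric, such as an F1 score.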

System:

Ubuntu 18.04.3 LTS
GPU 0: GeForce RTX 2080 Ti
Driver Version: 418.88
CUDA Version: 10.1

pip list:

$ pip list                                                                                                                                                                                                                                                                   
Package                       Version                                                                                                                                                                                                                                              
----------------------------- -------------------                                                                                                                                                                                                                                  
alabaster                     0.7.12                                                                                                                                                                                                                                               
allennlp                      0.9.0                                                                                                                                                                                                                                                
atomicwrites                  1.3.0                                                                                                                                                                                                                                                
attrs                         19.3.0                                                                                                                                                                                                                                               
Babel                         2.7.0                                                                                                                                                                                                                                                
beautifulsoup4                4.8.1                                                                                                                                                                                                                                                
blis                          0.2.4                                                                                                                                                                                                                                                
boto3                         1.10.16                                                                                                                                                                                                                                              
botocore                      1.13.16                                                                                                                                                                                                                                              
certifi                       2019.9.11                                                                                                                                                                                                                                            
cffi                          1.13.1                                                                                                                                                                                                                                               
chardet                       3.0.4                                                                                                                                                                                                                                                
Click                         7.0                                                                                                                                                                                                                                                  
conllu                        1.3.1                                                                                                                                                                                                                                                
cycler                        0.10.0                                                                                                                                                                                                                                               
cymem                         2.0.2                                                                                                                                                                                                                                                
docutils                      0.15.2                                                                                                                                                                                                                                               
editdistance                  0.5.3                                                                                                                                                                                                                                                
flaky                         3.6.1                                                                                                                                                                                                                                                
Flask                         1.1.1                                                                                                                                                                                                                                                
Flask-Cors                    3.0.8                                                                                                                                                                                                                                                
ftfy                          5.6                                                                                                                                                                                                                                                  
gevent                        1.4.0                                                                                                                                                                                                                                                
greenlet                      0.4.15                                                                                                                                                                                                                                               
h5py                          2.10.0                                                                                                                                                                                                                                               
idna                          2.8                                                                                                                                                                                                                                                  
imagesize                     1.1.0                                                                                                                                                                                                                                                
importlib-metadata            0.23                                                                                                                                                                                                                                                 
itsdangerous                  1.1.0                                                                                                                                                                                                                                                
Jinja2                        2.10.3
jmespath                      0.9.4
joblib                        0.14.0
jsonnet                       0.14.0
jsonpickle                    1.2
kiwisolver                    1.1.0
lxml                          4.4.1
MarkupSafe                    1.1.1
matplotlib                    3.1.1
mkl-fft                       1.0.15
mkl-random                    1.1.0
mkl-service                   2.3.0
more-itertools                7.2.0
murmurhash                    1.0.2
nltk                          3.4.5
numpy                         1.17.3
numpydoc                      0.9.1
olefile                       0.46
overrides                     2.5
packaging                     19.2
pandas                        0.25.3
parsimonious                  0.8.1
Pillow                        6.2.1
pip                           19.3.1
plac                          0.9.6
pluggy                        0.13.0
preshed                       2.0.1
protobuf                      3.10.0
py                            1.8.0
pycparser                     2.19
Pygments                      2.4.2
pyparsing                     2.4.5
pytest                        5.2.2
python-dateutil               2.8.0
python-Levenshtein            0.12.0
pytorch-pretrained-bert       0.6.2
pytorch-transformers          1.1.0
pytz                          2019.3
regex                         2019.11.1
requests                      2.22.0
responses                     0.10.6
s3transfer                    0.2.1
scikit-learn                  0.21.3
scipy                         1.3.2
sentencepiece                 0.1.83
setuptools                    41.6.0.post20191030
six                           1.13.0
snowballstemmer               2.0.0
soupsieve                     1.9.5
spacy                         2.1.9
Sphinx                        2.2.1
sphinxcontrib-applehelp       1.0.1
sphinxcontrib-devhelp         1.0.1
sphinxcontrib-htmlhelp        1.0.2
sphinxcontrib-jsmath          1.0.1
sphinxcontrib-qthelp          1.0.2
sphinxcontrib-serializinghtml 1.1.3
sqlparse                      0.3.0
srsly                         0.2.0
tensorboardX                  1.9
thinc                         7.0.8
torch                         1.2.0
torchvision                   0.4.0a0+6b959ee
tqdm                          4.38.0
Unidecode                     1.1.1
urllib3                       1.25.7
wasabi                        0.4.0
wcwidth                       0.1.7
Werkzeug                      0.16.0
wheel                         0.33.6
word2number                   1.1
zipp                          0.6.0

How to use my dataset

I want to use my own dataset, which consists of *.ann and *.txt files. How can I convert it into the input format this project expects? Is there a tool for this? I would appreciate any advice.
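One possible starting point, sketched under assumptions: the *.ann files are taken to be brat standoff annotations with text-bound lines of the form `T1<TAB>Label start end<TAB>text`, and the target is a one-document JSON object with `doc_key`, `sentences`, and `ner` keys, where entity spans are inclusive token indices. The helper names, the regex tokenizer, and the choice to treat the whole text as a single sentence are illustrative and not part of the repository; other fields the data reader may require are not covered.

```python
import json
import re


def tokenize(text):
    """Split text into (token, start_char, end_char) triples."""
    return [(m.group(), m.start(), m.end()) for m in re.finditer(r"\w+|[^\w\s]", text)]


def parse_ann(ann_text):
    """Read brat text-bound annotations: 'T1<TAB>Label start end<TAB>surface'."""
    entities = []
    for line in ann_text.splitlines():
        if not line.startswith("T"):
            continue  # relations, events, and notes are skipped in this sketch
        _, info, _ = line.split("\t", 2)
        if ";" in info:
            continue  # discontinuous spans are not handled here
        label, start, end = info.split()
        entities.append((label, int(start), int(end)))
    return entities


def to_doc(doc_key, text, ann_text):
    """Build one document dict with token-level entity spans."""
    tokens = tokenize(text)
    ner = []
    for label, start, end in parse_ann(ann_text):
        # Tokens lying entirely inside the annotated character span.
        covered = [i for i, (_, s, e) in enumerate(tokens) if s >= start and e <= end]
        if covered:
            ner.append([covered[0], covered[-1], label])
    return {
        "doc_key": doc_key,
        "sentences": [[tok for tok, _, _ in tokens]],
        "ner": [ner],
    }


text = "Aspirin inhibits COX-1."
ann = "T1\tChemical 0 7\tAspirin\nT2\tGene 17 22\tCOX-1\n"
doc = to_doc("example", text, ann)
print(json.dumps(doc))
```

A real conversion would also need sentence splitting and the same tokenizer that the rest of the pipeline uses, since the span indices depend on both.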

ScispaCy vs. Stanford NLP tokenization with SciERC model

Hi,

I'm trying to apply the SciERC pre-trained model to an unlabeled dataset of abstracts from plant science papers. I used the following command line to format my data:

python scripts/new-dataset/format_new_dataset.py ../knowledge-graph/data/first_manuscript_data/clustering_pipeline_output/JA_GA_chosen_abstracts/ ../knowledge-graph/data/first_manuscript_data/dygiepp/prepped_data/dygiepp_formatted_data_SciERC.jsonl scierc

where the directory JA_GA_chosen_abstracts contains a .txt file for each abstract.

I was then successfully able to run the pre-trained SciERC model on this data. However, when looking at the results, I noticed I was getting a lot of entities that were either a single round bracket, a single hyphen, or a word followed or preceded by a hyphen. When I looked more closely at the tokenized sentences in the preprocessed jsonl file, it was clear that this is because the spaCy tokenizer in ./scripts/new-dataset/format_new_dataset.py splits hyphenated words, and leaves parentheses/brackets as-is.

However, when I looked at the processed SciERC json files, it looks like they were tokenized with PTB3 token transforms ("(" becomes "-LRB-", etc.), and without splitting hyphenated words. A cursory search suggests this tokenization may have been done with the Stanford NLP tokenizer, because it offers options to use PTB3 transforms and to not split hyphenated words.

I checked out the webpage where the processed SciERC dataset is pulled from, and skimmed the paper and the repo, but didn't see anything that indicated how the dataset was tokenized. I was wondering if you knew what tokenizer had been used on the SciERC data, and if you thought it would be better to use the same tokenization scheme on new datasets to get better performance with the pre-trained model. If it turns out it was done with the Stanford NLP tokenizer, I'd be more than happy to open a PR adding an option to use that tokenizer in format_new_dataset.py.

Thanks!
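As a rough illustration of the mismatch described above, the sketch below post-processes already-split tokens so that they look more like the processed SciERC files: bracket characters are mapped to their PTB3 names and tokens split around an intra-word hyphen are rejoined. It assumes each token comes with a flag saying whether whitespace follows it (with spaCy this could be derived from `token.text` and `token.whitespace_`). The function names are hypothetical, and whether this reproduces the original SciERC preprocessing is not confirmed.

```python
# PTB3 names for bracket characters.
PTB_BRACKETS = {
    "(": "-LRB-", ")": "-RRB-",
    "[": "-LSB-", "]": "-RSB-",
    "{": "-LCB-", "}": "-RCB-",
}


def merge_hyphenated(tokens):
    """Rejoin tokens around a hyphen when no whitespace separates them.

    `tokens` is a list of (text, followed_by_space) pairs.
    """
    merged = []
    for text, space in tokens:
        prev_has_no_space = bool(merged) and not merged[-1][1]
        if prev_has_no_space and (text == "-" or merged[-1][0].endswith("-")):
            prev_text, _ = merged[-1]
            merged[-1] = (prev_text + text, space)
        else:
            merged.append((text, space))
    return merged


def normalize(tokens):
    """Merge hyphenated words, then apply the PTB3 bracket mapping."""
    return [PTB_BRACKETS.get(text, text) for text, _ in merge_hyphenated(tokens)]


example = [
    ("jasmonate", False), ("-", False), ("responsive", True),
    ("genes", True), ("(", False), ("JA", False), (")", False),
]
print(normalize(example))
```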

some questions about the codes in

Out of Memory

Hi,

I am training on a document of 40 sentences, some of which are long.
Do you have any advice on lowering GPU memory use (e.g. batch size or sentence length)? I cannot find the relevant parameters, as I am not familiar with allennlp.

I ran out of memory on a 32 GB Nvidia GPU.

Thanks
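One workaround that is sometimes tried for this kind of problem is to split a long document into several shorter ones before training. The sketch below assumes that memory use grows with the size of a single document and that `ner` spans use document-level, inclusive token indices that must be shifted for each chunk. The function name is hypothetical; `relations`, `events`, and `clusters` would need the same re-indexing, and coreference links across chunks would be lost.

```python
def split_document(doc, max_sentences):
    """Split one document dict into chunks of at most `max_sentences` sentences."""
    chunks = []
    offset = 0  # number of tokens that precede the current chunk
    for start in range(0, len(doc["sentences"]), max_sentences):
        sentences = doc["sentences"][start:start + max_sentences]
        # Shift every span so that indices restart at 0 within the chunk.
        ner = [
            [[s - offset, e - offset, label] for s, e, label in sent_ner]
            for sent_ner in doc["ner"][start:start + max_sentences]
        ]
        chunks.append({
            "doc_key": f"{doc['doc_key']}_{start // max_sentences}",
            "sentences": sentences,
            "ner": ner,
        })
        offset += sum(len(sentence) for sentence in sentences)
    return chunks


doc = {
    "doc_key": "doc",
    "sentences": [["a", "b"], ["c"], ["d", "e"]],
    "ner": [[[0, 1, "X"]], [], [[3, 4, "Y"]]],
}
chunks = split_document(doc, max_sentences=2)
print(chunks)
```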

Train ChemProt model problem

I ran into a problem when training the ChemProt model.
When I run "bash ./scripts/train/train_chemprot.sh -1", I get the following output:

2020-05-27 14:17:59,641 - INFO - allennlp.training.optimizers - Number of trainable parameters: 123007840
2020-05-27 14:17:59,642 - INFO - allennlp.common.params - trainer.optimizer.infer_type_and_cast = True
2020-05-27 14:17:59,642 - INFO - allennlp.common.params - Converting Params object to dict; logging of default values will not occur when dictionary parameters are used subsequently.
2020-05-27 14:17:59,642 - INFO - allennlp.common.params - CURRENTLY DEFINED PARAMETERS:
2020-05-27 14:17:59,642 - INFO - allennlp.common.params - trainer.optimizer.lr = 0.001
2020-05-27 14:17:59,642 - INFO - allennlp.common.params - trainer.optimizer.t_total = 10000
2020-05-27 14:17:59,642 - INFO - allennlp.common.params - trainer.optimizer.warmup = 0.1
2020-05-27 14:17:59,642 - INFO - allennlp.common.params - trainer.optimizer.weight_decay = 0
2020-05-27 14:17:59,642 - INFO - allennlp.common.registrable - instantiating registered subclass bert_adam of <class 'allennlp.training.optimizers.Optimizer'>
2020-05-27 14:17:59,642 - INFO - allennlp.common.params - trainer.learning_rate_scheduler.type = reduce_on_plateau
2020-05-27 14:17:59,642 - INFO - allennlp.common.registrable - instantiating registered subclass reduce_on_plateau of <class 'allennlp.training.learning_rate_schedulers.learning_rate_scheduler.LearningRateScheduler'>
2020-05-27 14:17:59,642 - INFO - allennlp.common.params - Converting Params object to dict; logging of default values will not occur when dictionary parameters are used subsequently.
2020-05-27 14:17:59,642 - INFO - allennlp.common.params - CURRENTLY DEFINED PARAMETERS:
2020-05-27 14:17:59,642 - INFO - allennlp.common.params - trainer.learning_rate_scheduler.factor = 0.5
2020-05-27 14:17:59,642 - INFO - allennlp.common.params - trainer.learning_rate_scheduler.mode = max
2020-05-27 14:17:59,642 - INFO - allennlp.common.params - trainer.learning_rate_scheduler.patience = 4
2020-05-27 14:17:59,642 - INFO - allennlp.common.params - trainer.num_serialized_models_to_keep = 3
2020-05-27 14:17:59,642 - INFO - allennlp.common.params - trainer.keep_serialized_model_every_num_seconds = None
2020-05-27 14:17:59,643 - INFO - allennlp.common.params - trainer.model_save_interval = None
2020-05-27 14:17:59,643 - INFO - allennlp.common.params - trainer.summary_interval = 100
2020-05-27 14:17:59,643 - INFO - allennlp.common.params - trainer.histogram_interval = None
2020-05-27 14:17:59,643 - INFO - allennlp.common.params - trainer.should_log_parameter_statistics = True
2020-05-27 14:17:59,643 - INFO - allennlp.common.params - trainer.should_log_learning_rate = False
2020-05-27 14:17:59,643 - INFO - allennlp.common.params - trainer.log_batch_size_period = None
2020-05-27 14:17:59,806 - INFO - allennlp.training.trainer - Beginning training.
2020-05-27 14:17:59,806 - INFO - allennlp.training.trainer - Epoch 0/249
2020-05-27 14:17:59,806 - INFO - allennlp.training.trainer - Peak CPU memory usage MB: 2505.212
2020-05-27 14:17:59,891 - INFO - allennlp.training.trainer - Training
0%| | 0/1299 [00:00<?, ?it/s]./scripts/train/train_chemprot.sh: line 18: 6835 Killed ie_train_data_path=$data_root/training.jsonl ie_dev_data_path=$data_root/development.jsonl ie_test_data_path=$data_root/test.jsonl cuda_device=$cuda_device allennlp train $config_file --cache-directory $data_root/cached --serialization-dir ./models/$experiment_name --include-package dygie

I think the process was killed because the machine has too little memory. I would like to hear your advice.
My cloud server:
2 vCPUs | 4 GB

Finetune pretrained model using labelled data

Hi,

I emailed you earlier regarding this. To make this more official, I was wondering if you had any suggestions on how to finetune a pretrained model.

I.e. I have a set of annotated articles. I would love to use this data to finetune the ace-event model.

Thanks!

Event extraction training without Named Entities or Relations

My custom dataset has ACE-like event annotations with triggers and arguments but no NER or relations.
I tried running the ACE training pipeline on my event data and left the relations and ner keys empty, e.g.:

{
"doc_key": "aal00",
"sentences": [
    ["American", "Airlines", "Up", "on", "Record", "April", "Traffic", ",", "Upbeat", "Q2", "View"],
    ["Premier", "passenger", "carrier", ",", "American", "Airlines", "Group", "Inc", ".", "AAL", "saw", "its", "shares", "rise", "4.76", "%", "to", "$", "47.08", "at", "the", "close", "of", "business", "on", "Apr", "9", ",", "following", "the", "release", "of", "its", "traffic", "report", "for", "the", "month", "of", "April", "."]
],
"events": [
    [ ],
    [
        [[24, "SecurityValue"], [25, 26, "IncreaseAmount"], [23, 23, "Security"], [30, 37, "TIME"], [28, 29, "Price"]],
        [[45, "FinancialReport"], [43, 43, "Reportee"], [46, 50, "TIME"]]
    ]
],
"ner": [
    [ ],
    [ ]
],
"relations": [
    [ ],
    [ ]
],
"clusters": [
    [ ],
    [ ]
]
}

I based my .jsonnet config on train_ace05_event.jsonnet, changing
n_trigger_labels to the number of event types and setting n_ner_labels: 0, because I have no NER annotations.
Those were all the relevant config keys I could identify from the file itself. I probably missed some, because training failed with the following error:

2020-08-06 14:18:05,035 - INFO - allennlp.training.trainer - Training
  0%|          | 0/365 [00:00<?, ?it/s]
Traceback (most recent call last):
  File "/opt/conda/envs/dygiepp/bin/allennlp", line 8, in <module>
    sys.exit(run())
  File "/opt/conda/envs/dygiepp/lib/python3.7/site-packages/allennlp/run.py", line 18, in run
    main(prog="allennlp")
  File "/opt/conda/envs/dygiepp/lib/python3.7/site-packages/allennlp/commands/__init__.py", line 102, in main
    args.func(args)
  File "/opt/conda/envs/dygiepp/lib/python3.7/site-packages/allennlp/commands/train.py", line 124, in train_model_from_args
    args.cache_prefix)
  File "/opt/conda/envs/dygiepp/lib/python3.7/site-packages/allennlp/commands/train.py", line 168, in train_model_from_file
    cache_directory, cache_prefix)
  File "/opt/conda/envs/dygiepp/lib/python3.7/site-packages/allennlp/commands/train.py", line 252, in train_model
    metrics = trainer.train()
  File "/opt/conda/envs/dygiepp/lib/python3.7/site-packages/allennlp/training/trainer.py", line 478, in train
    train_metrics = self._train_epoch(epoch)
  File "/opt/conda/envs/dygiepp/lib/python3.7/site-packages/allennlp/training/trainer.py", line 320, in _train_epoch
    loss = self.batch_loss(batch_group, for_training=True)
  File "/opt/conda/envs/dygiepp/lib/python3.7/site-packages/allennlp/training/trainer.py", line 261, in batch_loss
    output_dict = self.model(**batch)
  File "/opt/conda/envs/dygiepp/lib/python3.7/site-packages/torch/nn/modules/module.py", line 547, in __call__
    result = self.forward(*input, **kwargs)
  File "./dygie/models/dygie.py", line 298, in forward
    ner_labels, metadata)
  File "/opt/conda/envs/dygiepp/lib/python3.7/site-packages/torch/nn/modules/module.py", line 547, in __call__
    result = self.forward(*input, **kwargs)
  File "./dygie/models/events.py", line 267, in forward
    trig_arg_embeddings, top_trig_scores, top_arg_scores, top_arg_mask)
  File "./dygie/models/events.py", line 553, in _compute_argument_scores
    embeddings_flat = pairwise_embeddings.view(-1, feature_dim)
RuntimeError: shape '[-1, 2642]' is invalid for input of size 17840250

Is it possible to train Event extraction with only events (trigger + arguments) without NER and relation annotations?
Am I missing a config key here? I suspect the number of argument types has to be set somewhere, but the config for this is not obvious.
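A quick arithmetic check on the numbers in the error message may help narrow this down. `view(-1, feature_dim)` can only succeed when the total number of elements is a multiple of `feature_dim`. The variable names below are only for illustration.

```python
numel = 17840250      # input size reported in the RuntimeError
expected_dim = 2642   # feature_dim passed to view(-1, feature_dim)

print(divmod(numel, expected_dim))      # (6752, 1466): a remainder, so the view fails
print(divmod(numel, expected_dim + 1))  # (6750, 0): divides evenly
```

The numbers are consistent with the pairwise embeddings having a last dimension of 2643, one more than the model expects. Whether that extra unit comes from setting n_ner_labels: 0 is a guess that the traceback alone does not confirm.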

KeyError: 'None__ner_labels' when predicting on new dataset.

I get KeyError: 'None__ner_labels' when I try to use dygiepp to predict on a new dataset. The full traceback is:

Traceback (most recent call last):
  File "/home/wangpancheng/anaconda3/envs/dygiepp/bin/allennlp", line 8, in <module>
    sys.exit(run())
  File "/home/wangpancheng/anaconda3/envs/dygiepp/lib/python3.7/site-packages/allennlp/__main__.py", line 34, in run
    main(prog="allennlp")
  File "/home/wangpancheng/anaconda3/envs/dygiepp/lib/python3.7/site-packages/allennlp/commands/__init__.py", line 118, in main
    args.func(args)
  File "/home/wangpancheng/anaconda3/envs/dygiepp/lib/python3.7/site-packages/allennlp/commands/predict.py", line 220, in _predict
    manager.run()
  File "/home/wangpancheng/anaconda3/envs/dygiepp/lib/python3.7/site-packages/allennlp/commands/predict.py", line 187, in run
    for model_input_instance, result in zip(batch, self._predict_instances(batch)):
  File "/home/wangpancheng/anaconda3/envs/dygiepp/lib/python3.7/site-packages/allennlp/commands/predict.py", line 146, in _predict_instances
    results = [self._predictor.predict_instance(batch_data[0])]
  File "/serverData2/wangpancheng/wpc/dygiepp/dygie/predictors/dygie.py", line 56, in predict_instance
    prediction = model.make_output_human_readable(model(**model_input)).to_json()
  File "/home/wangpancheng/anaconda3/envs/dygiepp/lib/python3.7/site-packages/torch/nn/modules/module.py", line 727, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/serverData2/wangpancheng/wpc/dygiepp/dygie/models/dygie.py", line 239, in forward
    spans, span_mask, span_embeddings, sentence_lengths, ner_labels, metadata)
  File "/home/wangpancheng/anaconda3/envs/dygiepp/lib/python3.7/site-packages/torch/nn/modules/module.py", line 727, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/serverData2/wangpancheng/wpc/dygiepp/dygie/models/ner.py", line 90, in forward
    scorer = self._ner_scorers[self._active_namespace]
  File "/home/wangpancheng/anaconda3/envs/dygiepp/lib/python3.7/site-packages/torch/nn/modules/container.py", line 286, in __getitem__
    return self._modules[key]
KeyError: 'None__ner_labels'
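The `None` prefix in the key suggests that the namespace is built from a per-document dataset name that is missing from the input. A minimal sketch, assuming the data reader takes that name from a top-level `dataset` field and that it has to match the name the model was trained with; the helper and the example value `scierc` are illustrative.

```python
import json


def add_dataset_field(jsonl_lines, dataset_name):
    """Return JSON lines in which every document carries a `dataset` field."""
    out = []
    for line in jsonl_lines:
        if not line.strip():
            continue  # skip blank lines
        doc = json.loads(line)
        doc.setdefault("dataset", dataset_name)  # keep an existing value untouched
        out.append(json.dumps(doc))
    return out


lines = ['{"doc_key": "doc1", "sentences": [["A", "test", "."]]}']
fixed = add_dataset_field(lines, "scierc")
print(fixed[0])
```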

Missing events config template

Hi,
thank you very much for providing your source code!
I know that training on ACE events is a work in progress, but if you could push your local config template, we could try to figure out how to train in the meantime.

using dygiepp for other languages than english

Hi,

I would like to use dygiepp for Dutch.
This means I need embeddings other than the English BERT. There is a Dutch version, 'Bertje'.
I would be grateful for a pointer on where to start adapting your scripts, or allennlp, to use this Dutch version.

Thanks a lot

ACE05 Chinese and Arabic dataset processing

Currently, the data preprocessing step only processes the English data. Is it possible to process the Chinese and Arabic portions of the ACE05 dataset?

Deploy as a REST service

Hello there!
Can you recommend a way to deploy DyGIE++ as a REST API?
Thanks,
NR
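For illustration only, here is a bare-bones shape such a service could take using Python's standard library. `predict_document` is a placeholder that returns empty predictions; in a real deployment it would be replaced by a call into the loaded model, which is not shown here. All names, the port, and the response layout are assumptions.

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer


def predict_document(doc):
    """Placeholder for the real model call (hypothetical)."""
    return {
        "doc_key": doc.get("doc_key"),
        "predicted_ner": [[] for _ in doc.get("sentences", [])],
    }


def handle_body(body):
    """Turn a raw request body into (status_code, response_bytes)."""
    try:
        doc = json.loads(body)
    except ValueError:  # covers malformed JSON and undecodable bytes
        return 400, b'{"error": "invalid JSON"}'
    if not isinstance(doc, dict):
        return 400, b'{"error": "expected a JSON object"}'
    return 200, json.dumps(predict_document(doc)).encode("utf-8")


class PredictHandler(BaseHTTPRequestHandler):
    def do_POST(self):
        length = int(self.headers.get("Content-Length", 0))
        status, payload = handle_body(self.rfile.read(length))
        self.send_response(status)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(payload)))
        self.end_headers()
        self.wfile.write(payload)


def serve(host="127.0.0.1", port=8000):
    """Start the blocking HTTP server."""
    HTTPServer((host, port), PredictHandler).serve_forever()
```

Calling `serve()` starts the server; a client would then POST one document as a JSON object. Keeping the request logic in `handle_body` makes it testable without opening a socket.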

How to make inferences from the pretrained model

I have not yet dived deep into the model code. I ran allennlp predict as described in the docs, and it printed the inferences in the terminal. Is there a way to predict relations, events, etc. from the shell?
A demo in Jupyter Notebook will be very useful.

Training with `ace05_best_ner_bert.jsonnet` returns error: `'str' object has no attribute 'get_lr'`

Using the following code:

import json
import sys

from allennlp.commands import main

overrides = json.dumps({"trainer": {"cuda_device": -1}})
runs = [
    ( "../ace05_best_relation_bert.jsonnet", "../data/relation" )
]

for config_file, serialization_dir in runs:

    sys.argv = [
        "allennlp",  
        "train",
        config_file,
        "-s", serialization_dir,
        "--include-package", "dygie",
        "--include-package", "ie_json",
        "-o", overrides,
    ]

    main()

Attempting to run ace05_best_ner_bert.jsonnet returns the error:

I'm using the versions of allennlp and torch suggested in requirements.txt:

Branch #allennlp-v1 not usable

I am testing the branch #allennlp-v1 to run the ACE-Event training code which has some code issues regarding imports.

 (dygiepp) root@0a587743b213:/dygiepp# rm -rf ./models/ace05-event; bash ./scripts/train/train_ace05_event.sh 0
2020-08-03 08:58:29,323 - INFO - transformers.file_utils - PyTorch version 1.5.1 available.
Traceback (most recent call last):
  File "/opt/conda/envs/dygiepp/bin/allennlp", line 8, in <module>
    sys.exit(run())
  File "/opt/conda/envs/dygiepp/lib/python3.7/site-packages/allennlp/__main__.py", line 19, in run
    main(prog="allennlp")
  File "/opt/conda/envs/dygiepp/lib/python3.7/site-packages/allennlp/commands/__init__.py", line 91, in main
    import_module_and_submodules(package_name)
  File "/opt/conda/envs/dygiepp/lib/python3.7/site-packages/allennlp/common/util.py", line 351, in import_module_and_submodules
    import_module_and_submodules(subpackage)
  File "/opt/conda/envs/dygiepp/lib/python3.7/site-packages/allennlp/common/util.py", line 340, in import_module_and_submodules
    module = importlib.import_module(package_name)
  File "/opt/conda/envs/dygiepp/lib/python3.7/importlib/__init__.py", line 127, in import_module
    return _bootstrap._gcd_import(name[level:], package, level)
  File "<frozen importlib._bootstrap>", line 1006, in _gcd_import
  File "<frozen importlib._bootstrap>", line 983, in _find_and_load
  File "<frozen importlib._bootstrap>", line 967, in _find_and_load_unlocked
  File "<frozen importlib._bootstrap>", line 677, in _load_unlocked
  File "<frozen importlib._bootstrap_external>", line 728, in exec_module
  File "<frozen importlib._bootstrap>", line 219, in _call_with_frames_removed
  File "/dygiepp/dygie/data/__init__.py", line 3, in <module>
    from dygie.data.iterators.batch_iterator import BatchIterator
  File "/dygiepp/dygie/data/iterators/batch_iterator.py", line 10, in <module>
    from allennlp.data.dataloader import DataLoader, PyTorchDataLoader
ImportError: cannot import name 'PyTorchDataLoader' from 'allennlp.data.dataloader' (/opt/conda/envs/dygiepp/lib/python3.7/site-packages/allennlp/data/dataloader.py)
(dygiepp) root@0a587743b213:/dygiepp# python -c "from allennlp.data.dataloader import PyTorchDataLoader"
Traceback (most recent call last):
  File "<string>", line 1, in <module>
ImportError: cannot import name 'PyTorchDataLoader' from 'allennlp.data.dataloader' (/opt/conda/envs/dygiepp/lib/python3.7/site-packages/allennlp/data/dataloader.py)

I also tested the branch two weeks ago at commit 07074fd and encountered multiple issues with sen_dict() missing in the dataloader code, indicating refactoring is still ongoing.

Is the allennlp-v1 branch meant to be used now or will it be merged to master when ready to use?
I am planning on making adjustments to dygiepp to run on my own ERE-like event dataset, and the upgraded dependencies would make Apex usable, which is a big plus for me.

In any case, thanks for making DYGIE++ source available and maintaining it!

Question about the number of documents of ACE2005 event extraction

Thanks for making a comprehensive summarization of Event Extraction pre-processing.

I am confused by the number of documents in each split. The files dev.filelist, test.filelist and train.filelist have 28, 40, and 529 lines, which makes 597 documents in total, but Table 8 in the Appendix reports 599 documents.

KeyError: 'NewsDNA__argument_labels'

I tried to train and test with a new dataset that I formatted like the ACE event data.

However, I keep getting the following error:

  File "/home/thierry/repos/dygiepp-dev/dygie/models/events.py", line 177, in forward
    mention_pruner = self._mention_pruners[self._active_namespaces["argument"]]
  File "/home/thierry/miniconda3/lib/python3.8/site-packages/torch/nn/modules/container.py", line 286, in __getitem__
    return self._modules[key]

I assume I should specify the labels used somewhere?

Thanks for helping

Coreference propagation on ACE05

In the paper you mention that, since ACE does not have coreference annotations, you use OntoNotes for coreference propagation. I was wondering if it is possible to predict coreference on ACE using the current trained model?

ACE05 data preprocess problem: get_token_of

Hi, when I ran the preprocessing script parse_ace_event.py, I got "Exception: Should not get here" from the get_token_of function. I found that a character index could not be matched to a token index, which may be due to the tokenization. Have you ever run into this problem, and how should I fix it?

Model tests are failing because of missing `span_emb_dim` key

All model tests are failing with the message: allennlp.common.checks.ConfigurationError: 'key "span_emb_dim" is required at location "coref."'

(dygie) ~/ml/dygiepp/dygie(debugging) $ python -mpytest tests
================================================= test session starts ==================================================
platform linux -- Python 3.7.5, pytest-5.2.2, py-1.8.0, pluggy-0.13.0
rootdir: /home/konrad/ml/dygiepp/dygie, inifile: pytest.ini
plugins: flaky-3.6.1
collected 15 items                                                                                                     

tests/data/ie_json_test.py ........                                                                              [ 53%]
tests/models/coref_test.py F                                                                                     [ 60%]
tests/models/dygie_test.py F                                                                                     [ 66%]
tests/models/relation_test.py FFFFF                                                                              [100%]
[...]

How to run my dataset

I ran into a problem when training on my own dataset:

2020-06-11 21:22:09,219 - INFO - allennlp.training.trainer - Beginning training.
2020-06-11 21:22:09,219 - INFO - allennlp.training.trainer - Epoch 0/249
2020-06-11 21:22:09,219 - INFO - allennlp.training.trainer - Peak CPU memory usage MB: 3337.832
2020-06-11 21:22:09,275 - INFO - allennlp.training.trainer - Training
0%| | 0/1990 [00:03<?, ?it/s]
Traceback (most recent call last):
  File "/root/miniconda3/envs/dygiepp/bin/allennlp", line 8, in <module>
    sys.exit(run())
  File "/root/miniconda3/envs/dygiepp/lib/python3.7/site-packages/allennlp/run.py", line 18, in run
    main(prog="allennlp")
  File "/root/miniconda3/envs/dygiepp/lib/python3.7/site-packages/allennlp/commands/__init__.py", line 102, in main
    args.func(args)
  File "/root/miniconda3/envs/dygiepp/lib/python3.7/site-packages/allennlp/commands/train.py", line 124, in train_model_from_args
    args.cache_prefix)
  File "/root/miniconda3/envs/dygiepp/lib/python3.7/site-packages/allennlp/commands/train.py", line 168, in train_model_from_file
    cache_directory, cache_prefix)
  File "/root/miniconda3/envs/dygiepp/lib/python3.7/site-packages/allennlp/commands/train.py", line 252, in train_model
    metrics = trainer.train()
  File "/root/miniconda3/envs/dygiepp/lib/python3.7/site-packages/allennlp/training/trainer.py", line 478, in train
    train_metrics = self._train_epoch(epoch)
  File "/root/miniconda3/envs/dygiepp/lib/python3.7/site-packages/allennlp/training/trainer.py", line 320, in _train_epoch
    loss = self.batch_loss(batch_group, for_training=True)
  File "/root/miniconda3/envs/dygiepp/lib/python3.7/site-packages/allennlp/training/trainer.py", line 261, in batch_loss
    output_dict = self.model(**batch)
  File "/root/miniconda3/envs/dygiepp/lib/python3.7/site-packages/torch/nn/modules/module.py", line 547, in __call__
    result = self.forward(*input, **kwargs)
  File "./dygie/models/dygie.py", line 280, in forward
    output_relation = self._relation.predict_labels(relation_labels, output_relation, metadata)
  File "./dygie/models/relation.py", line 193, in predict_labels
    predictions = self.decode(output_dict)["decoded_relations_dict"]
  File "./dygie/models/relation.py", line 218, in decode
    top_spans, predicted_relations, num_spans_to_keep)
  File "./dygie/models/relation.py", line 249, in _decode_sentence
    label_name = self.vocab.get_token_from_index(label, namespace="relation_labels")
  File "/root/miniconda3/envs/dygiepp/lib/python3.7/site-packages/allennlp/data/vocabulary.py", line 644, in get_token_from_index
    return self._index_to_token[namespace][index]
KeyError: 0

Failed building wheel for jsonnet

Running "pip install -r requirements.txt" failed:

Downloading http://mirrors.tencentyun.com/pypi/packages/62/1e/a94a8d635fa3ce4cfc7f506003548d0a2447ae76fd5ca53932970fe3053f/pyasn1-0.4.8-py2.py3-none-any.whl (77 kB)
|████████████████████████████████| 77 kB 691 kB/s
Building wheels for collected packages: en-core-sci-sm, jsonnet
Building wheel for en-core-sci-sm (setup.py) ... done
Created wheel for en-core-sci-sm: filename=en_core_sci_sm-0.2.3-py3-none-any.whl size=16230213 sha256=a78b379ac4f2ce377645373d63a8b5419e2e63bf9b9f02db5e4f107a84930d27
Stored in directory: /root/.cache/pip/wheels/7c/c8/0d/0db35734344a895a1a329527f24538c21f1442878b8448d7d4
Building wheel for jsonnet (setup.py) ... error
ERROR: Command errored out with exit status 1:
command: /root/miniconda3/envs/dygiepp/bin/python -u -c 'import sys, setuptools, tokenize; sys.argv[0] = '"'"'/tmp/pip-install-fout_ucm/jsonnet/setup.py'"'"'; file='"'"'/tmp/pip-install-fout_ucm/jsonnet/setup.py'"'"';f=getattr(tokenize, '"'"'open'"'"', open)(file);code=f.read().replace('"'"'\r\n'"'"', '"'"'\n'"'"');f.close();exec(compile(code, file, '"'"'exec'"'"'))' bdist_wheel -d /tmp/pip-wheel-py44whdj
cwd: /tmp/pip-install-fout_ucm/jsonnet/
Complete output (33 lines):
running bdist_wheel
running build
running build_ext
g++ -c -g -O3 -Wall -Wextra -Woverloaded-virtual -pedantic -std=c++0x -fPIC -Iinclude -Ithird_party/md5 -Ithird_party/json core/desugarer.cpp -o core/desugarer.o
make: g++: Command not found
make: *** [core/desugarer.o] Error 127
Traceback (most recent call last):
File "<string>", line 1, in <module>
File "/tmp/pip-install-fout_ucm/jsonnet/setup.py", line 75, in <module>
test_suite="python._jsonnet_test",
File "/root/miniconda3/envs/dygiepp/lib/python3.7/site-packages/setuptools/__init__.py", line 144, in setup
return distutils.core.setup(**attrs)
File "/root/miniconda3/envs/dygiepp/lib/python3.7/distutils/core.py", line 148, in setup
dist.run_commands()
File "/root/miniconda3/envs/dygiepp/lib/python3.7/distutils/dist.py", line 966, in run_commands
self.run_command(cmd)
File "/root/miniconda3/envs/dygiepp/lib/python3.7/distutils/dist.py", line 985, in run_command
cmd_obj.run()
File "/root/miniconda3/envs/dygiepp/lib/python3.7/site-packages/wheel/bdist_wheel.py", line 223, in run
self.run_command('build')
File "/root/miniconda3/envs/dygiepp/lib/python3.7/distutils/cmd.py", line 313, in run_command
self.distribution.run_command(command)
File "/root/miniconda3/envs/dygiepp/lib/python3.7/distutils/dist.py", line 985, in run_command
cmd_obj.run()
File "/root/miniconda3/envs/dygiepp/lib/python3.7/distutils/command/build.py", line 135, in run
self.run_command(cmd_name)
File "/root/miniconda3/envs/dygiepp/lib/python3.7/distutils/cmd.py", line 313, in run_command
self.distribution.run_command(command)
File "/root/miniconda3/envs/dygiepp/lib/python3.7/distutils/dist.py", line 985, in run_command
cmd_obj.run()
File "/tmp/pip-install-fout_ucm/jsonnet/setup.py", line 54, in run
raise Exception('Could not build %s' % (', '.join(LIB_OBJECTS)))
Exception: Could not build core/desugarer.o, core/formatter.o, core/libjsonnet.o, core/lexer.o, core/parser.o, core/pass.o, core/static_analysis.o, core/string_utils.o, core/vm.o, third_party/md5/md5.o

ERROR: Failed building wheel for jsonnet
Running setup.py clean for jsonnet
Successfully built en-core-sci-sm
Failed to build jsonnet
Installing collected packages: unidecode, itsdangerous, Werkzeug, flask, flask-cors, parsimonious, sqlparse, jmespath, botocore, s3transfer, boto3, pytorch-pretrained-bert, jsonnet, h5py, overrides, conllu, threadpoolctl, scipy, scikit-learn, editdistance, greenlet, gevent, protobuf, tensorboardX, wcwidth, ftfy, zipp, importlib-metadata, jsonpickle, blis, cymem, preshed, murmurhash, plac, srsly, wasabi, thinc, spacy, sentencepiece, pytorch-transformers, responses, attrs, py, more-itertools, pluggy, pytest, allennlp, pandas, soupsieve, beautifulsoup4, lxml, python-Levenshtein, PyYAML, pyasn1, rsa, colorama, awscli, pybind11, psutil, nmslib, scispacy, en-core-sci-sm
Running setup.py install for jsonnet ... error
ERROR: Command errored out with exit status 1:
command: /root/miniconda3/envs/dygiepp/bin/python -u -c 'import sys, setuptools, tokenize; sys.argv[0] = '"'"'/tmp/pip-install-fout_ucm/jsonnet/setup.py'"'"'; file='"'"'/tmp/pip-install-fout_ucm/jsonnet/setup.py'"'"';f=getattr(tokenize, '"'"'open'"'"', open)(file);code=f.read().replace('"'"'\r\n'"'"', '"'"'\n'"'"');f.close();exec(compile(code, file, '"'"'exec'"'"'))' install --record /tmp/pip-record-nukhhlr0/install-record.txt --single-version-externally-managed --compile --install-headers /root/miniconda3/envs/dygiepp/include/python3.7m/jsonnet
cwd: /tmp/pip-install-fout_ucm/jsonnet/
Complete output (35 lines):
running install
running build
running build_ext
g++ -c -g -O3 -Wall -Wextra -Woverloaded-virtual -pedantic -std=c++0x -fPIC -Iinclude -Ithird_party/md5 -Ithird_party/json core/desugarer.cpp -o core/desugarer.o
make: g++: Command not found
make: *** [core/desugarer.o] Error 127
Traceback (most recent call last):
File "<string>", line 1, in <module>
File "/tmp/pip-install-fout_ucm/jsonnet/setup.py", line 75, in <module>
test_suite="python._jsonnet_test",
File "/root/miniconda3/envs/dygiepp/lib/python3.7/site-packages/setuptools/__init__.py", line 144, in setup
return distutils.core.setup(**attrs)
File "/root/miniconda3/envs/dygiepp/lib/python3.7/distutils/core.py", line 148, in setup
dist.run_commands()
File "/root/miniconda3/envs/dygiepp/lib/python3.7/distutils/dist.py", line 966, in run_commands
self.run_command(cmd)
File "/root/miniconda3/envs/dygiepp/lib/python3.7/distutils/dist.py", line 985, in run_command
cmd_obj.run()
File "/root/miniconda3/envs/dygiepp/lib/python3.7/site-packages/setuptools/command/install.py", line 61, in run
return orig.install.run(self)
File "/root/miniconda3/envs/dygiepp/lib/python3.7/distutils/command/install.py", line 545, in run
self.run_command('build')
File "/root/miniconda3/envs/dygiepp/lib/python3.7/distutils/cmd.py", line 313, in run_command
self.distribution.run_command(command)
File "/root/miniconda3/envs/dygiepp/lib/python3.7/distutils/dist.py", line 985, in run_command
cmd_obj.run()
File "/root/miniconda3/envs/dygiepp/lib/python3.7/distutils/command/build.py", line 135, in run
self.run_command(cmd_name)
File "/root/miniconda3/envs/dygiepp/lib/python3.7/distutils/cmd.py", line 313, in run_command
self.distribution.run_command(command)
File "/root/miniconda3/envs/dygiepp/lib/python3.7/distutils/dist.py", line 985, in run_command
cmd_obj.run()
File "/tmp/pip-install-fout_ucm/jsonnet/setup.py", line 54, in run
raise Exception('Could not build %s' % (', '.join(LIB_OBJECTS)))
Exception: Could not build core/desugarer.o, core/formatter.o, core/libjsonnet.o, core/lexer.o, core/parser.o, core/pass.o, core/static_analysis.o, core/string_utils.o, core/vm.o, third_party/md5/md5.o
----------------------------------------
ERROR: Command errored out with exit status 1: /root/miniconda3/envs/dygiepp/bin/python -u -c 'import sys, setuptools, tokenize; sys.argv[0] = '"'"'/tmp/pip-install-fout_ucm/jsonnet/setup.py'"'"'; file='"'"'/tmp/pip-install-fout_ucm/jsonnet/setup.py'"'"';f=getattr(tokenize, '"'"'open'"'"', open)(file);code=f.read().replace('"'"'\r\n'"'"', '"'"'\n'"'"');f.close();exec(compile(code, file, '"'"'exec'"'"'))' install --record /tmp/pip-record-nukhhlr0/install-record.txt --single-version-externally-managed --compile --install-headers /root/miniconda3/envs/dygiepp/include/python3.7m/jsonnet Check the logs for full command output.
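
The log above stops at "make: g++: Command not found", which indicates that the jsonnet wheel is being compiled from source and no C++ compiler is on the PATH. The following is a minimal sketch, assuming a Unix-like environment, of a check that could be run before "pip install"; the helper name missing_build_tools is hypothetical and not part of DyGIE++.

```python
import shutil


def missing_build_tools(tools=("g++", "make")):
    """Return the subset of `tools` that cannot be found on PATH."""
    return [tool for tool in tools if shutil.which(tool) is None]


if __name__ == "__main__":
    missing = missing_build_tools()
    if missing:
        print("Missing build tools:", ", ".join(missing))
    else:
        print("g++ and make were found on PATH.")
```

If g++ is reported missing, installing the distribution's compiler package (for example build-essential on Debian/Ubuntu) and re-running the install is the usual next step.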

question about param _n_labels

Why is the output dimension in TimeDistributed(torch.nn.Linear(mention_feedforward.get_output_dim(), self._n_labels - 1)) set to self._n_labels - 1? Shouldn't it be self._n_labels?
Also, what is the purpose of dummy_scores in ner.py?
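
One plausible reading of the two lines in question, offered as a sketch rather than a statement of the author's intent: the linear layer scores only the real labels (n_labels - 1 of them), and a column of zeros is prepended as a fixed score for the null label, so each span ends up with n_labels scores. The example uses plain Python lists instead of tensors, and the function names and score values are hypothetical.

```python
import math


def add_null_label_score(label_scores):
    """Prepend a fixed 0.0 score for the null label to each span's scores.

    `label_scores` holds, per span, the n_labels - 1 scores that a linear
    layer would produce for the real labels (hypothetical values here).
    """
    return [[0.0] + list(span_scores) for span_scores in label_scores]


def softmax(scores):
    """Numerically stable softmax over one span's scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


# Two spans, three real labels each, so n_labels = 4 including the null label.
raw_scores = [[2.0, -1.0, 0.5], [-3.0, -2.0, -4.0]]
full_scores = add_null_label_score(raw_scores)

for span_scores in full_scores:
    probs = softmax(span_scores)
    predicted = max(range(len(probs)), key=probs.__getitem__)
    print(len(span_scores), predicted)
```

In the second span every real-label score is negative, so the fixed zero wins and the span is assigned the null label (index 0 in this sketch).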

ace05_event.jsonnet embeddings shape size error

First, thanks for all your great work!

I've put my own labeled data in the DyGIE++ format, following the instructions in https://github.com/dwadden/dygiepp/blob/master/DATA.md. I was able to train the relation extractor using the ace05_best_relation_bert.jsonnet config, with quite good performance.

When I next tried to train for Events using ace05_event.jsonnet, however, I ran into the following error:

Following the comments in https://github.com/dwadden/dygiepp/blob/master/training_config/template_dw.libsonnet#L88, I then made a copy of ace05_event.jsonnet, changing the n_trigger_labels and n_ner_labels values to reflect my own dataset counts, ie:

ace05_event_copy.jsonnet

  n_trigger_labels: 28,    // prev: 34
  n_ner_labels: 103,       // prev: 8

However, I'm still getting the error RuntimeError: shape '[-1, 2745]' is invalid for input of size 5601840 in https://github.com/dwadden/dygiepp/blob/master/dygie/models/events.py#L553.
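
The numbers in the error message can be checked with a little arithmetic. The sketch below only looks at divisibility; the helper name check_view is hypothetical, and the result is a hint about where to look, not a diagnosis.

```python
def check_view(total_elements, last_dim):
    """Return (rows, remainder) for reshaping `total_elements` into [-1, last_dim]."""
    return divmod(total_elements, last_dim)


total = 5601840   # input size reported in the error
expected = 2745   # last dimension the model tried to view the tensor as

print(check_view(total, expected))      # → (2040, 2040): non-zero remainder, so the view fails
print(check_view(total, expected + 1))  # → (2040, 0): divides evenly
```

Since 5601840 = 2040 × 2746, one factorization consistent with the error is a tensor whose last dimension is 2746, one more than the 2745 the model expects. That would fit a label count in the config being off by one, although other factorizations are possible.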

Using a Fine Tuned SciBERT

I wanted to replace the pre-trained SciBERT model with a fine-tuned SciBERT model. I did the language model fine-tuning by following this notebook: https://github.com/Nikoschenk/language_model_finetuning/blob/master/scibert_fine_tuner.ipynb.

It uses the HuggingFace library for the fine-tuning. The resulting HuggingFace model has these components:

```
config.json
pytorch_model.bin
special_tokens_map.json
tokenizer_config.json
vocab.txt
```
I noticed that when I run DyGIE++'s get_scibert.py script, it downloads the PyTorch model as follows:
scibert_scivocab_cased/weights.tar.gz
scibert_scivocab_cased/vocab.txt

Further, weights.tar.gz contains pytorch_model.bin and bert_config.json.

I repackaged the HuggingFace outputs into a new weights.tar.gz (pytorch_model.bin, plus config.json renamed to bert_config.json).

It works fine.

I just wanted a second opinion on this approach. Please advise.
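
The repackaging step described above can be sketched with the standard library. This is a minimal illustration of that description, not a script from the repository; the function name repackage_weights and any directory names passed to it are hypothetical.

```python
import tarfile
from pathlib import Path


def repackage_weights(hf_dir, out_path):
    """Pack a HuggingFace model directory into a weights.tar.gz archive.

    The archive holds pytorch_model.bin unchanged and config.json stored
    under the name bert_config.json, as described in the post above.
    """
    hf_dir = Path(hf_dir)
    with tarfile.open(out_path, "w:gz") as archive:
        archive.add(str(hf_dir / "pytorch_model.bin"), arcname="pytorch_model.bin")
        archive.add(str(hf_dir / "config.json"), arcname="bert_config.json")
    return out_path
```

The vocab.txt file sits next to weights.tar.gz in the layout shown above, so it would be copied separately rather than added to the archive.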

KeyError: "'' not found in vocab namespace 'scierc__ner_labels

I am trying to run the pretrained SciERC model on the preprocessed SciERC dataset, using the following command:

allennlp predict pretrained/scierc.tar.gz \
    data/processed_data/json/test.json \
    --predictor dygie \
    --include-package dygie \
    --use-dataset-reader \
    --output-file predictions/scierc-test.jsonl \
    --cuda-device -1 \
    --silent

I run into the following error:

Traceback (most recent call last):
File "/home/yelman/anaconda3/envs/dygiepp/lib/python3.7/site-packages/allennlp/data/vocabulary.py", line 724, in get_token_index
return self._token_to_index[namespace][token]
KeyError: ''

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
File "/home/yelman/anaconda3/envs/dygiepp/lib/python3.7/site-packages/allennlp/data/vocabulary.py", line 727, in get_token_index
return self._token_to_index[namespace][self._oov_token]
KeyError: '@@unknown@@'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
File "/home/yelman/anaconda3/envs/dygiepp/bin/allennlp", line 8, in <module>
sys.exit(run())
File "/home/yelman/anaconda3/envs/dygiepp/lib/python3.7/site-packages/allennlp/__main__.py", line 34, in run
main(prog="allennlp")
File "/home/yelman/anaconda3/envs/dygiepp/lib/python3.7/site-packages/allennlp/commands/__init__.py", line 119, in main
args.func(args)
File "/home/yelman/anaconda3/envs/dygiepp/lib/python3.7/site-packages/allennlp/commands/predict.py", line 224, in _predict
predictor = _get_predictor(args)
File "/home/yelman/anaconda3/envs/dygiepp/lib/python3.7/site-packages/allennlp/commands/predict.py", line 119, in _get_predictor
overrides=args.overrides,
File "/home/yelman/anaconda3/envs/dygiepp/lib/python3.7/site-packages/allennlp/models/archival.py", line 208, in load_archive
model = _load_model(config.duplicate(), weights_path, serialization_dir, cuda_device)
File "/home/yelman/anaconda3/envs/dygiepp/lib/python3.7/site-packages/allennlp/models/archival.py", line 246, in _load_model
cuda_device=cuda_device,
File "/home/yelman/anaconda3/envs/dygiepp/lib/python3.7/site-packages/allennlp/models/model.py", line 406, in load
return model_class._load(config, serialization_dir, weights_file, cuda_device)
File "/home/yelman/anaconda3/envs/dygiepp/lib/python3.7/site-packages/allennlp/models/model.py", line 305, in _load
vocab=vocab, params=model_params, serialization_dir=serialization_dir
File "/home/yelman/anaconda3/envs/dygiepp/lib/python3.7/site-packages/allennlp/common/from_params.py", line 604, in from_params
**extras,
File "/home/yelman/anaconda3/envs/dygiepp/lib/python3.7/site-packages/allennlp/common/from_params.py", line 634, in from_params
return constructor_to_call(**kwargs) # type: ignore
File "/home/yelman/Desktop/dygiepp-master/dygie/models/dygie.py", line 111, in init
params=modules.pop("ner"))
File "/home/yelman/anaconda3/envs/dygiepp/lib/python3.7/site-packages/allennlp/common/from_params.py", line 634, in from_params
return constructor_to_call(**kwargs) # type: ignore
File "/home/yelman/Desktop/dygiepp-master/dygie/models/ner.py", line 50, in init
null_label = vocab.get_token_index("", namespace)
File "/home/yelman/anaconda3/envs/dygiepp/lib/python3.7/site-packages/allennlp/data/vocabulary.py", line 732, in get_token_index
f"'{token}' not found in vocab namespace '{namespace}', and namespace "
KeyError: "'' not found in vocab namespace 'scierc__ner_labels', and namespace does not contain the default OOV token ('@@unknown@@')"

Trying to apply the pre-trained ACE05-Event model to new data

Hi there! We're trying to use the pre-trained ACE05-Event model for a project on new data. However, we've been struggling to get it to run. To test it out, we tried the following on a line of data from the SciERC dataset and got the error below.

!echo '{"events": [[], [], [], [], []], "clusters": [[[6, 17], [32, 32]], [[4, 4], [55, 55], [91, 91]], [[58, 62], [64, 64], [79, 79]]], "sentences": [["This", "paper", "presents", "an", "algorithm", "for", "computing", "optical", "flow", ",", "shape", ",", "motion", ",", "lighting", ",", "and", "albedo", "from", "an", "image", "sequence", "of", "a", "rigidly-moving", "Lambertian", "object", "under", "distant", "illumination", "."], ["The", "problem", "is", "formulated", "in", "a", "manner", "that", "subsumes", "structure", "from", "motion", ",", "multi-view", "stereo", ",", "and", "photo-metric", "stereo", "as", "special", "cases", "."], ["The", "algorithm", "utilizes", "both", "spatial", "and", "temporal", "intensity", "variation", "as", "cues", ":", "the", "former", "constrains", "flow", "and", "the", "latter", "constrains", "surface", "orientation", ";", "combining", "both", "cues", "enables", "dense", "reconstruction", "of", "both", "textured", "and", "texture-less", "surfaces", "."], ["The", "algorithm", "works", "by", "iteratively", "estimating", "affine", "camera", "parameters", ",", "illumination", ",", "shape", ",", "and", "albedo", "in", "an", "alternating", "fashion", "."], ["Results", "are", "demonstrated", "on", "videos", "of", "hand-held", "objects", "moving", "in", "front", "of", "a", "fixed", "light", "and", "camera", "."]], "ner": [[[4, 4, "Generic"], [6, 17, "Task"], [20, 21, "Material"], [24, 26, "Material"], [28, 29, "OtherScientificTerm"]], [[32, 32, "Generic"], [42, 42, "Material"], [44, 45, "Material"], [48, 49, "Material"]], [[55, 55, "Generic"], [58, 62, "OtherScientificTerm"], [64, 64, "Generic"], [67, 67, "Generic"], [69, 69, "OtherScientificTerm"], [72, 72, "Generic"], [74, 75, "OtherScientificTerm"], [79, 79, "Generic"], [81, 88, "Task"]], [[91, 91, "Generic"], [95, 105, "Method"]], [[115, 118, "Material"]]], "relations": [[[4, 4, 6, 17, "USED-FOR"], [20, 21, 4, 4, "USED-FOR"], [24, 26, 20, 21, "FEATURE-OF"], [28, 29, 24, 26, 
"FEATURE-OF"]], [[42, 42, 44, 45, "CONJUNCTION"], [44, 45, 48, 49, "CONJUNCTION"]], [[58, 62, 55, 55, "USED-FOR"], [67, 67, 64, 64, "HYPONYM-OF"], [67, 67, 69, 69, "USED-FOR"], [67, 67, 72, 72, "CONJUNCTION"], [72, 72, 64, 64, "HYPONYM-OF"], [72, 72, 74, 75, "USED-FOR"], [79, 79, 81, 88, "USED-FOR"]], [[95, 105, 91, 91, "USED-FOR"]], []], "doc_key": "ICCV_2003_158_abs"}' > scierc-test.jsonl
!allennlp predict pretrained/ace05-event.tar.gz \
    scierc-test.jsonl \
    --predictor dygie \
    --include-package dygie \
    --use-dataset-reader \
    --output-file predictions/scierc-test.jsonl

This is the error:

2020-06-09 22:19:37,555 - ERROR - allennlp.data.vocabulary - Namespace: ner_labels
2020-06-09 22:19:37,555 - ERROR - allennlp.data.vocabulary - Token: Generic
Traceback (most recent call last):
  File "/usr/local/bin/allennlp", line 8, in <module>
    sys.exit(run())
  File "/usr/local/lib/python3.6/dist-packages/allennlp/run.py", line 18, in run
    main(prog="allennlp")
  File "/usr/local/lib/python3.6/dist-packages/allennlp/commands/__init__.py", line 102, in main
    args.func(args)
  File "/usr/local/lib/python3.6/dist-packages/allennlp/commands/predict.py", line 227, in _predict
    manager.run()
  File "/usr/local/lib/python3.6/dist-packages/allennlp/commands/predict.py", line 201, in run
    for model_input_instance, result in zip(batch, self._predict_instances(batch)):
  File "/usr/local/lib/python3.6/dist-packages/allennlp/commands/predict.py", line 159, in _predict_instances
    results = [self._predictor.predict_instance(batch_data[0])]
  File "./dygie/predictors/dygie.py", line 81, in predict_instance
    dataset.index_instances(model.vocab)
  File "/usr/local/lib/python3.6/dist-packages/allennlp/data/dataset.py", line 155, in index_instances
    instance.index_fields(vocab)
  File "/usr/local/lib/python3.6/dist-packages/allennlp/data/instance.py", line 72, in index_fields
    field.index(vocab)
  File "/usr/local/lib/python3.6/dist-packages/allennlp/data/fields/sequence_label_field.py", line 100, in index
    for label in self.labels]
  File "/usr/local/lib/python3.6/dist-packages/allennlp/data/fields/sequence_label_field.py", line 100, in <listcomp>
    for label in self.labels]
  File "/usr/local/lib/python3.6/dist-packages/allennlp/data/vocabulary.py", line 637, in get_token_index
    return self._token_to_index[namespace][self._oov_token]
KeyError: '@@UNKNOWN@@'

We aren't super familiar with allennlp, so we aren't sure how to diagnose this issue. We're not sure why there is an issue with the NER_labels as we're hoping to just predict those, rather than train the model on them. Is there a way to override this?

Thanks a lot!
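
The traceback above fails while indexing the gold label "Generic" in the ner_labels namespace, i.e. while reading the annotations that came with the SciERC line, not while predicting. One workaround to try is to drop the gold-label fields from the input before running prediction. The sketch below is a minimal illustration; whether the dataset reader accepts documents without these fields depends on the DyGIE++ version and is an assumption here, and the function names are hypothetical.

```python
import json

# Annotation fields that carry gold labels in the example document above.
LABEL_FIELDS = ("ner", "relations", "events", "clusters")


def strip_gold_labels(doc):
    """Return a copy of `doc` without the gold-label fields."""
    return {key: value for key, value in doc.items() if key not in LABEL_FIELDS}


def strip_file(in_path, out_path):
    """Rewrite a JSON-lines file, dropping gold labels from every document."""
    with open(in_path) as src, open(out_path, "w") as dst:
        for line in src:
            if line.strip():
                dst.write(json.dumps(strip_gold_labels(json.loads(line))) + "\n")
```

The stripped file keeps only doc_key and sentences for the example above, so no SciERC label has to be looked up in the ACE05 vocabulary.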

./models/scierc/model.tar.gz not found

When I run the command bash ./scripts/train/train_genia.sh 0, I get this error. Where can I get model.tar.gz?

Is this repository suitable for keyword extraction and Chinese word segmentation?

Hi,

I think span representation is a great idea. Do you think the span representation is suitable for keyword extraction and Chinese word segmentation?

Thanks!

Multi-token triggers for event extraction

I want to run the event extraction pipeline on my own dataset of ACE/ERE-like events.
The events have multi-token and discontinuous trigger spans.

I have not yet dived deep into the model code, but was hoping for some ideas how to adapt it to multi-token spans.

Is it possible with the current approach to model multi-token triggers?
Is it possible to model discontinuous triggers?
Which functions in the code are prime candidates for adaptation?

For now, I will parse and use the head token of the trigger annotations, but due to the discriminative and content-rich nature of my multi-token triggers I expect it to hurt trigger classification.

Thank you for making this code available and supporting it. Having a SotA event extraction system available really helps my research.

Dockerfile expects script that was removed

It looks like scripts/pretrained/get_scibert.py was removed in this commit, but the Dockerfile still expects it

$ docker build .

420100K .......... .......... .......... .......... .......... 99% 2.89M 0s
420150K .......... .......... .......... .......... .......... 99% 4.39M 0s
420200K .......... ...                                        100%  108M=4m20s

2021-05-13 18:44:49 (1.58 MB/s) - ‘./pretrained/mechanic-granular.tar.gz’ saved [430298612/430298612]


Removing intermediate container 2024d2013fad
 ---> b44f296f2ce3
Step 19/27 : COPY scripts/pretrained/get_scibert.py /tmp/get_scibert.py
COPY failed: stat /var/lib/docker/tmp/docker-builder787936846/scripts/pretrained/get_scibert.py: no such file or directory

I understand that the Dockerfile isn't officially supported; just wanna log this here in case someone encounters the same issue

Cannot disable NER when training events

I tried setting loss_weights: ner: to 0, but training fails with:

2020-08-14 08:45:43,915 - INFO - allennlp.training.trainer - Training
  0%|          | 0/183 [00:04<?, ?it/s]
Traceback (most recent call last):
  File "/opt/conda/envs/dygiepp/bin/allennlp", line 8, in <module>
    sys.exit(run())
  File "/opt/conda/envs/dygiepp/lib/python3.7/site-packages/allennlp/run.py", line 18, in run
    main(prog="allennlp")
  File "/opt/conda/envs/dygiepp/lib/python3.7/site-packages/allennlp/commands/__init__.py", line 102, in main
    args.func(args)
  File "/opt/conda/envs/dygiepp/lib/python3.7/site-packages/allennlp/commands/train.py", line 124, in train_model_from_args
    args.cache_prefix)
  File "/opt/conda/envs/dygiepp/lib/python3.7/site-packages/allennlp/commands/train.py", line 168, in train_model_from_file
    cache_directory, cache_prefix)
  File "/opt/conda/envs/dygiepp/lib/python3.7/site-packages/allennlp/commands/train.py", line 252, in train_model
    metrics = trainer.train()
  File "/opt/conda/envs/dygiepp/lib/python3.7/site-packages/allennlp/training/trainer.py", line 478, in train
    train_metrics = self._train_epoch(epoch)
  File "/opt/conda/envs/dygiepp/lib/python3.7/site-packages/allennlp/training/trainer.py", line 320, in _train_epoch
    loss = self.batch_loss(batch_group, for_training=True)
  File "/opt/conda/envs/dygiepp/lib/python3.7/site-packages/allennlp/training/trainer.py", line 256, in batch_loss
    output_dict = training_util.data_parallel(batch_group, self.model, self._cuda_devices)
  File "/opt/conda/envs/dygiepp/lib/python3.7/site-packages/allennlp/training/util.py", line 331, in data_parallel
    outputs = parallel_apply(replicas, inputs, moved, used_device_ids)
  File "/opt/conda/envs/dygiepp/lib/python3.7/site-packages/torch/nn/parallel/parallel_apply.py", line 85, in parallel_apply
    output.reraise()
  File "/opt/conda/envs/dygiepp/lib/python3.7/site-packages/torch/_utils.py", line 369, in reraise
    raise self.exc_type(msg)
KeyError: Caught KeyError in replica 0 on device 2.
Original Traceback (most recent call last):
  File "/opt/conda/envs/dygiepp/lib/python3.7/site-packages/torch/nn/parallel/parallel_apply.py", line 60, in _worker
    output = module(*input, **kwargs)
  File "/opt/conda/envs/dygiepp/lib/python3.7/site-packages/torch/nn/modules/module.py", line 547, in __call__
    result = self.forward(*input, **kwargs)
  File "./dygie/models/dygie.py", line 298, in forward
    ner_labels, metadata)
  File "/opt/conda/envs/dygiepp/lib/python3.7/site-packages/torch/nn/modules/module.py", line 547, in __call__
    result = self.forward(*input, **kwargs)
  File "./dygie/models/events.py", line 183, in forward
    ner_scores = output_ner["ner_scores"]
KeyError: 'ner_scores'

It seems that in events.py the ner_scores are not optional right now, as they are passed as a required argument to the _mention_pruner function.

For now I am testing the pipeline by producing NER predictions from the pre-trained ACE05-event model on my custom data and feeding those into the model as silver-standard reference NER labels.

I also set event_args_use_ner_labels to false. Is this required, or are the NER labels referred to here model predictions rather than gold-standard labels?
(I get higher F1 for TrigC and ArgC when setting this to false while training with NER labels predicted by the pre-trained ACE05 model.)

ValueError: The following unexpected fields should be prefixed with an underscore: sentence_start.

Hi,

This is the first time I have used dygiepp, so I am quite inexperienced.

I used the processed ace-event dataset. When starting the training, I get the error:

ValueError: The following unexpected fields should be prefixed with an underscore: sentence_start.

One of the fields in the .json files I used is "sentence_start".

In document.py I find:

    "Make sure we only have allowed fields."
    allowed_field_regex = ("doc_key|dataset|sentences|weight|.*ner$|"
                           ".*relations$|.*clusters$|.*events$|^_.*")

Prefixing sentence_start with an underscore yields another error message.

Thanks for helping me out
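
The regex quoted above can be used to check a document's top-level fields before training. This is a minimal sketch; matching with re.match (from the start of the key) is an assumption about how document.py applies the pattern, and the helper name unexpected_fields is hypothetical.

```python
import re

# Regex quoted from document.py in the post above.
ALLOWED_FIELD_REGEX = ("doc_key|dataset|sentences|weight|.*ner$|"
                       ".*relations$|.*clusters$|.*events$|^_.*")


def unexpected_fields(doc):
    """List the top-level keys of `doc` that the regex does not allow."""
    return [key for key in doc if re.match(ALLOWED_FIELD_REGEX, key) is None]


doc = {"doc_key": "example", "sentences": [["a"]], "sentence_start": [0]}
print(unexpected_fields(doc))  # → ['sentence_start']
```

A key such as _sentence_start passes this check because of the final ^_.* alternative, so the second error mentioned in the post presumably comes from a later step.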

dataset format

I have used the code to preprocess the ACE05 dataset. I am wondering if you could explain the output format.

I got 351/80/80 lines in train/dev/test json files. One line from the train.json file looks like below.

{"sentences": [["CNN_ENG_20030515_073019", ".7"], ["NEWS", "STORY"], ["2003-05-15", "09:52:27"], ["earlier", "we", "talk", "about", "a", "new", "book", "claiming", "that", "president", "john", "kennedy", "had", "an", "affair", "with", "a", "white", "house", "intern", "early", "1960s", "."], ["a", "kennedy", "biographer", "robert", "dallek", "came", "across", "the", "story", "while", "doing", "research", "."], ["the", "woman", "'s", "name", "has", "remain", "a", "mystery", "."], ["the", "60-year-old", "tells", "the", "new", "york", "daily", "news", "and", "others", "dpa", "she", "is", "glad", "to", "have", "the", "weight", "she", "'s", "been", "carrying", "for", "41", "years", "now", "off", "her", "shoulders", "."], ["she", "says", "she", "was", "19", "at", "the", "time", ",", "working", "in", "d.c", ".", "at", "the", "white", "house", ",", "1962", ",", "1963", ",", "she", "says", "today", "the", "allegations", "about", "her", "affair", "are", "the", "truth", "."], ["right", "now", "she", "lives", "on", "the", "upper", "east", "side", ",", "works", "at", "a", "presbyterian", "church", ",", "has", "two", "married", "daughters", "and", "after", "the", "news", "is", "out", ",", "after", "carrying", "it", "for", "41", "years", ",", "she", "feels", "better", "about", "it", "."], ["this", "news", "breaking", "just", "today", "."], ["robert", "dallek", "tried", "to", "do", "the", "research", "to", "track", "this", "woman", "down", "in", "the", "book", ",", "it", "did", "not", "happen", "but", "now", "she", "has", "indeed", "come", "forward", "."], ["2003-05-15", "09:53:14"]], "ner": [[], [], [], [[7, 7, "PER"], [16, 17, "PER"], [15, 15, "PER"], [25, 25, "PER"], [23, 24, "ORG"]], [[30, 30, "PER"], [31, 31, "PER"], [32, 33, "PER"]], [[43, 43, "PER"]], [[52, 52, "PER"], [62, 62, "PER"], [69, 69, "PER"], [78, 78, "PER"], [55, 58, "ORG"], [60, 60, "ORG"]], [[92, 92, "GPE"], [83, 83, "PER"], [103, 103, "PER"], [109, 109, "PER"], [81, 81, "PER"], [96, 97, "ORG"]], [[121, 123, 
"LOC"], [134, 134, "PER"], [129, 129, "ORG"], [117, 117, "PER"], [149, 149, "PER"]], [], [[171, 171, "PER"], [183, 183, "PER"], [161, 162, "PER"]], []], "relations": [[], [], [], [[16, 17, 25, 25, "PER-SOC"], [25, 25, 23, 24, "ORG-AFF"]], [], [], [], [[83, 83, 96, 97, "ORG-AFF"], [81, 81, 92, 92, "PHYS"]], [[117, 117, 121, 123, "GEN-AFF"], [117, 117, 129, 129, "ORG-AFF"], [117, 117, 134, 134, "PER-SOC"]], [], [], []], "clusters": [], "doc_key": "CNN_ENG_20030515_073019.7"}
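
Reading the example line: "sentences" is a list of tokenized sentences, "ner" holds one list per sentence of [start, end, label] entries, and "relations" holds [start1, end1, start2, end2, label] entries. The indices appear to be document-level token offsets with inclusive ends. The sketch below checks that reading against the first four sentences of the line above; the helper names are hypothetical.

```python
def flatten(sentences):
    """Concatenate tokenized sentences into one document-level token list."""
    return [token for sentence in sentences for token in sentence]


def span_text(tokens, start, end):
    """Text of the span [start, end], with `end` inclusive."""
    return " ".join(tokens[start:end + 1])


# First four sentences and the fourth sentence's annotations from the line above.
sentences = [
    ["CNN_ENG_20030515_073019", ".7"],
    ["NEWS", "STORY"],
    ["2003-05-15", "09:52:27"],
    ["earlier", "we", "talk", "about", "a", "new", "book", "claiming", "that",
     "president", "john", "kennedy", "had", "an", "affair", "with", "a",
     "white", "house", "intern", "early", "1960s", "."],
]
ner = [[7, 7, "PER"], [16, 17, "PER"], [15, 15, "PER"], [25, 25, "PER"], [23, 24, "ORG"]]
relations = [[16, 17, 25, 25, "PER-SOC"], [25, 25, 23, 24, "ORG-AFF"]]

tokens = flatten(sentences)
for start, end, label in ner:
    print(label, span_text(tokens, start, end))
for s1, e1, s2, e2, label in relations:
    print(label, span_text(tokens, s1, e1), "->", span_text(tokens, s2, e2))
```

Under this reading, [16, 17, "PER"] resolves to "john kennedy" and [23, 24, "ORG"] to "white house", which matches the text of the fourth sentence.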

AssertionError: No super class method found for "_instances_from_cache_file"

Hi,

Thank you for the excellent work. I followed the steps to set up the environment and installed all the requirements exactly as described. I want to download the SciERC dataset but ran into the following problem.
I ran exactly these commands:

conda create --name dygiepp python=3.7
pip install -r requirements.txt
conda develop .
bash ./scripts/data/get_scierc.sh

This is the problem I have:

Traceback (most recent call last):
  File "scripts/data/shared/normalize.py", line 5, in <module>
    from dygie.data.dataset_readers.document import Document, Dataset
  File "/ldap_shared/home/v_yuchen_zeng/696ds/dygiepp-master/dygie/data/__init__.py", line 1, in <module>
    from dygie.data.dataset_readers.dygie import DyGIEReader
  File "/ldap_shared/home/v_yuchen_zeng/696ds/dygiepp-master/dygie/data/dataset_readers/dygie.py", line 29, in <module>
    class DyGIEReader(DatasetReader):
  File "/ldap_shared/home/v_yuchen_zeng/696ds/dygiepp-master/dygie/data/dataset_readers/dygie.py", line 202, in DyGIEReader
    @overrides
  File "/ldap_shared/home/v_yuchen_zeng/anaconda3/envs/dygiepp/lib/python3.7/site-packages/overrides/overrides.py", line 67, in overrides
    raise AssertionError('No super class method found for "%s"' % method.__name__)
AssertionError: No super class method found for "_instances_from_cache_file"
Traceback (most recent call last):
  File "scripts/data/shared/collate.py", line 4, in <module>
    from dygie.data.dataset_readers import document
  File "/ldap_shared/home/v_yuchen_zeng/696ds/dygiepp-master/dygie/data/__init__.py", line 1, in <module>
    from dygie.data.dataset_readers.dygie import DyGIEReader
  File "/ldap_shared/home/v_yuchen_zeng/696ds/dygiepp-master/dygie/data/dataset_readers/dygie.py", line 29, in <module>
    class DyGIEReader(DatasetReader):
  File "/ldap_shared/home/v_yuchen_zeng/696ds/dygiepp-master/dygie/data/dataset_readers/dygie.py", line 202, in DyGIEReader
    @overrides
  File "/ldap_shared/home/v_yuchen_zeng/anaconda3/envs/dygiepp/lib/python3.7/site-packages/overrides/overrides.py", line 67, in overrides
    raise AssertionError('No super class method found for "%s"' % method.__name__)
AssertionError: No super class method found for "_instances_from_cache_file"

I tried the emnlp-2019 branch but still got the same problem. Have you encountered this problem before? Thanks!

KeyError: "'' not found in vocab namespace 'scierc__ner_labels

I'm going to close this for lack of activity, feel free to reopen if not resolved.

Originally posted by @dwadden in #59 (comment)

Hi, thanks for your response. I was able to bypass the error last time and extract entities on my custom dataset (I wasn't able to run it on the pre-existing test.json). I am trying to run it again on a different dataset, but the error persists. I created a new environment and installed the dependencies from scratch using the updated requirements.txt. The stack trace is the same as above.

License?

Thank you for making the code available! It would be very useful to us, but could you maybe add a license?

Pre-processing help needed!

Hello,

Thanks for the great work! I really like it and appreciate posting your work here.

My end goal is to predict the relationship between dose groups (drugs) and adverse events using your model. For example, the text will contain sentences such as: group 3 showed decrease in food consumption. group 2 had inflammation and increase in alopecia. Group 4 exhibited sudden heart attack.

Here, the entities would be the dose groups and the adverse events, and the relations would be one of increase, decrease, or present (adverse event present).

The raw data that I have is only partially labeled: for each sentence I have the entity and relation labels, but not the indices of where these entities and relations start and end within the sentence (like you mention here).

Can you please help me with how to pre-process my raw data so that it can be used as input to your model?

Also, once the data is preprocessed, do you think it will work if I train the ChemProt model on my new training dataset? I found that ChemProt is the closest domain/model to my dataset.

Thank you so much in advance and I hope you reply back :)
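
Editor's note: the missing-indices problem described above can be sketched in a few lines of Python. This is only a minimal illustration, not the maintainers' preprocessing code. It assumes whitespace tokenization, exact (case-insensitive) matching of each entity string, and a JSON layout with `doc_key`, `sentences`, `ner`, and `relations` fields using inclusive token indices; please check the repository's data documentation for the exact fields your version expects. The labels `DoseGroup` and `AdverseEvent` and the helper names are hypothetical.

```python
import json
from typing import Dict, List, Optional, Tuple


def find_token_span(tokens: List[str], phrase: List[str]) -> Optional[Tuple[int, int]]:
    """Return inclusive (start, end) token indices of the first match, or None."""
    n = len(phrase)
    if n == 0:
        return None
    lowered = [t.lower() for t in tokens]
    target = [t.lower() for t in phrase]
    for start in range(len(tokens) - n + 1):
        if lowered[start:start + n] == target:
            return start, start + n - 1
    return None


def build_document(doc_key: str, tokens: List[str],
                   entities: List[Tuple[str, str]],
                   relations: List[Tuple[str, str, str]]) -> Dict:
    """Build a one-sentence document (assumed layout, see the note above).

    With a single sentence, sentence-level and document-level token
    indices coincide; multi-sentence documents would need an offset.
    """
    spans = {}
    ner = []
    for text, label in entities:
        span = find_token_span(tokens, text.split())
        if span is None:
            raise ValueError(f"entity {text!r} not found in sentence")
        spans[text] = span
        ner.append([span[0], span[1], label])
    rels = []
    for head, tail, label in relations:
        (h0, h1), (t0, t1) = spans[head], spans[tail]
        rels.append([h0, h1, t0, t1, label])
    return {"doc_key": doc_key, "sentences": [tokens],
            "ner": [ner], "relations": [rels]}


tokens = "group 3 showed decrease in food consumption .".split()
doc = build_document(
    "example-0",  # hypothetical document key
    tokens,
    entities=[("group 3", "DoseGroup"), ("food consumption", "AdverseEvent")],
    relations=[("group 3", "food consumption", "decrease")],
)
print(json.dumps(doc))
```

Exact string matching will miss entities whose surface form differs from the label (plurals, abbreviations), so those cases would still need manual review or fuzzy matching.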

Keyword extraction from scientific abstracts

Hi! I'm interested in using dygiepp for automatically generating keywords from scientific abstracts, and testing how it performs compared to Textacy algorithms for keyword extraction (for context, see Mini-Conf/Mini-Conf#34).

I'm looking to preprocess sample abstracts in https://github.com/anaerobeth/dygiepp/tree/keyword-extraction/data/miniconf/raw_data to generate suitable .txt and .ann files for use in making predictions using the pretrained scierc lightweight model. Can you provide some guidance on how to get started with this task? Thanks!

English : no such file or directory

Hi,
Where do you get the "English" file or directory for ace05?
After running get_corenlp.sh and then get_ace05.sh, I get an error:
cp: cannot stat ‘./scripts/data/ace05/common//English’: No such file or directory
run.zsh:4: no matches found: English//timex2norm/*.sgm
etc

Thanks for any help.

SciERC dataset

Hi,
In the SciERC dataset (scierc.raw.tar.gz), each abstract has three files: .txt (the raw text of the abstract), .ann (the annotations for entities, relations, and corefs), and a third file, .xml.txt, which seems to contain details including POS tags and parse tree/dependency information. What is this XML file, and how is it generated? What is the process for creating it if we want to train DyGIE++ on our own domain-specific documents?
Would appreciate a quick response, thanks,
Sundar
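
Editor's note: for the .txt/.ann pair mentioned above, a minimal sketch of reading the annotations is given below. It assumes the .ann files follow the brat standoff convention (tab-separated lines, `T` ids for text-bound entities with character offsets, `R` ids for relations with `Arg1:`/`Arg2:` references). The sample annotation string is made up for illustration and is not taken from the dataset, and the sketch does not cover the .xml.txt file or coreference lines.

```python
from typing import Dict, List, Tuple


def parse_ann(text: str) -> Tuple[Dict[str, dict], List[dict]]:
    """Parse entity (T) and relation (R) lines from brat-style standoff text."""
    entities: Dict[str, dict] = {}
    relations: List[dict] = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        fields = line.split("\t")
        ann_id = fields[0]
        if ann_id.startswith("T"):
            parts = fields[1].split()
            # For discontinuous spans ("0 5;10 15") this keeps the first
            # start and the last end, which is a simplification.
            entities[ann_id] = {
                "label": parts[0],
                "start": int(parts[1]),
                "end": int(parts[-1]),
                "text": fields[2] if len(fields) > 2 else "",
            }
        elif ann_id.startswith("R"):
            label, arg1, arg2 = fields[1].split()[:3]
            relations.append({
                "label": label,
                "head": arg1.split(":", 1)[1],
                "tail": arg2.split(":", 1)[1],
            })
    return entities, relations


# Made-up sample in the assumed format.
sample = (
    "T1\tMethod 0 11\tspan models\n"
    "T2\tTask 16 33\tentity extraction\n"
    "R1\tUSED-FOR Arg1:T1 Arg2:T2\t\n"
)
entities, relations = parse_ann(sample)
print(entities["T1"], relations[0])
```

The character offsets returned here would still need to be mapped to token indices after tokenizing the .txt file.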

I ran into a problem training the ChemProt model with "bash ./scripts/train/train_chemprot.sh -1"

I have a problem here.
