dwadden / dygiepp
Span-based system for named entity, relation, and event extraction.
License: MIT License
Hi dwadden,
thank you very much for uploading your code to GitHub.
I have succeeded in running and training your model in my environment by following your instructions.
I have just noticed that the _normalized_word method does NOT take a "self" argument.
Since the model seems to work, this is not a big deal, but I wanted to report it for your reference.
When I use ace05-relation.tar.gz to predict on my own dataset, the following error message occurs:
Downloading: 4%|▍ | 20.7M/501M [30:35<108:46:45, 1.23kB/s]
Downloading: 4%|▍ | 20.7M/501M [30:45<101:02:20, 1.32kB/s]
Downloading: 4%|▍ | 20.7M/501M [31:42<211:21:11, 632B/s]
Downloading: 4%|▍ | 20.7M/501M [33:21<388:42:18, 343B/s]
Downloading: 4%|▍ | 20.7M/501M [34:21<418:01:17, 319B/s]
Downloading: 4%|▍ | 20.8M/501M [34:46<355:17:32, 376B/s]
Downloading: 4%|▍ | 20.8M/501M [34:49<255:20:31, 523B/s]
Downloading: 4%|▍ | 20.8M/501M [34:52<185:06:53, 721B/s]
Downloading: 4%|▍ | 20.8M/501M [34:55<138:31:05, 963B/s]
Downloading: 4%|▍ | 20.8M/501M [34:59<106:28:06, 1.25kB/s]
("Connection broken: ConnectionResetError(104, 'Connection reset by peer')", ConnectionResetError(104, 'Connection reset by peer'))
2020-12-26 21:32:33,238 - INFO - allennlp.models.archival - removing temporary unarchived model dir at /tmp/tmporiove4f
Traceback (most recent call last):
File "/home/wangpancheng/anaconda3/envs/dygie_ace/lib/python3.7/site-packages/urllib3/response.py", line 438, in _error_catcher
yield
File "/home/wangpancheng/anaconda3/envs/dygie_ace/lib/python3.7/site-packages/urllib3/response.py", line 519, in read
data = self._fp.read(amt) if not fp_closed else b""
File "/home/wangpancheng/anaconda3/envs/dygie_ace/lib/python3.7/http/client.py", line 461, in read
n = self.readinto(b)
File "/home/wangpancheng/anaconda3/envs/dygie_ace/lib/python3.7/http/client.py", line 505, in readinto
n = self.fp.readinto(b)
File "/home/wangpancheng/anaconda3/envs/dygie_ace/lib/python3.7/socket.py", line 589, in readinto
return self._sock.recv_into(b)
File "/home/wangpancheng/anaconda3/envs/dygie_ace/lib/python3.7/ssl.py", line 1071, in recv_into
return self.read(nbytes, buffer)
File "/home/wangpancheng/anaconda3/envs/dygie_ace/lib/python3.7/ssl.py", line 929, in read
return self._sslobj.read(len, buffer)
ConnectionResetError: [Errno 104] Connection reset by peer
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/home/wangpancheng/anaconda3/envs/dygie_ace/lib/python3.7/site-packages/requests/models.py", line 753, in generate
for chunk in self.raw.stream(chunk_size, decode_content=True):
File "/home/wangpancheng/anaconda3/envs/dygie_ace/lib/python3.7/site-packages/urllib3/response.py", line 576, in stream
data = self.read(amt=amt, decode_content=decode_content)
File "/home/wangpancheng/anaconda3/envs/dygie_ace/lib/python3.7/site-packages/urllib3/response.py", line 541, in read
raise IncompleteRead(self._fp_bytes_read, self.length_remaining)
File "/home/wangpancheng/anaconda3/envs/dygie_ace/lib/python3.7/contextlib.py", line 130, in __exit__
self.gen.throw(type, value, traceback)
File "/home/wangpancheng/anaconda3/envs/dygie_ace/lib/python3.7/site-packages/urllib3/response.py", line 455, in _error_catcher
raise ProtocolError("Connection broken: %r" % e, e)
urllib3.exceptions.ProtocolError: ("Connection broken: ConnectionResetError(104, 'Connection reset by peer')", ConnectionResetError(104, 'Connection reset by peer'))
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/home/wangpancheng/anaconda3/envs/dygie_ace/lib/python3.7/site-packages/transformers/modeling_utils.py", line 926, in from_pretrained
local_files_only=local_files_only,
File "/home/wangpancheng/anaconda3/envs/dygie_ace/lib/python3.7/site-packages/transformers/file_utils.py", line 1007, in cached_path
local_files_only=local_files_only,
File "/home/wangpancheng/anaconda3/envs/dygie_ace/lib/python3.7/site-packages/transformers/file_utils.py", line 1216, in get_from_cache
http_get(url_to_download, temp_file, proxies=proxies, resume_size=resume_size, user_agent=user_agent)
File "/home/wangpancheng/anaconda3/envs/dygie_ace/lib/python3.7/site-packages/transformers/file_utils.py", line 1088, in http_get
for chunk in r.iter_content(chunk_size=1024):
File "/home/wangpancheng/anaconda3/envs/dygie_ace/lib/python3.7/site-packages/requests/models.py", line 756, in generate
raise ChunkedEncodingError(e)
requests.exceptions.ChunkedEncodingError: ("Connection broken: ConnectionResetError(104, 'Connection reset by peer')", ConnectionResetError(104, 'Connection reset by peer'))
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/home/wangpancheng/anaconda3/envs/dygie_ace/bin/allennlp", line 8, in <module>
sys.exit(run())
File "/home/wangpancheng/anaconda3/envs/dygie_ace/lib/python3.7/site-packages/allennlp/__main__.py", line 34, in run
main(prog="allennlp")
File "/home/wangpancheng/anaconda3/envs/dygie_ace/lib/python3.7/site-packages/allennlp/commands/__init__.py", line 118, in main
args.func(args)
File "/home/wangpancheng/anaconda3/envs/dygie_ace/lib/python3.7/site-packages/allennlp/commands/predict.py", line 205, in _predict
predictor = _get_predictor(args)
File "/home/wangpancheng/anaconda3/envs/dygie_ace/lib/python3.7/site-packages/allennlp/commands/predict.py", line 110, in _get_predictor
overrides=args.overrides,
File "/home/wangpancheng/anaconda3/envs/dygie_ace/lib/python3.7/site-packages/allennlp/models/archival.py", line 208, in load_archive
model = _load_model(config.duplicate(), weights_path, serialization_dir, cuda_device)
File "/home/wangpancheng/anaconda3/envs/dygie_ace/lib/python3.7/site-packages/allennlp/models/archival.py", line 246, in _load_model
cuda_device=cuda_device,
File "/home/wangpancheng/anaconda3/envs/dygie_ace/lib/python3.7/site-packages/allennlp/models/model.py", line 406, in load
return model_class._load(config, serialization_dir, weights_file, cuda_device)
File "/home/wangpancheng/anaconda3/envs/dygie_ace/lib/python3.7/site-packages/allennlp/models/model.py", line 305, in _load
vocab=vocab, params=model_params, serialization_dir=serialization_dir
File "/home/wangpancheng/anaconda3/envs/dygie_ace/lib/python3.7/site-packages/allennlp/common/from_params.py", line 601, in from_params
**extras,
File "/home/wangpancheng/anaconda3/envs/dygie_ace/lib/python3.7/site-packages/allennlp/common/from_params.py", line 629, in from_params
kwargs = create_kwargs(constructor_to_inspect, cls, params, **extras)
File "/home/wangpancheng/anaconda3/envs/dygie_ace/lib/python3.7/site-packages/allennlp/common/from_params.py", line 200, in create_kwargs
cls.__name__, param_name, annotation, param.default, params, **extras
File "/home/wangpancheng/anaconda3/envs/dygie_ace/lib/python3.7/site-packages/allennlp/common/from_params.py", line 307, in pop_and_construct_arg
return construct_arg(class_name, name, popped_params, annotation, default, **extras)
File "/home/wangpancheng/anaconda3/envs/dygie_ace/lib/python3.7/site-packages/allennlp/common/from_params.py", line 341, in construct_arg
return annotation.from_params(params=popped_params, **subextras)
File "/home/wangpancheng/anaconda3/envs/dygie_ace/lib/python3.7/site-packages/allennlp/common/from_params.py", line 601, in from_params
**extras,
File "/home/wangpancheng/anaconda3/envs/dygie_ace/lib/python3.7/site-packages/allennlp/common/from_params.py", line 629, in from_params
kwargs = create_kwargs(constructor_to_inspect, cls, params, **extras)
File "/home/wangpancheng/anaconda3/envs/dygie_ace/lib/python3.7/site-packages/allennlp/common/from_params.py", line 200, in create_kwargs
cls.__name__, param_name, annotation, param.default, params, **extras
File "/home/wangpancheng/anaconda3/envs/dygie_ace/lib/python3.7/site-packages/allennlp/common/from_params.py", line 307, in pop_and_construct_arg
return construct_arg(class_name, name, popped_params, annotation, default, **extras)
File "/home/wangpancheng/anaconda3/envs/dygie_ace/lib/python3.7/site-packages/allennlp/common/from_params.py", line 388, in construct_arg
**extras,
File "/home/wangpancheng/anaconda3/envs/dygie_ace/lib/python3.7/site-packages/allennlp/common/from_params.py", line 341, in construct_arg
return annotation.from_params(params=popped_params, **subextras)
File "/home/wangpancheng/anaconda3/envs/dygie_ace/lib/python3.7/site-packages/allennlp/common/from_params.py", line 601, in from_params
**extras,
File "/home/wangpancheng/anaconda3/envs/dygie_ace/lib/python3.7/site-packages/allennlp/common/from_params.py", line 631, in from_params
return constructor_to_call(**kwargs) # type: ignore
File "/home/wangpancheng/anaconda3/envs/dygie_ace/lib/python3.7/site-packages/allennlp/modules/token_embedders/pretrained_transformer_mismatched_embedder.py", line 65, in __init__
transformer_kwargs=transformer_kwargs,
File "/home/wangpancheng/anaconda3/envs/dygie_ace/lib/python3.7/site-packages/allennlp/modules/token_embedders/pretrained_transformer_embedder.py", line 79, in __init__
**(transformer_kwargs or {}),
File "/home/wangpancheng/anaconda3/envs/dygie_ace/lib/python3.7/site-packages/allennlp/common/cached_transformers.py", line 86, in get
**kwargs,
File "/home/wangpancheng/anaconda3/envs/dygie_ace/lib/python3.7/site-packages/transformers/models/auto/modeling_auto.py", line 656, in from_pretrained
pretrained_model_name_or_path, *model_args, config=config, **kwargs
File "/home/wangpancheng/anaconda3/envs/dygie_ace/lib/python3.7/site-packages/transformers/modeling_utils.py", line 935, in from_pretrained
raise EnvironmentError(msg)
OSError: Can't load weights for 'roberta-base'. Make sure that:
'roberta-base' is a correct model identifier listed on 'https://huggingface.co/models'
or 'roberta-base' is the correct path to a directory containing a file named one of pytorch_model.bin, tf_model.h5, model.ckpt.
It seems that the model tries to download the pretrained roberta-base weights at load time, but my Internet connection is too slow and the download fails.
How can I use a locally downloaded roberta-base model instead? @dwadden Thanks.
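One possible approach, sketched under assumptions: the error message above says that the model name may also be a path to a directory containing the weight file, so the name stored in the archived config could be replaced with a local path through the overrides option of allennlp predict. The config key paths used below and the directory ./roberta-base-local are assumptions, not taken from this thread; they should be checked against the config.json inside the archive.

```python
import json
from pathlib import Path

# File names taken from the error message above; any one of them is enough.
WEIGHT_FILES = ("pytorch_model.bin", "tf_model.h5", "model.ckpt")


def has_local_weights(model_dir):
    """Return True if model_dir contains one of the expected weight files."""
    path = Path(model_dir)
    return path.is_dir() and any((path / name).is_file() for name in WEIGHT_FILES)


def build_overrides(model_dir):
    """Build a JSON string that could be passed to `allennlp predict --overrides`.

    The key paths are assumptions about the archived config, not confirmed here.
    """
    overrides = {
        "dataset_reader": {"token_indexers": {"bert": {"model_name": model_dir}}},
        "model": {"embedder": {"token_embedders": {"bert": {"model_name": model_dir}}}},
    }
    return json.dumps(overrides)


if __name__ == "__main__":
    local_dir = "./roberta-base-local"  # hypothetical download location
    if has_local_weights(local_dir):
        print(build_overrides(local_dir))
    else:
        print(f"No weight file found in {local_dir}")
```

The printed JSON string would then be supplied as the value of the overrides option. The local directory would also need the tokenizer and config files that accompany the weights.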
Hi,
When are you going to release the code for training on the ACE dataset?
Thanks
Hi,
Are you planning to release the code for training a model on the WLP corpus as well?
Thanks!
Hi, I'd like to apply DyGIE++ on the Roles Across Multiple Sentences (RAMS) dataset.
In the RAMS dataset, event triggers and their arguments may be in separate sentences. For example, the trigger could be in sentence 3 while the victim and killer are in sentence 4.
But looking at data.md, it seems that the data format requires the trigger and its arguments to be in the same sentence. Is DyGIE++ capable of processing event extraction across sentences?
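To see how often this matters for a given corpus, a small check along these lines might help. It is only a sketch: it assumes documents have already been converted to a per-sentence "events" layout in which each event is a list starting with [trigger_index, label] followed by [start, end, role] arguments, all using document-level token indices. The function names and the toy document are invented for illustration.

```python
from bisect import bisect_right


def sentence_starts(sentences):
    """Return the document-level index of the first token of each sentence."""
    starts, offset = [], 0
    for sent in sentences:
        starts.append(offset)
        offset += len(sent)
    return starts


def sentence_of(token_index, starts):
    """Map a document-level token index to the index of its sentence."""
    return bisect_right(starts, token_index) - 1


def cross_sentence_arguments(doc):
    """List (trigger_label, role, trigger_sentence, argument_sentence) for
    every argument that starts outside its trigger's sentence."""
    starts = sentence_starts(doc["sentences"])
    found = []
    for events in doc["events"]:
        for event in events:
            trigger_index, trigger_label = event[0]
            trig_sent = sentence_of(trigger_index, starts)
            for start, _end, role in event[1:]:
                arg_sent = sentence_of(start, starts)
                if arg_sent != trig_sent:
                    found.append((trigger_label, role, trig_sent, arg_sent))
    return found


if __name__ == "__main__":
    doc = {  # toy document, invented for illustration
        "sentences": [["A", "bomb", "exploded", "."], ["Two", "soldiers", "died", "."]],
        "events": [[[[2, "Attack"], [5, 5, "victim"]]], []],
    }
    print(cross_sentence_arguments(doc))
```

Counting the tuples returned over a whole corpus gives a rough idea of how much would be lost if only same-sentence arguments were kept.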
I cloned this repository from the master (main) branch, but building the Docker image fails with the following error message.
=> [internal] load build definition from Dockerfile 0.2s
=> => transferring dockerfile: 3.33kB 0.1s
=> [internal] load .dockerignore 0.2s
=> => transferring context: 443B 0.1s
=> [internal] load metadata for docker.io/pytorch/pytorch:1.6.0-cuda10.1-cudnn7-devel 6.2s
=> [auth] pytorch/pytorch:pull token for registry-1.docker.io 0.0s
=> CANCELED [ 1/29] FROM docker.io/pytorch/pytorch:1.6.0-cuda10.1-cudnn7-devel@sha256:ccebb46f954b1d32a4700aaeae0e24bd68653f92c6f276a608bf592b660b63d7 122.3s
=> => resolve docker.io/pytorch/pytorch:1.6.0-cuda10.1-cudnn7-devel@sha256:ccebb46f954b1d32a4700aaeae0e24bd68653f92c6f276a608bf592b660b63d7 0.0s
=> => sha256:bb833e4d631feff31ab57559d64617ad895d3ae7f45fdb651f9ba2df50b183b7 10.06kB / 10.06kB 0.0s
=> => sha256:8c3b70e3904492c753652606df4726430426f42ea56e06ea924d6fea7ae162a1 845B / 845B 0.0s
=> => sha256:ee9b457b77d047ff322858e2de025e266ff5908aec569560e77e2e4451fc23f4 184B / 184B 0.0s
=> => sha256:ccebb46f954b1d32a4700aaeae0e24bd68653f92c6f276a608bf592b660b63d7 3.05kB / 3.05kB 0.0s
=> => sha256:c1bbdc448b7263673926b8fe2e88491e5083a8b4b06ddfabf311f2fc5f27e2ff 35.36kB / 35.36kB 0.0s
=> => sha256:7ddbc47eeb70dc7f08e410a6667948b87ff3883024eb41478b44ef9a81bf400c 26.69MB / 26.69MB 0.0s
=> => sha256:45d437916d5781043432f2d72608049dcf74ddbd27daa01a25fa63c8f1b9adc4 162B / 162B 0.0s
=> => sha256:d8f1569ddae616589c5a2dabf668fadd250ee9d89253ef16f0cb0c8a9459b322 7.22MB / 7.22MB 0.0s
=> => sha256:85386706b02069c58ffaea9de66c360f9d59890e56f58485d05c1a532ca30db1 8.45MB / 8.45MB 0.0s
=> => sha256:be4f3343ecd31ebf7ec8809f61b1d36c2c2f98fc4e63582401d9108575bc443a 77.59MB / 688.74MB 123.2s
=> => sha256:30b4effda4fdab95ec4eba8873f86e7574c2edddf4dc5df8212e3eda1545aafa 90.18MB / 820.84MB 123.2s
=> => sha256:b398e882f4149bf61faa8f2c1d47a4fe98b8fe1b2c9379da1d58ddc54fe67cf0 110.10MB / 532.41MB 123.2s
=> => extracting sha256:7ddbc47eeb70dc7f08e410a6667948b87ff3883024eb41478b44ef9a81bf400c 15.6s
=> => extracting sha256:c1bbdc448b7263673926b8fe2e88491e5083a8b4b06ddfabf311f2fc5f27e2ff 0.0s
=> => extracting sha256:8c3b70e3904492c753652606df4726430426f42ea56e06ea924d6fea7ae162a1 0.0s
=> => extracting sha256:45d437916d5781043432f2d72608049dcf74ddbd27daa01a25fa63c8f1b9adc4 0.0s
=> => extracting sha256:d8f1569ddae616589c5a2dabf668fadd250ee9d89253ef16f0cb0c8a9459b322 5.3s
=> => extracting sha256:85386706b02069c58ffaea9de66c360f9d59890e56f58485d05c1a532ca30db1 2.7s
=> => extracting sha256:ee9b457b77d047ff322858e2de025e266ff5908aec569560e77e2e4451fc23f4 0.0s
=> [internal] load build context 122.3s
=> => transferring context: 903.71MB 122.1s
=> CACHED [ 2/29] RUN mkdir /dygiepp 0.0s
=> CACHED [ 3/29] RUN apt-get update && apt-get -y install gcc make sqlite3 0.0s
=> CACHED [ 4/29] RUN conda create --name dygiepp python=3.7 -y 0.0s
=> CACHED [ 5/29] RUN conda install -c conda-forge jsonnet -y 0.0s
=> CACHED [ 6/29] COPY requirements.txt /tmp/requirements.txt 0.0s
=> CACHED [ 7/29] RUN pip install -r /tmp/requirements.txt 0.0s
=> CACHED [ 8/29] RUN conda create --name ace-event-preprocess python=3.7 -y 0.0s
=> CACHED [ 9/29] COPY scripts/data/ace-event/requirements.txt /tmp/ace-prep-requirements.txt 0.0s
=> CACHED [10/29] RUN pip install -r /tmp/ace-prep-requirements.txt 0.0s
=> CACHED [11/29] RUN python -m spacy download en 0.0s
=> CACHED [12/29] RUN apt-get install openjdk-8-jdk openjdk-8-jre wget unzip -y 0.0s
=> CACHED [13/29] COPY scripts/data/ace05/get_corenlp.sh /tmp/get_corenlp.sh 0.0s
=> CACHED [14/29] RUN cd /dygiepp/ && bash /tmp/get_corenlp.sh 0.0s
=> CACHED [15/29] RUN conda install -c conda-forge zsh -y 0.0s
=> CACHED [16/29] RUN apt-get install unzip wget -y 0.0s
=> CACHED [17/29] COPY scripts/data/shared /dygiepp/scripts/data/shared 0.0s
=> CACHED [18/29] COPY scripts/data/get_scierc.sh /tmp/get_scierc.sh 0.0s
=> CACHED [19/29] COPY dygie /dygiepp/dygie 0.0s
=> CACHED [20/29] RUN cd /dygiepp && bash /tmp/get_scierc.sh 0.0s
=> CACHED [21/29] RUN apt-get install wget -y 0.0s
=> CACHED [22/29] COPY scripts/pretrained/get_dygiepp_pretrained.sh /tmp/get_dygiepp_pretrained.sh 0.0s
=> CACHED [23/29] RUN cd /dygiepp && bash /tmp/get_dygiepp_pretrained.sh 0.0s
=> ERROR [24/29] COPY scripts/pretrained/get_scibert.py /tmp/get_scibert.py 0.0s
[24/29] COPY scripts/pretrained/get_scibert.py /tmp/get_scibert.py:
failed to compute cache key: "/scripts/pretrained/get_scibert.py" not found: not found
Looking forward to hearing from you soon!
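The failing step is a COPY whose source file is not present in the build context. A quick way to list such files before building is sketched below. It is deliberately simplified and makes assumptions: only the plain "COPY src... dest" form is handled, .dockerignore is not consulted, and the function name is hypothetical.

```python
from pathlib import Path


def missing_copy_sources(dockerfile_text, context_dir):
    """Return COPY sources that do not exist under context_dir.

    Deliberately simplified: only the plain `COPY src... dest` form is
    handled; lines using flags (e.g. --from=), the JSON-array form, or
    wildcards are skipped.
    """
    context = Path(context_dir)
    missing = []
    for line in dockerfile_text.splitlines():
        parts = line.split()
        if len(parts) < 3 or parts[0].upper() != "COPY":
            continue
        args = parts[1:]
        if any(a.startswith("--") or a.startswith("[") or "*" in a for a in args):
            continue
        for src in args[:-1]:  # the last argument is the destination
            if not (context / src).exists():
                missing.append(src)
    return missing


if __name__ == "__main__":
    dockerfile = Path("Dockerfile")
    if dockerfile.is_file():
        for src in missing_copy_sources(dockerfile.read_text(), "."):
            print(f"missing from build context: {src}")
```

Run from the repository root, this prints every COPY source that the checkout does not contain.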
Hi
Thank you for your amazing work and for publishing the code!
While replicating your work by making predictions on the existing dataset, I encountered an error when running the following command. Can you please help me out?
allennlp predict ./scripts/pretrained/genia-lightweight.tar.gz \
    ./scripts/processed_data/json-coref-ident-only/test.json \
    --predictor dygie \
    --include-package dygie \
    --use-dataset-reader \
    --output-file predictions/genia-test.jsonl \
    --cuda-device 0
Thank you!
In the documentation, there are two dataset preprocessing steps: one for entities and relations, and a second one for events. The first uses Stanford CoreNLP, while the second uses spaCy. Can you please explain the difference? I see that the relation labels differ between these preprocessing steps (such as "ORG-AFF.Membership" or "GEN-AFF"), and their offset values differ too. There are other differences as well. It would be helpful if you could provide some details.
Since ACE05 is a benchmark dataset, I assume the token/entity/relation/event annotations are already there. Why, then, are the CoreNLP or spaCy libraries needed?
Hello,
Thanks for the great work and paper!
I'm trying to reproduce the SciERC results, but I'm only getting:
"best_validation__rel_f1": 0.33248730964467005,
"best_validation_ner_f1": 0.6711409395972655,
The paper reports an F1 score of 48.4 for relation extraction on SciERC.
My questions are:
Below I give the full step-by-step installation, log tail, system info and pip list:
Installation:
conda create --name dygie python=3.7
conda activate dygie
conda install pytorch==1.2.0 torchvision==0.4.0 cudatoolkit=10.0 -c pytorch
pip install allennlp
pip install botocore
git clone https://github.com/dwadden/dygiepp.git
cd dygiepp/
pip install -r requirements.txt
bash ./scripts/data/get_scierc.sh
bash ./scripts/train/train_scierc.sh 0
Log tail:
[...]
2019-11-13 16:08:07,639 - WARNING - root - NaN or Inf found in input tensor.
2019-11-13 16:08:07,640 - WARNING - root - NaN or Inf found in input tensor.
ner_precision: 0.9754, ner_recall: 0.9725, ner_f1: 0.9740, loss: 4.5208 ||: 100%|#########9| 499/500 [01:05<00:00, 8.05it/s]
2019-11-13 16:08:20,732 - WARNING - root - NaN or Inf found in input tensor.
2019-11-13 16:08:20,732 - WARNING - root - NaN or Inf found in input tensor.
2019-11-13 16:08:20,734 - WARNING - root - NaN or Inf found in input tensor.
2019-11-13 16:08:20,734 - WARNING - root - NaN or Inf found in input tensor.
2019-11-13 16:08:20,740 - WARNING - root - NaN or Inf found in input tensor.
2019-11-13 16:08:20,740 - WARNING - root - NaN or Inf found in input tensor.
2019-11-13 16:08:20,742 - WARNING - root - NaN or Inf found in input tensor.
2019-11-13 16:08:20,742 - WARNING - root - NaN or Inf found in input tensor.
ner_precision: 0.9755, ner_recall: 0.9726, ner_f1: 0.9740, loss: 4.5118 ||: 100%|##########| 500/500 [01:05<00:00, 7.62it/s]
2019-11-13 16:08:20,746 - INFO - allennlp.training.trainer - Validating
ner_precision: 0.6449, ner_recall: 0.6873, ner_f1: 0.6655, loss: 232.1718 ||: 100%|##########| 50/50 [00:02<00:00, 21.36it/s]
2019-11-13 16:08:23,090 - INFO - allennlp.training.trainer - Ran out of patience. Stopping training.
2019-11-13 16:08:23,090 - INFO - allennlp.training.checkpointer - loading best weights
2019-11-13 16:08:23,219 - INFO - allennlp.commands.train - To evaluate on the test set after training, pass the 'evaluate_on_test' flag, or use the 'allennlp evaluate' command.
2019-11-13 16:08:23,219 - INFO - allennlp.models.archival - archiving weights and vocabulary to ./models/scierc/model.tar.gz
2019-11-13 16:08:40,237 - INFO - allennlp.common.util - Metrics: {
"best_epoch": 23,
"peak_cpu_memory_MB": 3783.14,
"peak_gpu_0_memory_MB": 8654,
"training_duration": "0:43:55.006665",
"training_start_epoch": 0,
"training_epochs": 37,
"epoch": 37,
"training__coref_precision": 0.8032079153669152,
"training__coref_recall": 0.52068901932887,
"training__coref_f1": 0.6030049906804981,
"training__coref_mention_recall": 0.9909547738693467,
"training_ner_precision": 0.9815991970558715,
"training_ner_recall": 0.9786524349566378,
"training_ner_f1": 0.9801236011357441,
"training__rel_precision": 0.8399269628727937,
"training__rel_recall": 0.8051341890315052,
"training__rel_f1": 0.8221626452189454,
"training__rel_span_recall": 0.837222870478413,
"training__trig_id_precision": 0,
"training__trig_id_recall": 0,
"training__trig_id_f1": 0,
"training__trig_class_precision": 0,
"training__trig_class_recall": 0,
"training__trig_class_f1": 0,
"training__arg_id_precision": 0,
"training__arg_id_recall": 0,
"training__arg_id_f1": 0,
"training__arg_class_precision": 0,
"training__arg_class_recall": 0,
"training__arg_class_f1": 0,
"training__args_multiple": 0,
"training_loss": 4.156547112312106,
"training_cpu_memory_MB": 3783.14,
"training_gpu_0_memory_MB": 8654,
"validation__coref_precision": 0.5785864145810381,
"validation__coref_recall": 0.40000249926139114,
"validation__coref_f1": 0.47185963848741963,
"validation__coref_mention_recall": 0.9338235294117647,
"validation_ner_precision": 0.6489988221436984,
"validation_ner_recall": 0.6836228287841191,
"validation_ner_f1": 0.6658610271902824,
"validation__rel_precision": 0.3944954128440367,
"validation__rel_recall": 0.378021978021978,
"validation__rel_f1": 0.3860830527497194,
"validation__rel_span_recall": 0.4175824175824176,
"validation__trig_id_precision": 0,
"validation__trig_id_recall": 0,
"validation__trig_id_f1": 0,
"validation__trig_class_precision": 0,
"validation__trig_class_recall": 0,
"validation__trig_class_f1": 0,
"validation__arg_id_precision": 0,
"validation__arg_id_recall": 0,
"validation__arg_id_f1": 0,
"validation__arg_class_precision": 0,
"validation__arg_class_recall": 0,
"validation__arg_class_f1": 0,
"validation__args_multiple": 0,
"validation_loss": 223.13119384765625,
"best_validation__coref_precision": 0.5206815500675314,
"best_validation__coref_recall": 0.39612618358360496,
"best_validation__coref_f1": 0.44944354440112394,
"best_validation__coref_mention_recall": 0.9375,
"best_validation_ner_precision": 0.6602641056422568,
"best_validation_ner_recall": 0.6823821339950371,
"best_validation_ner_f1": 0.6711409395972655,
"best_validation__rel_precision": 0.3933933933933934,
"best_validation__rel_recall": 0.2879120879120879,
"best_validation__rel_f1": 0.33248730964467005,
"best_validation__rel_span_recall": 0.3208791208791209,
"best_validation__trig_id_precision": 0,
"best_validation__trig_id_recall": 0,
"best_validation__trig_id_f1": 0,
"best_validation__trig_class_precision": 0,
"best_validation__trig_class_recall": 0,
"best_validation__trig_class_f1": 0,
"best_validation__arg_id_precision": 0,
"best_validation__arg_id_recall": 0,
"best_validation__arg_id_f1": 0,
"best_validation__arg_class_precision": 0,
"best_validation__arg_class_recall": 0,
"best_validation__arg_class_f1": 0,
"best_validation__args_multiple": 0,
"best_validation_loss": 142.15479759216308
}
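As a sanity check on the numbers above, F1 can be recomputed from the reported precision and recall as their harmonic mean. The short sketch below does this for the "best_validation" relation and NER entries; note that these are validation-set metrics. The function name is hypothetical.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall; 0.0 when both are zero."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


if __name__ == "__main__":
    # Values copied from the "best_validation" entries of the metrics above.
    rel_p, rel_r = 0.3933933933933934, 0.2879120879120879
    ner_p, ner_r = 0.6602641056422568, 0.6823821339950371
    print(f"rel F1 = {f1_score(rel_p, rel_r):.4f}")
    print(f"ner F1 = {f1_score(ner_p, ner_r):.4f}")
```

The recomputed values agree with the logged best_validation__rel_f1 and best_validation_ner_f1, so the low relation score reflects low precision and recall rather than a reporting glitch.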
System:
Ubuntu 18.04.3 LTS
GPU 0: GeForce RTX 2080 Ti
Driver Version: 418.88
CUDA Version: 10.1
pip list:
$ pip list
Package Version
----------------------------- -------------------
alabaster 0.7.12
allennlp 0.9.0
atomicwrites 1.3.0
attrs 19.3.0
Babel 2.7.0
beautifulsoup4 4.8.1
blis 0.2.4
boto3 1.10.16
botocore 1.13.16
certifi 2019.9.11
cffi 1.13.1
chardet 3.0.4
Click 7.0
conllu 1.3.1
cycler 0.10.0
cymem 2.0.2
docutils 0.15.2
editdistance 0.5.3
flaky 3.6.1
Flask 1.1.1
Flask-Cors 3.0.8
ftfy 5.6
gevent 1.4.0
greenlet 0.4.15
h5py 2.10.0
idna 2.8
imagesize 1.1.0
importlib-metadata 0.23
itsdangerous 1.1.0
Jinja2 2.10.3
jmespath 0.9.4
joblib 0.14.0
jsonnet 0.14.0
jsonpickle 1.2
kiwisolver 1.1.0
lxml 4.4.1
MarkupSafe 1.1.1
matplotlib 3.1.1
mkl-fft 1.0.15
mkl-random 1.1.0
mkl-service 2.3.0
more-itertools 7.2.0
murmurhash 1.0.2
nltk 3.4.5
numpy 1.17.3
numpydoc 0.9.1
olefile 0.46
overrides 2.5
packaging 19.2
pandas 0.25.3
parsimonious 0.8.1
Pillow 6.2.1
pip 19.3.1
plac 0.9.6
pluggy 0.13.0
preshed 2.0.1
protobuf 3.10.0
py 1.8.0
pycparser 2.19
Pygments 2.4.2
pyparsing 2.4.5
pytest 5.2.2
python-dateutil 2.8.0
python-Levenshtein 0.12.0
pytorch-pretrained-bert 0.6.2
pytorch-transformers 1.1.0
pytz 2019.3
regex 2019.11.1
requests 2.22.0
responses 0.10.6
s3transfer 0.2.1
scikit-learn 0.21.3
scipy 1.3.2
sentencepiece 0.1.83
setuptools 41.6.0.post20191030
six 1.13.0
snowballstemmer 2.0.0
soupsieve 1.9.5
spacy 2.1.9
Sphinx 2.2.1
sphinxcontrib-applehelp 1.0.1
sphinxcontrib-devhelp 1.0.1
sphinxcontrib-htmlhelp 1.0.2
sphinxcontrib-jsmath 1.0.1
sphinxcontrib-qthelp 1.0.2
sphinxcontrib-serializinghtml 1.1.3
sqlparse 0.3.0
srsly 0.2.0
tensorboardX 1.9
thinc 7.0.8
torch 1.2.0
torchvision 0.4.0a0+6b959ee
tqdm 4.38.0
Unidecode 1.1.1
urllib3 1.25.7
wasabi 0.4.0
wcwidth 0.1.7
Werkzeug 0.16.0
wheel 0.33.6
word2number 1.1
zipp 0.6.0
I want to use my own dataset, which consists of *.ann and *.txt files. How can I convert it into the input format this project expects? Are there tools for this? I would appreciate any advice.
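If the *.ann files are brat standoff annotations, a conversion could look roughly like the sketch below. It rests on several assumptions: entity lines have the form "T1<TAB>Label start end<TAB>text", the target format uses token-level spans with inclusive end indices, each document is treated as a single sentence, and whitespace tokenization stands in for a real tokenizer. All function names are hypothetical, and the real input format may require further fields.

```python
import re


def tokenize_with_offsets(text):
    """Whitespace tokenization that keeps character offsets (a stand-in for a
    real tokenizer)."""
    return [(m.group(), m.start(), m.end()) for m in re.finditer(r"\S+", text)]


def parse_ann_entities(ann_text):
    """Parse brat text-bound annotations: 'T1<TAB>Label start end<TAB>text'.
    Discontinuous spans (containing ';') and non-entity lines are skipped."""
    entities = []
    for line in ann_text.splitlines():
        if not line.startswith("T"):
            continue
        fields = line.split("\t")
        if len(fields) < 2 or ";" in fields[1]:
            continue
        label, start, end = fields[1].split()
        entities.append((label, int(start), int(end)))
    return entities


def to_dygiepp_doc(doc_key, text, ann_text):
    """Build a one-sentence document with token-level, inclusive NER spans."""
    tokens = tokenize_with_offsets(text)
    ner = []
    for label, start, end in parse_ann_entities(ann_text):
        # Tokens whose character range overlaps the annotated range.
        covered = [i for i, (_, s, e) in enumerate(tokens) if s < end and e > start]
        if covered:
            ner.append([covered[0], covered[-1], label])
    return {
        "doc_key": doc_key,
        "sentences": [[tok for tok, _, _ in tokens]],
        "ner": [ner],
    }


if __name__ == "__main__":
    text = "Aspirin inhibits COX-1 strongly"  # toy example
    ann = "T1\tChemical 0 7\tAspirin\nT2\tProtein 17 22\tCOX-1"
    print(to_dygiepp_doc("toy-doc", text, ann))
```

Relations and events in the *.ann files (lines starting with R or E) would need similar handling, mapping the referenced T identifiers to the token spans computed here.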
Hi,
I'm trying to apply the SciERC pre-trained model to an unlabeled dataset of abstracts from plant science papers. I used the following command line to format my data:
python scripts/new-dataset/format_new_dataset.py ../knowledge-graph/data/first_manuscript_data/clustering_pipeline_output/JA_GA_chosen_abstracts/ ../knowledge-graph/data/first_manuscript_data/dygiepp/prepped_data/dygiepp_formatted_data_SciERC.jsonl scierc
where the directory JA_GA_chosen_abstracts contains a .txt file for each abstract.
I was then able to run the pre-trained SciERC model on this data successfully. However, when looking at the results, I noticed I was getting a lot of entities that were either a single round bracket, a single hyphen, or a word followed or preceded by a hyphen. When I looked more closely at the tokenized sentences in the preprocessed jsonl file, it was clear that this is because the spaCy tokenizer in ./scripts/new-dataset/format_new_dataset.py splits hyphenated words and leaves parentheses/brackets as-is.
However, when I looked at the processed SciERC json files, it looks like they were tokenized with PTB3 token transforms ("(" becomes "-LRB-", etc.) and without splitting hyphenated words. A cursory Google search suggests that this tokenization may have been done with the Stanford NLP tokenizer, because it offers options to use PTB3 transforms and to not split hyphenated words.
I checked the webpage that the processed SciERC dataset is pulled from, and skimmed the paper and the repo, but didn't see anything indicating how the dataset was tokenized. I was wondering if you know which tokenizer was used on the SciERC data, and whether you think it would be better to use the same tokenization scheme on new datasets to get better performance with the pre-trained model. If it turns out it was done with the Stanford NLP tokenizer, I'd be more than happy to open a PR adding an option to use that tokenizer in format_new_dataset.py.
Thanks!
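As a stopgap, the two differences described above could be patched after tokenization. The sketch below re-joins tokens split around a hyphen and applies the PTB-style bracket replacements. It is not the tokenizer used to build the SciERC files; the function names are hypothetical, and the input is assumed to be a list of token strings plus a parallel list of booleans saying whether each token was followed by whitespace.

```python
PTB_BRACKETS = {
    "(": "-LRB-", ")": "-RRB-",
    "[": "-LSB-", "]": "-RSB-",
    "{": "-LCB-", "}": "-RCB-",
}


def merge_hyphenated(tokens, spaces):
    """Re-join tokens that were split around a hyphen.

    tokens: token strings; spaces: parallel booleans, True if the token was
    followed by whitespace in the original text.
    """
    merged = []
    i, n = 0, len(tokens)
    while i < n:
        current = tokens[i]
        # Absorb "-" plus the next token while no whitespace separates them.
        while (
            i + 2 < n
            and not spaces[i]
            and tokens[i + 1] == "-"
            and not spaces[i + 1]
        ):
            current += "-" + tokens[i + 2]
            i += 2
        merged.append(current)
        i += 1
    return merged


def to_ptb(tokens):
    """Replace bracket tokens with their PTB-style names."""
    return [PTB_BRACKETS.get(tok, tok) for tok in tokens]


if __name__ == "__main__":
    tokens = ["JA", "-", "mediated", "signaling", "(", "GA", ")"]
    spaces = [False, False, True, True, False, False, False]
    print(to_ptb(merge_hyphenated(tokens, spaces)))
```

With spaCy, the two input lists could be taken from each token's text and whitespace_ attributes. Because merging changes token indices, it would have to run before any span annotations are aligned to tokens.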
Hi,
I am training on a document of 40 sentences, some of which are long.
Do you have any advice on lowering GPU memory use (e.g. batch size or sentence length)? I cannot find the relevant parameters, as I am not familiar with allennlp.
I ran out of memory on a 32GB Nvidia GPU.
Thanks
2020-05-27 14:17:59,641 - INFO - allennlp.training.optimizers - Number of trainable parameters: 123007840
2020-05-27 14:17:59,642 - INFO - allennlp.common.params - trainer.optimizer.infer_type_and_cast = True
2020-05-27 14:17:59,642 - INFO - allennlp.common.params - Converting Params object to dict; logging of default values will not occur when dictionary parameters are used subsequently.
2020-05-27 14:17:59,642 - INFO - allennlp.common.params - CURRENTLY DEFINED PARAMETERS:
2020-05-27 14:17:59,642 - INFO - allennlp.common.params - trainer.optimizer.lr = 0.001
2020-05-27 14:17:59,642 - INFO - allennlp.common.params - trainer.optimizer.t_total = 10000
2020-05-27 14:17:59,642 - INFO - allennlp.common.params - trainer.optimizer.warmup = 0.1
2020-05-27 14:17:59,642 - INFO - allennlp.common.params - trainer.optimizer.weight_decay = 0
2020-05-27 14:17:59,642 - INFO - allennlp.common.registrable - instantiating registered subclass bert_adam of <class 'allennlp.training.optimizers.Optimizer'>
2020-05-27 14:17:59,642 - INFO - allennlp.common.params - trainer.learning_rate_scheduler.type = reduce_on_plateau
2020-05-27 14:17:59,642 - INFO - allennlp.common.registrable - instantiating registered subclass reduce_on_plateau of <class 'allennlp.training.learning_rate_schedulers.learning_rate_scheduler.LearningRateScheduler'>
2020-05-27 14:17:59,642 - INFO - allennlp.common.params - Converting Params object to dict; logging of default values will not occur when dictionary parameters are used subsequently.
2020-05-27 14:17:59,642 - INFO - allennlp.common.params - CURRENTLY DEFINED PARAMETERS:
2020-05-27 14:17:59,642 - INFO - allennlp.common.params - trainer.learning_rate_scheduler.factor = 0.5
2020-05-27 14:17:59,642 - INFO - allennlp.common.params - trainer.learning_rate_scheduler.mode = max
2020-05-27 14:17:59,642 - INFO - allennlp.common.params - trainer.learning_rate_scheduler.patience = 4
2020-05-27 14:17:59,642 - INFO - allennlp.common.params - trainer.num_serialized_models_to_keep = 3
2020-05-27 14:17:59,642 - INFO - allennlp.common.params - trainer.keep_serialized_model_every_num_seconds = None
2020-05-27 14:17:59,643 - INFO - allennlp.common.params - trainer.model_save_interval = None
2020-05-27 14:17:59,643 - INFO - allennlp.common.params - trainer.summary_interval = 100
2020-05-27 14:17:59,643 - INFO - allennlp.common.params - trainer.histogram_interval = None
2020-05-27 14:17:59,643 - INFO - allennlp.common.params - trainer.should_log_parameter_statistics = True
2020-05-27 14:17:59,643 - INFO - allennlp.common.params - trainer.should_log_learning_rate = False
2020-05-27 14:17:59,643 - INFO - allennlp.common.params - trainer.log_batch_size_period = None
2020-05-27 14:17:59,806 - INFO - allennlp.training.trainer - Beginning training.
2020-05-27 14:17:59,806 - INFO - allennlp.training.trainer - Epoch 0/249
2020-05-27 14:17:59,806 - INFO - allennlp.training.trainer - Peak CPU memory usage MB: 2505.212
2020-05-27 14:17:59,891 - INFO - allennlp.training.trainer - Training
0%| | 0/1299 [00:00<?, ?it/s]
./scripts/train/train_chemprot.sh: line 18: 6835 Killed ie_train_data_path=$data_root/training.jsonl ie_dev_data_path=$data_root/development.jsonl ie_test_data_path=$data_root/test.jsonl cuda_device=$cuda_device allennlp train $config_file --cache-directory $data_root/cached --serialization-dir ./models/$experiment_name --include-package dygie
I think the cause of this problem is that the memory is too small. I would like to hear your advice.
My cloud server:
2vCPUs | 4GB
Hi,
I emailed you earlier regarding this. To make this more official, I was wondering if you had any suggestions on how to fine-tune a pretrained model.
That is, I have a set of annotated articles, and I would love to use this data to fine-tune the ace-event model.
Thanks!
My custom dataset has ACE-like event annotations with triggers and arguments but no NER or relations.
I tried running the ACE training pipeline on my event data and left the relations and ner keys empty, e.g.:
{
"doc_key": "aal00",
"sentences": [
["American", "Airlines", "Up", "on", "Record", "April", "Traffic", ",", "Upbeat", "Q2", "View"],
["Premier", "passenger", "carrier", ",", "American", "Airlines", "Group", "Inc", ".", "AAL", "saw", "its", "shares", "rise", "4.76", "%", "to", "$", "47.08", "at", "the", "close", "of", "business", "on", "Apr", "9", ",", "following", "the", "release", "of", "its", "traffic", "report", "for", "the", "month", "of", "April", "."]
],
"events": [
[ ],
[
[[24, "SecurityValue"], [25, 26, "IncreaseAmount"], [23, 23, "Security"], [30, 37, "TIME"], [28, 29, "Price"]],
[[45, "FinancialReport"], [43, 43, "Reportee"], [46, 50, "TIME"]]
]
],
"ner": [
[ ],
[ ]
],
"relations": [
[ ],
[ ]
],
"clusters": [
[ ],
[ ]
]
}
I configured the .jsonnet file based on train_ace05_event.jsonnet, changing n_trigger_labels to the number of event types and setting n_ner_labels: 0, because I have no NER annotations.
Those were all the relevant config keys I could identify from the file itself. I probably missed some, because training failed with the following error:
2020-08-06 14:18:05,035 - INFO - allennlp.training.trainer - Training
0%| | 0/365 [00:00<?, ?it/s]
Traceback (most recent call last):
File "/opt/conda/envs/dygiepp/bin/allennlp", line 8, in <module>
sys.exit(run())
File "/opt/conda/envs/dygiepp/lib/python3.7/site-packages/allennlp/run.py", line 18, in run
main(prog="allennlp")
File "/opt/conda/envs/dygiepp/lib/python3.7/site-packages/allennlp/commands/__init__.py", line 102, in main
args.func(args)
File "/opt/conda/envs/dygiepp/lib/python3.7/site-packages/allennlp/commands/train.py", line 124, in train_model_from_args
args.cache_prefix)
File "/opt/conda/envs/dygiepp/lib/python3.7/site-packages/allennlp/commands/train.py", line 168, in train_model_from_file
cache_directory, cache_prefix)
File "/opt/conda/envs/dygiepp/lib/python3.7/site-packages/allennlp/commands/train.py", line 252, in train_model
metrics = trainer.train()
File "/opt/conda/envs/dygiepp/lib/python3.7/site-packages/allennlp/training/trainer.py", line 478, in train
train_metrics = self._train_epoch(epoch)
File "/opt/conda/envs/dygiepp/lib/python3.7/site-packages/allennlp/training/trainer.py", line 320, in _train_epoch
loss = self.batch_loss(batch_group, for_training=True)
File "/opt/conda/envs/dygiepp/lib/python3.7/site-packages/allennlp/training/trainer.py", line 261, in batch_loss
output_dict = self.model(**batch)
File "/opt/conda/envs/dygiepp/lib/python3.7/site-packages/torch/nn/modules/module.py", line 547, in __call__
result = self.forward(*input, **kwargs)
File "./dygie/models/dygie.py", line 298, in forward
ner_labels, metadata)
File "/opt/conda/envs/dygiepp/lib/python3.7/site-packages/torch/nn/modules/module.py", line 547, in __call__
result = self.forward(*input, **kwargs)
File "./dygie/models/events.py", line 267, in forward
trig_arg_embeddings, top_trig_scores, top_arg_scores, top_arg_mask)
File "./dygie/models/events.py", line 553, in _compute_argument_scores
embeddings_flat = pairwise_embeddings.view(-1, feature_dim)
RuntimeError: shape '[-1, 2642]' is invalid for input of size 17840250
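The numbers in this RuntimeError can be checked by hand. A view(-1, feature_dim) call only succeeds when the total element count is a multiple of feature_dim, and 17840250 is not a multiple of 2642. The sketch below is plain Python and does not touch dygiepp; the helper name candidate_dims and the search window are illustrative assumptions. It lists nearby widths that do divide the tensor size evenly.

```python
def candidate_dims(numel, expected_dim, window=8):
    """Return widths near expected_dim that divide numel exactly."""
    low = max(1, expected_dim - window)
    high = expected_dim + window
    return [d for d in range(low, high + 1) if numel % d == 0]


numel = 17840250       # "input of size" from the traceback
expected_dim = 2642    # feature_dim the model tried to reshape to

print(numel % expected_dim)                 # non-zero, so the reshape fails
print(candidate_dims(numel, expected_dim))  # widths that would reshape cleanly
```

Within this window the only width that divides the size is 2643, one more than the configured 2642. That may mean one of the label-dependent sizes in the config is off by one, for example because a null label is counted in one place and not in another, but that reading is a guess and not something confirmed in this thread.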
I encounter a KeyError: 'None__ner_labels' when I try to use dygiepp to predict on a new dataset. The details follow:
Traceback (most recent call last):
File "/home/wangpancheng/anaconda3/envs/dygiepp/bin/allennlp", line 8, in <module>
sys.exit(run())
File "/home/wangpancheng/anaconda3/envs/dygiepp/lib/python3.7/site-packages/allennlp/__main__.py", line 34, in run
main(prog="allennlp")
File "/home/wangpancheng/anaconda3/envs/dygiepp/lib/python3.7/site-packages/allennlp/commands/__init__.py", line 118, in main
args.func(args)
File "/home/wangpancheng/anaconda3/envs/dygiepp/lib/python3.7/site-packages/allennlp/commands/predict.py", line 220, in _predict
manager.run()
File "/home/wangpancheng/anaconda3/envs/dygiepp/lib/python3.7/site-packages/allennlp/commands/predict.py", line 187, in run
for model_input_instance, result in zip(batch, self._predict_instances(batch)):
File "/home/wangpancheng/anaconda3/envs/dygiepp/lib/python3.7/site-packages/allennlp/commands/predict.py", line 146, in _predict_instances
results = [self._predictor.predict_instance(batch_data[0])]
File "/serverData2/wangpancheng/wpc/dygiepp/dygie/predictors/dygie.py", line 56, in predict_instance
prediction = model.make_output_human_readable(model(**model_input)).to_json()
File "/home/wangpancheng/anaconda3/envs/dygiepp/lib/python3.7/site-packages/torch/nn/modules/module.py", line 727, in _call_impl
result = self.forward(*input, **kwargs)
File "/serverData2/wangpancheng/wpc/dygiepp/dygie/models/dygie.py", line 239, in forward
spans, span_mask, span_embeddings, sentence_lengths, ner_labels, metadata)
File "/home/wangpancheng/anaconda3/envs/dygiepp/lib/python3.7/site-packages/torch/nn/modules/module.py", line 727, in _call_impl
result = self.forward(*input, **kwargs)
File "/serverData2/wangpancheng/wpc/dygiepp/dygie/models/ner.py", line 90, in forward
scorer = self._ner_scorers[self._active_namespace]
File "/home/wangpancheng/anaconda3/envs/dygiepp/lib/python3.7/site-packages/torch/nn/modules/container.py", line 286, in __getitem__
return self._modules[key]
KeyError: 'None__ner_labels'
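The key in this error has the form <dataset>__ner_labels with None in the dataset slot, which suggests (an assumption, not confirmed in this thread) that the input documents carry no dataset field. A minimal sketch for adding one to every line of a .jsonl file follows. The function name add_dataset_field is hypothetical, and the value presumably has to match the dataset name the model was trained with.

```python
import json


def add_dataset_field(jsonl_lines, dataset_name):
    """Return new JSON lines where every document has a 'dataset' field.

    Documents that already declare a dataset are left unchanged.
    """
    out = []
    for line in jsonl_lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        doc = json.loads(line)
        doc.setdefault("dataset", dataset_name)
        out.append(json.dumps(doc))
    return out


# Tiny in-memory example; in practice the lines would come from a file.
docs = ['{"doc_key": "d1", "sentences": [["Hello", "world"]]}']
fixed = add_dataset_field(docs, "ace05")  # "ace05" is a placeholder name
print(fixed[0])
```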
Hi,
thank you very much for providing your source code!
I know that training on ACE events is a work in progress, but if you could push your local template, we could try to figure out how to train in the meantime.
Hi,
I would like to use dygiepp on Dutch text.
This means I should use embeddings other than BERT. There is a Dutch version, 'BERTje'.
I would be grateful if you could give me a clue about where to start adapting your scripts, or allennlp, to use this Dutch version.
Thanks a lot.
Currently, the data preprocessing step only processes the English data. Is it possible to process the Chinese and Arabic portions of the ACE05 dataset?
Hello there!
Can you recommend a way to deploy dygiepp as a REST API?
Thanks,
NR
I have not yet dived deep into the model code. I ran allennlp predict as described in the docs, and it printed the predictions in the terminal. Is there any way to predict the relations, events, etc. from the shell?
A demo in a Jupyter Notebook would be very useful.
Using the following code:
import json
import sys
from allennlp.commands import main
overrides = json.dumps({"trainer": {"cuda_device": -1}})
runs = [
    ("../ace05_best_relation_bert.jsonnet", "../data/relation")
]
for config_file, serialization_dir in runs:
    sys.argv = [
        "allennlp",
        "train",
        config_file,
        "-s", serialization_dir,
        "--include-package", "dygie",
        "--include-package", "ie_json",
        "-o", overrides,
    ]
    main()
Attempting to run ace05_best_ner_bert.jsonnet returns the error:
I'm using the versions of allennlp and torch suggested in requirements.txt:
I am testing the allennlp-v1 branch to run the ACE event training code, and it has some import errors.
(dygiepp) root@0a587743b213:/dygiepp# rm -rf ./models/ace05-event; bash ./scripts/train/train_ace05_event.sh 0
2020-08-03 08:58:29,323 - INFO - transformers.file_utils - PyTorch version 1.5.1 available.
Traceback (most recent call last):
File "/opt/conda/envs/dygiepp/bin/allennlp", line 8, in <module>
sys.exit(run())
File "/opt/conda/envs/dygiepp/lib/python3.7/site-packages/allennlp/__main__.py", line 19, in run
main(prog="allennlp")
File "/opt/conda/envs/dygiepp/lib/python3.7/site-packages/allennlp/commands/__init__.py", line 91, in main
import_module_and_submodules(package_name)
File "/opt/conda/envs/dygiepp/lib/python3.7/site-packages/allennlp/common/util.py", line 351, in import_module_and_submodules
import_module_and_submodules(subpackage)
File "/opt/conda/envs/dygiepp/lib/python3.7/site-packages/allennlp/common/util.py", line 340, in import_module_and_submodules
module = importlib.import_module(package_name)
File "/opt/conda/envs/dygiepp/lib/python3.7/importlib/__init__.py", line 127, in import_module
return _bootstrap._gcd_import(name[level:], package, level)
File "<frozen importlib._bootstrap>", line 1006, in _gcd_import
File "<frozen importlib._bootstrap>", line 983, in _find_and_load
File "<frozen importlib._bootstrap>", line 967, in _find_and_load_unlocked
File "<frozen importlib._bootstrap>", line 677, in _load_unlocked
File "<frozen importlib._bootstrap_external>", line 728, in exec_module
File "<frozen importlib._bootstrap>", line 219, in _call_with_frames_removed
File "/dygiepp/dygie/data/__init__.py", line 3, in <module>
from dygie.data.iterators.batch_iterator import BatchIterator
File "/dygiepp/dygie/data/iterators/batch_iterator.py", line 10, in <module>
from allennlp.data.dataloader import DataLoader, PyTorchDataLoader
ImportError: cannot import name 'PyTorchDataLoader' from 'allennlp.data.dataloader' (/opt/conda/envs/dygiepp/lib/python3.7/site-packages/allennlp/data/dataloader.py)
(dygiepp) root@0a587743b213:/dygiepp# python -c "from allennlp.data.dataloader import PyTorchDataLoader"
Traceback (most recent call last):
File "<string>", line 1, in <module>
ImportError: cannot import name 'PyTorchDataLoader' from 'allennlp.data.dataloader' (/opt/conda/envs/dygiepp/lib/python3.7/site-packages/allennlp/data/dataloader.py)
I also tested the branch two weeks ago at commit 07074fd and encountered multiple issues with sen_dict() missing in the dataloader code, indicating refactoring is still ongoing.
Is the allennlp-v1 branch meant to be used now or will it be merged to master when ready to use?
I am planning on making adjustments to dygiepp to run on my own ERE-like event dataset, and the upgraded dependencies would make Apex usable, which is a big plus for me.
In any case, thanks for making DYGIE++ source available and maintaining it!
Thanks for making a comprehensive summarization of Event Extraction pre-processing.
I am confused by the number of documents in each split. The files dev.filelist, test.filelist and train.filelist have 28, 40 and 529 lines respectively, which makes 597 documents in total, but Table 8 in the Appendix reports 599 documents.
Hi
I tried to train and test with a new dataset that I formatted like the ACE event data.
However, I keep getting the following error:
File "/home/thierry/repos/dygiepp-dev/dygie/models/events.py", line 177, in forward
mention_pruner = self._mention_pruners[self._active_namespaces["argument"]]
File "/home/thierry/miniconda3/lib/python3.8/site-packages/torch/nn/modules/container.py", line 286, in __getitem__
return self._modules[key]
I assume I should specify the labels used somewhere?
Thanks for helping
In the paper you mention that, since ACE does not have coreference annotations, you use OntoNotes for coreference propagation. I was wondering whether it is possible to predict coreference on ACE using the currently trained model?
Hi, when I ran the preprocessing script parse_ace_event.py, I got "Exception: Should not get here" from the get_token_of function. I found that the character index could not be matched to a token index, which may be due to the tokenization. Have you ever encountered this problem, and how should I fix it?
All model tests are failing with the message: allennlp.common.checks.ConfigurationError: 'key "span_emb_dim" is required at location "coref."'
(dygie) ~/ml/dygiepp/dygie(debugging) $ python -mpytest tests
================================================= test session starts ==================================================
platform linux -- Python 3.7.5, pytest-5.2.2, py-1.8.0, pluggy-0.13.0
rootdir: /home/konrad/ml/dygiepp/dygie, inifile: pytest.ini
plugins: flaky-3.6.1
collected 15 items
tests/data/ie_json_test.py ........ [ 53%]
tests/models/coref_test.py F [ 60%]
tests/models/dygie_test.py F [ 66%]
tests/models/relation_test.py FFFFF [100%]
[...]
2020-06-11 21:22:09,219 - INFO - allennlp.training.trainer - Beginning training.
2020-06-11 21:22:09,219 - INFO - allennlp.training.trainer - Epoch 0/249
2020-06-11 21:22:09,219 - INFO - allennlp.training.trainer - Peak CPU memory usage MB: 3337.832
2020-06-11 21:22:09,275 - INFO - allennlp.training.trainer - Training
0%| | 0/1990 [00:03<?, ?it/s]
Traceback (most recent call last):
File "/root/miniconda3/envs/dygiepp/bin/allennlp", line 8, in <module>
sys.exit(run())
File "/root/miniconda3/envs/dygiepp/lib/python3.7/site-packages/allennlp/run.py", line 18, in run
main(prog="allennlp")
File "/root/miniconda3/envs/dygiepp/lib/python3.7/site-packages/allennlp/commands/__init__.py", line 102, in main
args.func(args)
File "/root/miniconda3/envs/dygiepp/lib/python3.7/site-packages/allennlp/commands/train.py", line 124, in train_model_from_args
args.cache_prefix)
File "/root/miniconda3/envs/dygiepp/lib/python3.7/site-packages/allennlp/commands/train.py", line 168, in train_model_from_file
cache_directory, cache_prefix)
File "/root/miniconda3/envs/dygiepp/lib/python3.7/site-packages/allennlp/commands/train.py", line 252, in train_model
metrics = trainer.train()
File "/root/miniconda3/envs/dygiepp/lib/python3.7/site-packages/allennlp/training/trainer.py", line 478, in train
train_metrics = self._train_epoch(epoch)
File "/root/miniconda3/envs/dygiepp/lib/python3.7/site-packages/allennlp/training/trainer.py", line 320, in _train_epoch
loss = self.batch_loss(batch_group, for_training=True)
File "/root/miniconda3/envs/dygiepp/lib/python3.7/site-packages/allennlp/training/trainer.py", line 261, in batch_loss
output_dict = self.model(**batch)
File "/root/miniconda3/envs/dygiepp/lib/python3.7/site-packages/torch/nn/modules/module.py", line 547, in __call__
result = self.forward(*input, **kwargs)
File "./dygie/models/dygie.py", line 280, in forward
output_relation = self._relation.predict_labels(relation_labels, output_relation, metadata)
File "./dygie/models/relation.py", line 193, in predict_labels
predictions = self.decode(output_dict)["decoded_relations_dict"]
File "./dygie/models/relation.py", line 218, in decode
top_spans, predicted_relations, num_spans_to_keep)
File "./dygie/models/relation.py", line 249, in _decode_sentence
label_name = self.vocab.get_token_from_index(label, namespace="relation_labels")
File "/root/miniconda3/envs/dygiepp/lib/python3.7/site-packages/allennlp/data/vocabulary.py", line 644, in get_token_from_index
return self._index_to_token[namespace][index]
KeyError: 0
Running "pip install -r requirements.txt" failed:
ERROR: Failed building wheel for jsonnet
Running setup.py clean for jsonnet
Successfully built en-core-sci-sm
Failed to build jsonnet
Installing collected packages: unidecode, itsdangerous, Werkzeug, flask, flask-cors, parsimonious, sqlparse, jmespath, botocore, s3transfer, boto3, pytorch-pretrained-bert, jsonnet, h5py, overrides, conllu, threadpoolctl, scipy, scikit-learn, editdistance, greenlet, gevent, protobuf, tensorboardX, wcwidth, ftfy, zipp, importlib-metadata, jsonpickle, blis, cymem, preshed, murmurhash, plac, srsly, wasabi, thinc, spacy, sentencepiece, pytorch-transformers, responses, attrs, py, more-itertools, pluggy, pytest, allennlp, pandas, soupsieve, beautifulsoup4, lxml, python-Levenshtein, PyYAML, pyasn1, rsa, colorama, awscli, pybind11, psutil, nmslib, scispacy, en-core-sci-sm
Running setup.py install for jsonnet ... error
ERROR: Command errored out with exit status 1:
command: /root/miniconda3/envs/dygiepp/bin/python -u -c 'import sys, setuptools, tokenize; sys.argv[0] = '"'"'/tmp/pip-install-fout_ucm/jsonnet/setup.py'"'"'; file='"'"'/tmp/pip-install-fout_ucm/jsonnet/setup.py'"'"';f=getattr(tokenize, '"'"'open'"'"', open)(file);code=f.read().replace('"'"'\r\n'"'"', '"'"'\n'"'"');f.close();exec(compile(code, file, '"'"'exec'"'"'))' install --record /tmp/pip-record-nukhhlr0/install-record.txt --single-version-externally-managed --compile --install-headers /root/miniconda3/envs/dygiepp/include/python3.7m/jsonnet
cwd: /tmp/pip-install-fout_ucm/jsonnet/
Complete output (35 lines):
running install
running build
running build_ext
g++ -c -g -O3 -Wall -Wextra -Woverloaded-virtual -pedantic -std=c++0x -fPIC -Iinclude -Ithird_party/md5 -Ithird_party/json core/desugarer.cpp -o core/desugarer.o
make: g++: Command not found
make: *** [core/desugarer.o] Error 127
Traceback (most recent call last):
File "<string>", line 1, in <module>
File "/tmp/pip-install-fout_ucm/jsonnet/setup.py", line 75, in <module>
test_suite="python._jsonnet_test",
File "/root/miniconda3/envs/dygiepp/lib/python3.7/site-packages/setuptools/__init__.py", line 144, in setup
return distutils.core.setup(**attrs)
File "/root/miniconda3/envs/dygiepp/lib/python3.7/distutils/core.py", line 148, in setup
dist.run_commands()
File "/root/miniconda3/envs/dygiepp/lib/python3.7/distutils/dist.py", line 966, in run_commands
self.run_command(cmd)
File "/root/miniconda3/envs/dygiepp/lib/python3.7/distutils/dist.py", line 985, in run_command
cmd_obj.run()
File "/root/miniconda3/envs/dygiepp/lib/python3.7/site-packages/setuptools/command/install.py", line 61, in run
return orig.install.run(self)
File "/root/miniconda3/envs/dygiepp/lib/python3.7/distutils/command/install.py", line 545, in run
self.run_command('build')
File "/root/miniconda3/envs/dygiepp/lib/python3.7/distutils/cmd.py", line 313, in run_command
self.distribution.run_command(command)
File "/root/miniconda3/envs/dygiepp/lib/python3.7/distutils/dist.py", line 985, in run_command
cmd_obj.run()
File "/root/miniconda3/envs/dygiepp/lib/python3.7/distutils/command/build.py", line 135, in run
self.run_command(cmd_name)
File "/root/miniconda3/envs/dygiepp/lib/python3.7/distutils/cmd.py", line 313, in run_command
self.distribution.run_command(command)
File "/root/miniconda3/envs/dygiepp/lib/python3.7/distutils/dist.py", line 985, in run_command
cmd_obj.run()
File "/tmp/pip-install-fout_ucm/jsonnet/setup.py", line 54, in run
raise Exception('Could not build %s' % (', '.join(LIB_OBJECTS)))
Exception: Could not build core/desugarer.o, core/formatter.o, core/libjsonnet.o, core/lexer.o, core/parser.o, core/pass.o, core/static_analysis.o, core/string_utils.o, core/vm.o, third_party/md5/md5.o
----------------------------------------
ERROR: Command errored out with exit status 1: /root/miniconda3/envs/dygiepp/bin/python -u -c 'import sys, setuptools, tokenize; sys.argv[0] = '"'"'/tmp/pip-install-fout_ucm/jsonnet/setup.py'"'"'; file='"'"'/tmp/pip-install-fout_ucm/jsonnet/setup.py'"'"';f=getattr(tokenize, '"'"'open'"'"', open)(file);code=f.read().replace('"'"'\r\n'"'"', '"'"'\n'"'"');f.close();exec(compile(code, file, '"'"'exec'"'"'))' install --record /tmp/pip-record-nukhhlr0/install-record.txt --single-version-externally-managed --compile --install-headers /root/miniconda3/envs/dygiepp/include/python3.7m/jsonnet Check the logs for full command output.
Why is the output dimension in TimeDistributed(torch.nn.Linear(mention_feedforward.get_output_dim(), self._n_labels - 1)) set to self._n_labels - 1? Should it be self._n_labels?
Also, why did you design dummy_scores in ner.py?
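One common reason for a scorer with n_labels - 1 outputs, and possibly the one at work here (an assumption, since the thread does not answer the question), is that the null label gets a fixed score of zero instead of a learned one. The scorer then only needs to produce scores for the real labels, and a zero is prepended before the softmax. A small pure-Python illustration with made-up scores:

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def add_null_score(real_label_scores):
    """Prepend a fixed zero score for the null (no-entity) label."""
    return [0.0] + list(real_label_scores)


# Suppose there are 4 labels in total: null plus 3 entity types.
# The learned scorer outputs only 3 numbers (n_labels - 1).
real_scores = [-2.0, 1.5, -0.5]      # hypothetical scorer output for one span
all_scores = add_null_score(real_scores)
probs = softmax(all_scores)

print(len(all_scores))   # one score per label, including null
print(probs)
```

Fixing one score at zero loses no expressiveness, because softmax depends only on differences between scores. A span is classified as null whenever all real-label scores are negative.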
First, thanks for all your great work!
I've put my own labeled data in the DyGIE++ format, following the instructions in https://github.com/dwadden/dygiepp/blob/master/DATA.md. I was able to train the relation extractor successfully using the ace05_best_relation_bert.jsonnet config, with quite good performance.
When I next tried to train for events using ace05_event.jsonnet, however, I ran into the following error:
Following the comments in https://github.com/dwadden/dygiepp/blob/master/training_config/template_dw.libsonnet#L88, I then made a copy of ace05_event.jsonnet, changing the n_trigger_labels and n_ner_labels values to reflect my own dataset counts, i.e.:
ace05_event_copy.jsonnet
n_trigger_labels: 28, // prev: 34
n_ner_labels: 103, // prev: 8
However, I'm still getting the error RuntimeError: shape '[-1, 2745]' is invalid for input of size 5601840 in https://github.com/dwadden/dygiepp/blob/master/dygie/models/events.py#L553.
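To double-check label counts such as n_trigger_labels and n_ner_labels against the data, the distinct labels can be collected straight from the .jsonl files. This is a sketch under the assumption that documents follow the layout shown elsewhere on this page: ner entries are [start, end, label], and each event is a list whose first item is the trigger [token_index, label], followed by arguments [start, end, label]. The function name collect_labels is hypothetical, and whether the config values should also count a null label is not settled here.

```python
import json


def collect_labels(jsonl_lines):
    """Collect distinct NER, trigger, and argument labels from documents."""
    ner, triggers, arguments = set(), set(), set()
    for line in jsonl_lines:
        if not line.strip():
            continue
        doc = json.loads(line)
        for sentence_ner in doc.get("ner", []):
            for _start, _end, label in sentence_ner:
                ner.add(label)
        for sentence_events in doc.get("events", []):
            for event in sentence_events:
                trigger, *args = event
                triggers.add(trigger[-1])   # trigger is [token_index, label]
                for arg in args:
                    arguments.add(arg[-1])  # argument is [start, end, label]
    return ner, triggers, arguments


# Toy document, invented for illustration only.
example = json.dumps({
    "doc_key": "toy",
    "sentences": [["Shares", "rose", "4.76", "%"]],
    "ner": [[[0, 0, "Security"]]],
    "events": [[[[1, "SecurityValue"], [2, 3, "IncreaseAmount"]]]],
})
ner, triggers, arguments = collect_labels([example])
print(len(ner), len(triggers), len(arguments))
```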
I wanted to replace the pre-trained SciBERT model with a fine-tuned SciBERT model. I did the language model fine-tuning following this notebook: https://github.com/Nikoschenk/language_model_finetuning/blob/master/scibert_fine_tuner.ipynb.
It uses the HuggingFace library for the fine-tuning. The resulting HuggingFace model has these components:
config.json
pytorch_model.bin
special_tokens_map.json
tokenizer_config.json
vocab.txt
I noticed that when I run Dygie’s get_scibert.py script, it downloads the Pytorch model as follows:
scibert_scivocab_cased/weights.tar.gz
scibert_scivocab_cased/vocab.txt
Further, weights.tar.gz, is made up of pytorch_model.bin & bert_config.json.
I repackaged HuggingFace outputs into the new weights.tar.gz (pytorch_model.bin & config.json renamed as bert_config.json).
It works fine.
I just wanted a second opinion about the above approach. Please advise.
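The repackaging step described in this post can be scripted with the standard tarfile module. This is a sketch of that approach only: the function name and paths are illustrative, and the archive layout (pytorch_model.bin plus bert_config.json at the top level) is taken from the description in this post, not verified against get_scibert.py.

```python
import os
import tarfile


def repackage_weights(hf_model_dir, output_tar):
    """Pack HuggingFace outputs into a weights.tar.gz-style archive.

    config.json is stored under the name bert_config.json;
    pytorch_model.bin keeps its name.
    """
    members = {
        "pytorch_model.bin": "pytorch_model.bin",
        "config.json": "bert_config.json",
    }
    with tarfile.open(output_tar, "w:gz") as tar:
        for source_name, archive_name in members.items():
            source_path = os.path.join(hf_model_dir, source_name)
            tar.add(source_path, arcname=archive_name)
    return output_tar
```

A call might look like repackage_weights("finetuned_scibert", "scibert_scivocab_cased/weights.tar.gz"), where both paths are placeholders. The vocab.txt file would still need to be copied next to the archive separately.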
I am trying to run the pretrained SciERC model on the preprocessed SciERC dataset. I run the following command to do so:
allennlp predict pretrained/scierc.tar.gz \
data/processed_data/json/test.json \
--predictor dygie \
--include-package dygie \
--use-dataset-reader \
--output-file predictions/scierc-test.jsonl \
--cuda-device -1 \
--silent
I run into the following error:
Traceback (most recent call last):
File "/home/yelman/anaconda3/envs/dygiepp/lib/python3.7/site-packages/allennlp/data/vocabulary.py", line 724, in get_token_index
return self._token_to_index[namespace][token]
KeyError: ''
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/home/yelman/anaconda3/envs/dygiepp/lib/python3.7/site-packages/allennlp/data/vocabulary.py", line 727, in get_token_index
return self._token_to_index[namespace][self._oov_token]
KeyError: '@@unknown@@'
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/home/yelman/anaconda3/envs/dygiepp/bin/allennlp", line 8, in <module>
sys.exit(run())
File "/home/yelman/anaconda3/envs/dygiepp/lib/python3.7/site-packages/allennlp/__main__.py", line 34, in run
main(prog="allennlp")
File "/home/yelman/anaconda3/envs/dygiepp/lib/python3.7/site-packages/allennlp/commands/__init__.py", line 119, in main
args.func(args)
File "/home/yelman/anaconda3/envs/dygiepp/lib/python3.7/site-packages/allennlp/commands/predict.py", line 224, in _predict
predictor = _get_predictor(args)
File "/home/yelman/anaconda3/envs/dygiepp/lib/python3.7/site-packages/allennlp/commands/predict.py", line 119, in _get_predictor
overrides=args.overrides,
File "/home/yelman/anaconda3/envs/dygiepp/lib/python3.7/site-packages/allennlp/models/archival.py", line 208, in load_archive
model = _load_model(config.duplicate(), weights_path, serialization_dir, cuda_device)
File "/home/yelman/anaconda3/envs/dygiepp/lib/python3.7/site-packages/allennlp/models/archival.py", line 246, in _load_model
cuda_device=cuda_device,
File "/home/yelman/anaconda3/envs/dygiepp/lib/python3.7/site-packages/allennlp/models/model.py", line 406, in load
return model_class._load(config, serialization_dir, weights_file, cuda_device)
File "/home/yelman/anaconda3/envs/dygiepp/lib/python3.7/site-packages/allennlp/models/model.py", line 305, in _load
vocab=vocab, params=model_params, serialization_dir=serialization_dir
File "/home/yelman/anaconda3/envs/dygiepp/lib/python3.7/site-packages/allennlp/common/from_params.py", line 604, in from_params
**extras,
File "/home/yelman/anaconda3/envs/dygiepp/lib/python3.7/site-packages/allennlp/common/from_params.py", line 634, in from_params
return constructor_to_call(**kwargs) # type: ignore
File "/home/yelman/Desktop/dygiepp-master/dygie/models/dygie.py", line 111, in __init__
params=modules.pop("ner"))
File "/home/yelman/anaconda3/envs/dygiepp/lib/python3.7/site-packages/allennlp/common/from_params.py", line 634, in from_params
return constructor_to_call(**kwargs) # type: ignore
File "/home/yelman/Desktop/dygiepp-master/dygie/models/ner.py", line 50, in __init__
null_label = vocab.get_token_index("", namespace)
File "/home/yelman/anaconda3/envs/dygiepp/lib/python3.7/site-packages/allennlp/data/vocabulary.py", line 732, in get_token_index
f"'{token}' not found in vocab namespace '{namespace}', and namespace "
KeyError: "'' not found in vocab namespace 'scierc__ner_labels', and namespace does not contain the default OOV token ('@@unknown@@')"
Hi there! We're trying to use the pre-trained ACE05-Event model for a project on new data. However, we've been struggling to get it to run. To test it out, we tried the following on a line of data from the SciERC dataset and got the error below.
!echo '{"events": [[], [], [], [], []], "clusters": [[[6, 17], [32, 32]], [[4, 4], [55, 55], [91, 91]], [[58, 62], [64, 64], [79, 79]]], "sentences": [["This", "paper", "presents", "an", "algorithm", "for", "computing", "optical", "flow", ",", "shape", ",", "motion", ",", "lighting", ",", "and", "albedo", "from", "an", "image", "sequence", "of", "a", "rigidly-moving", "Lambertian", "object", "under", "distant", "illumination", "."], ["The", "problem", "is", "formulated", "in", "a", "manner", "that", "subsumes", "structure", "from", "motion", ",", "multi-view", "stereo", ",", "and", "photo-metric", "stereo", "as", "special", "cases", "."], ["The", "algorithm", "utilizes", "both", "spatial", "and", "temporal", "intensity", "variation", "as", "cues", ":", "the", "former", "constrains", "flow", "and", "the", "latter", "constrains", "surface", "orientation", ";", "combining", "both", "cues", "enables", "dense", "reconstruction", "of", "both", "textured", "and", "texture-less", "surfaces", "."], ["The", "algorithm", "works", "by", "iteratively", "estimating", "affine", "camera", "parameters", ",", "illumination", ",", "shape", ",", "and", "albedo", "in", "an", "alternating", "fashion", "."], ["Results", "are", "demonstrated", "on", "videos", "of", "hand-held", "objects", "moving", "in", "front", "of", "a", "fixed", "light", "and", "camera", "."]], "ner": [[[4, 4, "Generic"], [6, 17, "Task"], [20, 21, "Material"], [24, 26, "Material"], [28, 29, "OtherScientificTerm"]], [[32, 32, "Generic"], [42, 42, "Material"], [44, 45, "Material"], [48, 49, "Material"]], [[55, 55, "Generic"], [58, 62, "OtherScientificTerm"], [64, 64, "Generic"], [67, 67, "Generic"], [69, 69, "OtherScientificTerm"], [72, 72, "Generic"], [74, 75, "OtherScientificTerm"], [79, 79, "Generic"], [81, 88, "Task"]], [[91, 91, "Generic"], [95, 105, "Method"]], [[115, 118, "Material"]]], "relations": [[[4, 4, 6, 17, "USED-FOR"], [20, 21, 4, 4, "USED-FOR"], [24, 26, 20, 21, "FEATURE-OF"], [28, 29, 24, 26, 
"FEATURE-OF"]], [[42, 42, 44, 45, "CONJUNCTION"], [44, 45, 48, 49, "CONJUNCTION"]], [[58, 62, 55, 55, "USED-FOR"], [67, 67, 64, 64, "HYPONYM-OF"], [67, 67, 69, 69, "USED-FOR"], [67, 67, 72, 72, "CONJUNCTION"], [72, 72, 64, 64, "HYPONYM-OF"], [72, 72, 74, 75, "USED-FOR"], [79, 79, 81, 88, "USED-FOR"]], [[95, 105, 91, 91, "USED-FOR"]], []], "doc_key": "ICCV_2003_158_abs"}' > scierc-test.jsonl
!allennlp predict pretrained/ace05-event.tar.gz \
scierc-test.jsonl \
--predictor dygie \
--include-package dygie \
--use-dataset-reader \
--output-file predictions/scierc-test.jsonl
This is the error:
2020-06-09 22:19:37,555 - ERROR - allennlp.data.vocabulary - Namespace: ner_labels
2020-06-09 22:19:37,555 - ERROR - allennlp.data.vocabulary - Token: Generic
Traceback (most recent call last):
File "/usr/local/bin/allennlp", line 8, in <module>
sys.exit(run())
File "/usr/local/lib/python3.6/dist-packages/allennlp/run.py", line 18, in run
main(prog="allennlp")
File "/usr/local/lib/python3.6/dist-packages/allennlp/commands/__init__.py", line 102, in main
args.func(args)
File "/usr/local/lib/python3.6/dist-packages/allennlp/commands/predict.py", line 227, in _predict
manager.run()
File "/usr/local/lib/python3.6/dist-packages/allennlp/commands/predict.py", line 201, in run
for model_input_instance, result in zip(batch, self._predict_instances(batch)):
File "/usr/local/lib/python3.6/dist-packages/allennlp/commands/predict.py", line 159, in _predict_instances
results = [self._predictor.predict_instance(batch_data[0])]
File "./dygie/predictors/dygie.py", line 81, in predict_instance
dataset.index_instances(model.vocab)
File "/usr/local/lib/python3.6/dist-packages/allennlp/data/dataset.py", line 155, in index_instances
instance.index_fields(vocab)
File "/usr/local/lib/python3.6/dist-packages/allennlp/data/instance.py", line 72, in index_fields
field.index(vocab)
File "/usr/local/lib/python3.6/dist-packages/allennlp/data/fields/sequence_label_field.py", line 100, in index
for label in self.labels]
File "/usr/local/lib/python3.6/dist-packages/allennlp/data/fields/sequence_label_field.py", line 100, in <listcomp>
for label in self.labels]
File "/usr/local/lib/python3.6/dist-packages/allennlp/data/vocabulary.py", line 637, in get_token_index
return self._token_to_index[namespace][self._oov_token]
KeyError: '@@UNKNOWN@@'
We aren't very familiar with allennlp, so we aren't sure how to diagnose this issue. We're not sure why there is an issue with the NER labels, since we're hoping to just predict them rather than train the model on them. Is there a way to override this?
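The traceback fails while indexing the gold label Generic, which the ner_labels namespace of this model apparently does not contain. One thing that might help (an assumption, not a confirmed fix) is to blank out the gold annotations before prediction so that no foreign labels have to be looked up. The sketch below replaces ner, relations and events with one empty list per sentence and copies all other fields unchanged. The helper strip_gold_labels is hypothetical and not part of dygiepp.

```python
import json

LABELED_FIELDS = ("ner", "relations", "events")


def strip_gold_labels(doc):
    """Return a copy of doc with per-sentence gold annotations emptied."""
    n_sentences = len(doc["sentences"])
    cleaned = dict(doc)
    for field in LABELED_FIELDS:
        if field in cleaned:
            cleaned[field] = [[] for _ in range(n_sentences)]
    return cleaned


# Toy document, invented for illustration only.
doc = {
    "doc_key": "toy",
    "sentences": [["This", "paper", "presents", "an", "algorithm"], ["It", "works"]],
    "ner": [[[4, 4, "Generic"]], []],
    "relations": [[], []],
}
print(json.dumps(strip_gold_labels(doc)))
```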
Thanks a lot!
When I run the command bash ./scripts/train/train_genia.sh 0, I get this error. Where can I get model.tar.gz?
Hi ,
I think span representation is a great idea. Do you think span representations are suitable for keyword extraction and Chinese word segmentation?
Thanks!
I want to run the event extraction pipeline on my own dataset of ACE/ERE-like events.
The events have multi-token and discontinuous trigger spans.
I have not yet dived deep into the model code, but was hoping for some ideas how to adapt it to multi-token spans.
For now, I will parse and use the head token of the trigger annotations, but due to the discriminative and content-rich nature of my multi-token triggers I expect it to hurt trigger classification.
Thank you for making this code available and supporting it. Having a SotA event extraction system available really helps my research.
It looks like scripts/pretrained/get_scibert.py was removed in this commit, but the Dockerfile still expects it:
$ docker build .
420100K .......... .......... .......... .......... .......... 99% 2.89M 0s
420150K .......... .......... .......... .......... .......... 99% 4.39M 0s
420200K .......... ... 100% 108M=4m20s
2021-05-13 18:44:49 (1.58 MB/s) - ‘./pretrained/mechanic-granular.tar.gz’ saved [430298612/430298612]
Removing intermediate container 2024d2013fad
---> b44f296f2ce3
Step 19/27 : COPY scripts/pretrained/get_scibert.py /tmp/get_scibert.py
COPY failed: stat /var/lib/docker/tmp/docker-builder787936846/scripts/pretrained/get_scibert.py: no such file or directory
I understand that the Dockerfile isn't officially supported; I just want to log this here in case someone else encounters the same issue.
I tried setting loss_weights: ner: to 0, but training errors with:
2020-08-14 08:45:43,915 - INFO - allennlp.training.trainer - Training
0%| | 0/183 [00:04<?, ?it/s]
Traceback (most recent call last):
File "/opt/conda/envs/dygiepp/bin/allennlp", line 8, in <module>
sys.exit(run())
File "/opt/conda/envs/dygiepp/lib/python3.7/site-packages/allennlp/run.py", line 18, in run
main(prog="allennlp")
File "/opt/conda/envs/dygiepp/lib/python3.7/site-packages/allennlp/commands/__init__.py", line 102, in main
args.func(args)
File "/opt/conda/envs/dygiepp/lib/python3.7/site-packages/allennlp/commands/train.py", line 124, in train_model_from_args
args.cache_prefix)
File "/opt/conda/envs/dygiepp/lib/python3.7/site-packages/allennlp/commands/train.py", line 168, in train_model_from_file
cache_directory, cache_prefix)
File "/opt/conda/envs/dygiepp/lib/python3.7/site-packages/allennlp/commands/train.py", line 252, in train_model
metrics = trainer.train()
File "/opt/conda/envs/dygiepp/lib/python3.7/site-packages/allennlp/training/trainer.py", line 478, in train
train_metrics = self._train_epoch(epoch)
File "/opt/conda/envs/dygiepp/lib/python3.7/site-packages/allennlp/training/trainer.py", line 320, in _train_epoch
loss = self.batch_loss(batch_group, for_training=True)
File "/opt/conda/envs/dygiepp/lib/python3.7/site-packages/allennlp/training/trainer.py", line 256, in batch_loss
output_dict = training_util.data_parallel(batch_group, self.model, self._cuda_devices)
File "/opt/conda/envs/dygiepp/lib/python3.7/site-packages/allennlp/training/util.py", line 331, in data_parallel
outputs = parallel_apply(replicas, inputs, moved, used_device_ids)
File "/opt/conda/envs/dygiepp/lib/python3.7/site-packages/torch/nn/parallel/parallel_apply.py", line 85, in parallel_apply
output.reraise()
File "/opt/conda/envs/dygiepp/lib/python3.7/site-packages/torch/_utils.py", line 369, in reraise
raise self.exc_type(msg)
KeyError: Caught KeyError in replica 0 on device 2.
Original Traceback (most recent call last):
File "/opt/conda/envs/dygiepp/lib/python3.7/site-packages/torch/nn/parallel/parallel_apply.py", line 60, in _worker
output = module(*input, **kwargs)
File "/opt/conda/envs/dygiepp/lib/python3.7/site-packages/torch/nn/modules/module.py", line 547, in __call__
result = self.forward(*input, **kwargs)
File "./dygie/models/dygie.py", line 298, in forward
ner_labels, metadata)
File "/opt/conda/envs/dygiepp/lib/python3.7/site-packages/torch/nn/modules/module.py", line 547, in __call__
result = self.forward(*input, **kwargs)
File "./dygie/models/events.py", line 183, in forward
ner_scores = output_ner["ner_scores"]
KeyError: 'ner_scores'
It seems that in events.py the ner_scores are not optional right now, as they are passed as a required argument to the _mention_pruner function.
For now I am testing the pipeline by producing NER predictions from the pre-trained ACE05-event model on my custom data and feeding that into the model for silver-standard reference NER labels.
I also set event_args_use_ner_labels to false. Is this required, or are the NER labels referred to here model predictions rather than gold-standard labels?
I get higher F1 for TrigC and ArgC with this set to false when training with NER labels predicted by the pre-trained ACE05 model.
Hi,
This is the first time I have used dygiepp, so I am quite inexperienced.
I used the processed ace-event dataset. When starting training, I get this error:
ValueError: The following unexpected fields should be prefixed with an underscore: sentence_start.
One of the fields in the .json files I used is "sentence_start"...
In document.py I find:
"Make sure we only have allowed fields."
allowed_field_regex = ("doc_key|dataset|sentences|weight|.*ner$|"
".*relations$|.*clusters$|.*events$|^_.*")
Prefixing sentence_start with an underscore yields another error message.
Thanks for helping me out.
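To see why the ValueError names sentence_start, the check can be reproduced outside the library. The sketch below copies the pattern quoted from document.py above; the helper name unexpected_fields and the assumption that the pattern is applied with re.match are illustrative, not taken from the repository. It only covers the field-name check and says nothing about the second error, whose message is not given.

```python
import re

# Pattern copied from the document.py snippet quoted in the post.
ALLOWED_FIELD_REGEX = re.compile(
    "doc_key|dataset|sentences|weight|.*ner$|"
    ".*relations$|.*clusters$|.*events$|^_.*"
)


def unexpected_fields(doc):
    """Return top-level keys the pattern rejects (hypothetical helper, assumes re.match)."""
    return [key for key in doc if not ALLOWED_FIELD_REGEX.match(key)]


# A made-up document with the field from the error message.
doc = {
    "doc_key": "example",
    "sentences": [["a", "b"]],
    "ner": [[]],
    "sentence_start": [0],
}
print(unexpected_fields(doc))

# Renaming the key so it starts with an underscore satisfies the last alternative.
renamed = {("_" + k if k == "sentence_start" else k): v for k, v in doc.items()}
print(unexpected_fields(renamed))
```

Running a helper like this over every line of a .json file is one way to list all offending fields before starting training.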
I have used the code to preprocess the ACE05 dataset. I am wondering if you could explain the output format.
I got 351/80/80 lines in the train/dev/test json files. One line from the train.json file looks like this:
{"sentences": [["CNN_ENG_20030515_073019", ".7"], ["NEWS", "STORY"], ["2003-05-15", "09:52:27"], ["earlier", "we", "talk", "about", "a", "new", "book", "claiming", "that", "president", "john", "kennedy", "had", "an", "affair", "with", "a", "white", "house", "intern", "early", "1960s", "."], ["a", "kennedy", "biographer", "robert", "dallek", "came", "across", "the", "story", "while", "doing", "research", "."], ["the", "woman", "'s", "name", "has", "remain", "a", "mystery", "."], ["the", "60-year-old", "tells", "the", "new", "york", "daily", "news", "and", "others", "dpa", "she", "is", "glad", "to", "have", "the", "weight", "she", "'s", "been", "carrying", "for", "41", "years", "now", "off", "her", "shoulders", "."], ["she", "says", "she", "was", "19", "at", "the", "time", ",", "working", "in", "d.c", ".", "at", "the", "white", "house", ",", "1962", ",", "1963", ",", "she", "says", "today", "the", "allegations", "about", "her", "affair", "are", "the", "truth", "."], ["right", "now", "she", "lives", "on", "the", "upper", "east", "side", ",", "works", "at", "a", "presbyterian", "church", ",", "has", "two", "married", "daughters", "and", "after", "the", "news", "is", "out", ",", "after", "carrying", "it", "for", "41", "years", ",", "she", "feels", "better", "about", "it", "."], ["this", "news", "breaking", "just", "today", "."], ["robert", "dallek", "tried", "to", "do", "the", "research", "to", "track", "this", "woman", "down", "in", "the", "book", ",", "it", "did", "not", "happen", "but", "now", "she", "has", "indeed", "come", "forward", "."], ["2003-05-15", "09:53:14"]], "ner": [[], [], [], [[7, 7, "PER"], [16, 17, "PER"], [15, 15, "PER"], [25, 25, "PER"], [23, 24, "ORG"]], [[30, 30, "PER"], [31, 31, "PER"], [32, 33, "PER"]], [[43, 43, "PER"]], [[52, 52, "PER"], [62, 62, "PER"], [69, 69, "PER"], [78, 78, "PER"], [55, 58, "ORG"], [60, 60, "ORG"]], [[92, 92, "GPE"], [83, 83, "PER"], [103, 103, "PER"], [109, 109, "PER"], [81, 81, "PER"], [96, 97, "ORG"]], [[121, 123, 
"LOC"], [134, 134, "PER"], [129, 129, "ORG"], [117, 117, "PER"], [149, 149, "PER"]], [], [[171, 171, "PER"], [183, 183, "PER"], [161, 162, "PER"]], []], "relations": [[], [], [], [[16, 17, 25, 25, "PER-SOC"], [25, 25, 23, 24, "ORG-AFF"]], [], [], [], [[83, 83, 96, 97, "ORG-AFF"], [81, 81, 92, 92, "PHYS"]], [[117, 117, 121, 123, "GEN-AFF"], [117, 117, 129, 129, "ORG-AFF"], [117, 117, 134, 134, "PER-SOC"]], [], [], []], "clusters": [], "doc_key": "CNN_ENG_20030515_073019.7"}
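Reading the sample line above, each line appears to be one document: "sentences" is a list of tokenized sentences, "ner" and "relations" hold one list per sentence, and the span indices count tokens from the start of the document (not the sentence) with an inclusive end. The sketch below checks that reading against the first four sentences of the sample; the function names are illustrative and not part of dygiepp.

```python
def flatten_tokens(doc):
    """Concatenate all sentences so that spans can be indexed at document level."""
    return [tok for sent in doc["sentences"] for tok in sent]


def entity_mentions(doc):
    """Return (text, label) for every NER span; end indices are inclusive."""
    tokens = flatten_tokens(doc)
    return [
        (" ".join(tokens[start:end + 1]), label)
        for sent_ner in doc["ner"]
        for start, end, label in sent_ner
    ]


def relation_mentions(doc):
    """Return (first span text, second span text, label) for every relation."""
    tokens = flatten_tokens(doc)
    return [
        (" ".join(tokens[s1:e1 + 1]), " ".join(tokens[s2:e2 + 1]), label)
        for sent_rel in doc["relations"]
        for s1, e1, s2, e2, label in sent_rel
    ]


# Excerpt of the sample line: the first four sentences and their annotations.
doc = {
    "sentences": [
        ["CNN_ENG_20030515_073019", ".7"],
        ["NEWS", "STORY"],
        ["2003-05-15", "09:52:27"],
        ["earlier", "we", "talk", "about", "a", "new", "book", "claiming",
         "that", "president", "john", "kennedy", "had", "an", "affair",
         "with", "a", "white", "house", "intern", "early", "1960s", "."],
    ],
    "ner": [[], [], [], [[7, 7, "PER"], [16, 17, "PER"], [15, 15, "PER"],
                         [25, 25, "PER"], [23, 24, "ORG"]]],
    "relations": [[], [], [], [[16, 17, 25, 25, "PER-SOC"],
                               [25, 25, 23, 24, "ORG-AFF"]]],
}

print(entity_mentions(doc))
print(relation_mentions(doc))
```

With this interpretation, [16, 17, "PER"] resolves to "john kennedy" and the relation [25, 25, 23, 24, "ORG-AFF"] links "intern" to "white house", which matches the text.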
Hi,
Thank you for the excellent work. I followed the setup steps, creating the environment and installing all requirements exactly as described. I want to download the scierc dataset but ran into the following problem.
I ran exactly these commands:
conda create --name dygiepp python=3.7
pip install -r requirements.txt
conda develop .
bash ./scripts/data/get_scierc.sh
This is the problem I have:
Traceback (most recent call last):
File "scripts/data/shared/normalize.py", line 5, in <module>
from dygie.data.dataset_readers.document import Document, Dataset
File "/ldap_shared/home/v_yuchen_zeng/696ds/dygiepp-master/dygie/data/__init__.py", line 1, in <module>
from dygie.data.dataset_readers.dygie import DyGIEReader
File "/ldap_shared/home/v_yuchen_zeng/696ds/dygiepp-master/dygie/data/dataset_readers/dygie.py", line 29, in <module>
class DyGIEReader(DatasetReader):
File "/ldap_shared/home/v_yuchen_zeng/696ds/dygiepp-master/dygie/data/dataset_readers/dygie.py", line 202, in DyGIEReader
@overrides
File "/ldap_shared/home/v_yuchen_zeng/anaconda3/envs/dygiepp/lib/python3.7/site-packages/overrides/overrides.py", line 67, in overrides
raise AssertionError('No super class method found for "%s"' % method.__name__)
AssertionError: No super class method found for "_instances_from_cache_file"
Traceback (most recent call last):
File "scripts/data/shared/collate.py", line 4, in <module>
from dygie.data.dataset_readers import document
File "/ldap_shared/home/v_yuchen_zeng/696ds/dygiepp-master/dygie/data/__init__.py", line 1, in <module>
from dygie.data.dataset_readers.dygie import DyGIEReader
File "/ldap_shared/home/v_yuchen_zeng/696ds/dygiepp-master/dygie/data/dataset_readers/dygie.py", line 29, in <module>
class DyGIEReader(DatasetReader):
File "/ldap_shared/home/v_yuchen_zeng/696ds/dygiepp-master/dygie/data/dataset_readers/dygie.py", line 202, in DyGIEReader
@overrides
File "/ldap_shared/home/v_yuchen_zeng/anaconda3/envs/dygiepp/lib/python3.7/site-packages/overrides/overrides.py", line 67, in overrides
raise AssertionError('No super class method found for "%s"' % method.__name__)
AssertionError: No super class method found for "_instances_from_cache_file"
I tried the emnlp-2019 branch but still got the same problem. Have you ever run into this problem before? Thanks!
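The AssertionError above says the decorator found no superclass of DyGIEReader that defines _instances_from_cache_file. One plausible reading, which is an assumption and not confirmed in this thread, is that the installed base class differs from the one the code was written against. The sketch below is a simplified stand-in for that kind of check using only the standard library; it is not the overrides package's implementation, and OldReader and MyReader are hypothetical classes.

```python
def missing_overrides(cls, method_names):
    """Return the names in method_names that no base class of cls defines."""
    bases = cls.__mro__[1:]  # every class after cls itself in the resolution order
    return [
        name for name in method_names
        if not any(name in vars(base) for base in bases)
    ]


class OldReader:
    """Hypothetical base class that only provides _read."""

    def _read(self, path):
        return []


class MyReader(OldReader):
    """Hypothetical subclass that expects both hooks to exist on the base."""

    def _read(self, path):
        return []

    def _instances_from_cache_file(self, path):
        return []


print(missing_overrides(MyReader, ["_read", "_instances_from_cache_file"]))
```

Pointing a helper like this at the real reader class and its installed base class would show whether the method is actually missing from the base in a given environment.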
I'm going to close this for lack of activity, feel free to reopen if not resolved.
Originally posted by @dwadden in #59 (comment)
Hi, thanks for your response. I managed to bypass the error last time and extract entities on my custom dataset (though I wasn't able to run it on the pre-existing test.json). I am now trying to run it on a different dataset, but the error persists. I created a new environment and installed the dependencies from scratch using the updated requirements.txt. The stack trace is the same as above.
Thank you for making the code available! It would be very useful to us, but could you maybe add a license?
Hello,
Thanks for the great work! I really like it and appreciate you posting it here.
My end goal is to predict the relationship between dose groups (drugs) and adverse events using your model. For example, the text will contain sentences like the following: group 3 showed decrease in food consumption. group 2 had inflammation and increase in alopecia. Group 4 exhibited sudden heart attack.
Here, the entities are the dose groups and adverse events (bolded), and each relation is one of increase, decrease, or present (adverse event present).
The raw data I have is only partially labeled: for each sentence I have the labeled entities and relations, but not the indices of where these entities and relations start and end within the sentence (like you mention here).
Can you please help me with how to pre-process my raw data so that it can be used as input to your model?
Also, with the preprocessed data, do you think it will work if I train the ChemProt model on my new training dataset? I found that ChemProt is the closest domain/model to my dataset.
Thank you so much in advance, and I hope you reply :)
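As a starting point for the missing indices, entity mentions can be located by matching their tokens against the tokenized sentence. The sketch below is a minimal illustration: the tokenizer is a naive regex, the labels DOSE_GROUP, ADVERSE_EVENT, DECREASE and PRESENT are made up for this example, and the output layout assumes the document-level, inclusive token indices seen in the sample ACE05 JSON line earlier on this page. Only the first matching occurrence of each mention is used.

```python
import re


def tokenize(text):
    """Naive tokenizer (an assumption): runs of word characters or single punctuation marks."""
    return re.findall(r"\w+|[^\w\s]", text)


def find_span(tokens, mention_tokens, offset=0):
    """Return [start, end] (inclusive, shifted by offset) of the first match, or None."""
    n = len(mention_tokens)
    lowered = [t.lower() for t in tokens]
    target = [t.lower() for t in mention_tokens]
    for i in range(len(tokens) - n + 1):
        if lowered[i:i + n] == target:
            return [offset + i, offset + i + n - 1]
    return None


def to_doc(doc_key, labeled_sentences):
    """Build one document dict from (text, entities, relations) triples."""
    doc = {"doc_key": doc_key, "sentences": [], "ner": [], "relations": []}
    offset = 0  # number of tokens in all previous sentences
    for text, entities, relations in labeled_sentences:
        tokens = tokenize(text)
        spans = {}
        for mention, _label in entities:
            span = find_span(tokens, tokenize(mention), offset)
            if span is not None:
                spans[mention] = span
        doc["sentences"].append(tokens)
        doc["ner"].append(
            [spans[m] + [label] for m, label in entities if m in spans]
        )
        doc["relations"].append(
            [spans[a] + spans[b] + [rel]
             for a, b, rel in relations if a in spans and b in spans]
        )
        offset += len(tokens)
    return doc


labeled = [
    ("group 3 showed decrease in food consumption.",
     [("group 3", "DOSE_GROUP"), ("food consumption", "ADVERSE_EVENT")],
     [("group 3", "food consumption", "DECREASE")]),
    ("Group 4 exhibited sudden heart attack.",
     [("Group 4", "DOSE_GROUP"), ("heart attack", "ADVERSE_EVENT")],
     [("Group 4", "heart attack", "PRESENT")]),
]
doc = to_doc("example_doc", labeled)
print(doc["ner"])
print(doc["relations"])
```

Mentions that cannot be found are silently skipped here; in practice it is worth logging them, since a mismatch usually points to a tokenization or spelling difference between the label and the sentence.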
Hi! I'm interested in using dygiepp for automatically generating keywords from scientific abstracts, and testing how it performs compared to Textacy algorithms for keyword extraction (for context, see Mini-Conf/Mini-Conf#34).
I'm looking to preprocess sample abstracts in https://github.com/anaerobeth/dygiepp/tree/keyword-extraction/data/miniconf/raw_data to generate suitable .txt and .ann files for use in making predictions using the pretrained scierc lightweight model. Can you provide some guidance on how to get started with this task? Thanks!
Hi,
Where do you get the "English" file or directory for ace05?
After running get_corenlp.sh and then get_ace05.sh, I get an error:
cp: cannot stat ‘./scripts/data/ace05/common//English’: No such file or directory
run.zsh:4: no matches found: English//timex2norm/*.sgm
etc
Thanks for any help.
Hi,
In the SciERC dataset (scierc.raw.tar.gz), there are three files for each abstract: .txt (the raw text of the abstract), .ann (the annotations for entities, relations, and coreference), and a third file, .xml.txt, which seems to contain details such as POS tags and parse tree/dependency information. What is this xml file and how is it generated? What is the process for creating it if we want to train DyGIE++ on our own domain-specific documents?
I would appreciate a quick response, thanks,
Sundar