
edge-oriented-graph's People

Contributors

fenchri


edge-oriented-graph's Issues

need your help!

Hello, sorry to disturb you. I have recently been studying your algorithm, but I ran into an error in the data processing code that I could not solve. Could you please share a copy of the processed dataset produced after running process_cdr.sh?

training error

Hello, dear fenchri. When I run this command:
python3 eog.py --config ../configs/parameters_cdr.yaml --train --gpu -1 --early_stop

something goes wrong, as follows:

(error screenshot attached)

Thank you so much!

question about dataset statistics

In your paper, the numbers of negative pairs for train, dev and test are 4202, 4075 and 4138; however, statistics.py gives me 1572 (intra) + 2822 (inter) = 4394, 1709 + 2540 = 4249, and 1656 + 2683 = 4339. Could you please tell me where the difference may come from?

Is the CDR dev dataset not used?

It seems that the actual test dataset is not used in the experiment.

In configs/parameters_cdr.yaml, dev_filter.data is set as the test file, so dev_filter.data is used in both the training and the testing procedure. In other words, you test the model on the dev dataset.

Most importantly, the performance is not as good as reported in the paper.
On the dev dataset:
Loading model & parameters ... TEST | LOSS = 0.46038, ACC = 0.8370 , MICRO P/R/F1 = 0.5810 0.6482 0.6128 | TP/ACTUAL/PRED = 656 /1012 /1129 , TOTAL 5087 | 0h 00m 11s

And on the actual test dataset:
TEST | LOSS = 0.47821, ACC = 0.8263 , MICRO P/R/F1 = 0.5732 0.5947 0.5838 | TP/ACTUAL/PRED = 634 /1066 /1106 , TOTAL 5204 | 0h 00m 11s

Reported in your paper:
(results table screenshot attached)

Is there anything wrong?

confused about the graph structure?

@fenchri
Hi fenchri,
I am curious about the graph in this paper. Is it an ordinary graph, i.e. just a data structure, different from a GNN/GCN? I ask because the word embeddings are learned from GloVe or pretraining, not node2vec.
I hope you can see what I mean.
Thanks!

how to understand the processed data

I am a bit confused about the representation of processed_data.

 PID<tab>tokenized_sentences_separated_by_|<tab>relation<tab>direction<tab>cross/non-cross<tab>closest_arg1_start-closest_arg1_end<tab>closest_arg2_start-closest_arg2_end<tab>arg1_KB_id<tab>mentions_for_arg1_separated_by_|<tab>arg1_type<tab>start_token_ids_arg1_separated_by_:<tab>end_token_ids_arg1_separated_by_:<tab>sentence_ids_separated_by_:<tab>arg2_KB_id<tab>mentions_for_arg2_separated_by_|<tab>arg2_type<tab>start_token_ids_arg2_separated_by_:<tab>end_token_ids_arg2_separated_by_:<tab>sentence_ids_separated_by_:

For example, the following text is the third segment of the first line in train_filter.data:

1:CID:2	R2L	NON-CROSS	58-61	51-52	D008750	alpha - methyldopa|alpha - methyldopa	Chemical	58:203	61:206	2:6	D007022	hypotensive	Disease	51	52	2

which I think means there is a CID relation, at the concept level, between the entities D008750 and D007022. D008750 has two mentions, at (58, 61) and (203, 206), while D007022 has only one mention, at (51, 52). Is this correct?

And what does NON-CROSS mean here, given that the relation does span sentences (e.g. mention 2 of D008750 is in sentence 6 while D007022 is in sentence 2)? Is it used specifically for the relation between the closest mention pair?
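For what it's worth, the column description above can be turned into a small parser. Below is a sketch under the assumption that the fields of one pair entry are tab-separated exactly as listed; the function and dictionary key names are my own, not from the repo:

```python
# Hypothetical parser for one tab-separated pair entry from *_filter.data,
# based on the column description above (field names are my own).
def parse_pair(segment: str) -> dict:
    f = segment.split("\t")
    return {
        "relation": f[0],      # e.g. "1:CID:2"
        "direction": f[1],     # "L2R" or "R2L"
        "cross": f[2],         # CROSS / NON-CROSS for the closest mention pair
        "arg1_id": f[5],
        # (start, end) token ids per mention, split on ":"
        "arg1_mentions": list(zip(f[8].split(":"), f[9].split(":"))),
        "arg1_sents": f[10].split(":"),
        "arg2_id": f[11],
        "arg2_mentions": list(zip(f[14].split(":"), f[15].split(":"))),
        "arg2_sents": f[16].split(":"),
    }

entry = ("1:CID:2\tR2L\tNON-CROSS\t58-61\t51-52\tD008750\t"
         "alpha - methyldopa|alpha - methyldopa\tChemical\t58:203\t61:206\t2:6\t"
         "D007022\thypotensive\tDisease\t51\t52\t2")
pair = parse_pair(entry)
print(pair["arg1_id"], pair["arg1_mentions"])  # D008750 [('58', '61'), ('203', '206')]
```

On the example above this yields two mentions for D008750 and one for D007022, matching the reading in the question.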

Error Implementation of Embedding Layer

In modules.py at line 56, you used:

self.embedding.from_pretrained(pret_embeds, freeze=self.freeze)

but the from_pretrained method is a class method, which does not change self.embedding, so only the randomly initialized embedding is used. I think it should be

self.embedding = nn.Embedding.from_pretrained(pret_embeds, freeze=self.freeze)

but when I fixed this bug, the PubMed result (about 0.61) was not as good as random initialization (about 0.63, which matches your reported PubMed result) on the CDR dataset.
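The root cause is that `from_pretrained` builds and returns a *new* module rather than mutating the one it is called on, so the return value must be assigned back. A stdlib-only sketch of the same pattern (a toy class of my own, not PyTorch itself):

```python
class Embedding:
    """Toy stand-in for nn.Embedding, to illustrate the classmethod pitfall."""
    def __init__(self, weights=None):
        self.weights = weights if weights is not None else "random-init"

    @classmethod
    def from_pretrained(cls, weights):
        # Builds and RETURNS a new instance; it never mutates an existing one.
        return cls(weights)

emb = Embedding()
emb.from_pretrained("glove-vectors")   # return value discarded -> no effect
print(emb.weights)                     # random-init

emb = Embedding.from_pretrained("glove-vectors")  # the fix: rebind the attribute
print(emb.weights)                     # glove-vectors
```

The same applies to the real `nn.Embedding.from_pretrained`: calling it through an instance silently discards the pretrained module.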

a question about the processed train.data

@fenchri
I still have a question about generating the processed train.data from the processed training.pubtator.
Can you share an example of what the processed train.data looks like? Just one PID from train.data would do.
Thanks!

FileNotFoundError: [Errno 2] No such file or directory: 'temp_file.split.txt'

sh process_cdr.sh
11923it [00:00, 189285.60it/s]
Processing Doc_ID 227508:   0%| | 0/500 [00:00<?, ?it/s]
Traceback (most recent call last):
  File "process.py", line 123, in <module>
    main()
  File "process.py", line 62, in main
    split_sents = sentence_split_genia(orig_sentences)
  File "/home/deep/dp/edge-oriented-graph/data_processing/tools.py", line 248, in sentence_split_genia
    with open('temp_file.split.txt', 'r') as ifile:
FileNotFoundError: [Errno 2] No such file or directory: 'temp_file.split.txt'

some questions about the processed data

PID<tab>tokenized_sentences_separated_by_|<tab>relation<tab>direction<tab>cross/non-cross<tab>closest_arg1_start-closest_arg1_end<tab>closest_arg2_start-closest_arg2_end<tab>arg1_KB_id<tab>mentions_for_arg1_separated_by_|<tab>arg1_type<tab>start_token_ids_arg1_separated_by_:<tab>end_token_ids_arg1_separated_by_:<tab>sentence_ids_separated_by_:<tab>arg2_KB_id<tab>mentions_for_arg2_separated_by_|<tab>arg2_type<tab>start_token_ids_arg2_separated_by_:<tab>end_token_ids_arg2_separated_by_:<tab>sentence_ids_separated_by_:

I saw that this is the processed data format.
I want to ask: what does direction mean?

why

                if p[1] == "L2R":
                    h_id, t_id = p[5], p[11]
                    h_start, t_start = p[8], p[14]
                    h_end, t_end = p[9], p[15]
                else:
                    t_id, h_id = p[5], p[11]
                    t_start, h_start = p[8], p[14]
                    t_end, h_end = p[9], p[15]

What are L2R and R2L, and why do you have to swap the positions of the head entity and the tail entity?
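As far as I can tell, the swap just normalises each pair into (head, tail) order: in the processed file the first argument is the one that appears first in the text, and the direction flag records whether that first argument is the head (L2R) or the tail (R2L). An illustrative sketch (the helper name is mine; the field indices follow the snippet above and the column description earlier on this page):

```python
def head_tail(fields):
    """Return (head_id, tail_id) for one pair entry.

    fields[1] is the direction flag: 'L2R' means arg1 (fields[5]) is the
    head, 'R2L' means arg2 (fields[11]) is the head.  Swapping recovers
    the directed relation from the textual (arg1, arg2) order.
    """
    arg1_id, arg2_id = fields[5], fields[11]
    if fields[1] == "L2R":
        return arg1_id, arg2_id
    return arg2_id, arg1_id

# 'R2L' example: the head is the second argument in text order.
fields = ["1:CID:2", "R2L", "NON-CROSS", "58-61", "51-52",
          "D008750", "mentions", "Chemical", "58:203", "61:206", "2:6",
          "D007022", "hypotensive", "Disease", "51", "52", "2"]
print(head_tail(fields))  # ('D007022', 'D008750')
```

The same swap is applied to the start/end offset fields in the quoted code, for the same reason.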

Some questions about model modification

Thanks for your excellent work!
I want to test the effectiveness of the model when considering all possible relations between entities in one document, so I added all the non-related entity pairs to your model. I modified the following:

  1. reader.py: I add all entities in an article to the OrderedDict "entities" and define a relation type for "no relation". Then I combine all entities in pairs and add them to the OrderedDict "relations" (if a pair is in the annotated relations, I use that relation type; otherwise, the type defined for "no relation").
  2. parameters_cdr.yaml: I add the relation type defined above to "label2ignore" in the config file.

I would like to ask: is there anything else that needs to be modified?

A new task with more entity types and relation types

Thanks for your great work.

I have been trying your model in a new task where there are 9 entity types and 97 relation types. I have done the following modifications:

  • I changed the data_processing module to convert my data into the same format as yours, but with 97 relation types.
  • I also created a new parameters_*.yaml file and added all possible combinations of the 9 entity types to the "include_pairs" and "classify_pairs" fields.

Is there anything else I need to modify? I am encountering a "nan" issue in the graph layer but am not sure whether it is because I forgot something important.

Thanks very much!

Question about edge representation

First of all, you can really code; thank you for open-sourcing it. My question: in your implementation, when connecting nodes of different types, the edge's representation depends on the direction, because you simply concatenate the first node's representation with the second node's and ignore their types; for example, x_ms = [n_m; n_s] != x_sm = [n_s; n_m]. It seems to be more like an arc than an edge, though I think it does not really matter.
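The asymmetry being described is simply that concatenation is order-sensitive; a trivial sketch with toy vectors:

```python
n_m = [1.0, 2.0]   # mention-node representation (toy values)
n_s = [3.0, 4.0]   # sentence-node representation

x_ms = n_m + n_s   # [n_m; n_s]
x_sm = n_s + n_m   # [n_s; n_m]
print(x_ms != x_sm)  # True: the "edge" vector depends on direction
```

So two vectors built from the same pair of nodes differ unless the order is fixed, which is why the representation behaves like a directed arc.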

Where does 'PubMed-CDR.txt' come from?

Dear Fenia,
I met an error when running the function 'load_embeds' in 'loader.py'.

FileNotFoundError: [Errno 2] No such file or directory: '../embeds/PubMed-CDR.txt'

I didn't find the missing file in the embeds/ folder or in 'bio_nlp_vec.tar.gz'. Where does 'PubMed-CDR.txt' come from?

Some questions about intra-sentences or inter-sentences

Hello, I used your model to run on documents, but I don't know how to get the results of intra-sentence classification and inter-sentence classification separately. Could you tell me how to do that? Do you use intra-sentence pairs to train your model and test only on intra-sentence pairs?

some issues of code for Pre-processing

When I run the script sh process_cdr.sh from the Datasets & Pre-processing module, it fails.
I found that the line 'os.system('./geniass temp_file.txt temp_file.split.txt > /dev/null 2>&1')' reports an error; the error screenshot is shown below. Does the data processing code need Ruby? Ruby is not installed on my Linux system, and it is troublesome to install.
(error screenshot attached)
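Since the original call redirects all output to /dev/null, a missing or non-executable geniass binary fails silently and only surfaces later as the temp_file.split.txt FileNotFoundError seen in other issues. A defensive sketch (my own wrapper, not the repo's code) that surfaces the real failure instead:

```python
import os
import subprocess

def split_sentences(infile="temp_file.txt", outfile="temp_file.split.txt"):
    """Run the geniass sentence splitter and fail loudly instead of silently."""
    geniass = "./geniass"
    if not (os.path.isfile(geniass) and os.access(geniass, os.X_OK)):
        raise RuntimeError("geniass not found or not executable; "
                           "build/install it first (it may require Ruby)")
    # Unlike os.system('... > /dev/null 2>&1'), this raises on a non-zero
    # exit status and keeps stderr, so the real failure is visible.
    subprocess.run([geniass, infile, outfile], check=True,
                   capture_output=True, text=True)
    with open(outfile) as f:
        return f.read().splitlines()
```

With a check like this, the missing-dependency problem shows up at the geniass step itself rather than at the later open() call.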

geniatagger File not Found Error

This happens when I run process.py to process the CDR data:

Traceback (most recent call last):
  File "process.py", line 17, in <module>
    from tools import sentence_split_genia, tokenize_genia
  File "E:\TOOL\pythonProgram\edge-oriented-graph-master\data_processing\tools.py", line 25, in <module>
    genia_tagger = GENIATagger(os.path.join("./common", "genia-tagger-py", "geniatagger-3.0.2", "geniatagger"))
  File "./common/genia-tagger-py\geniatagger.py", line 34, in __init__
    cwd = directory
  File "E:\TOOL\Annaconda\envs\pycharm\lib\subprocess.py", line 800, in __init__
    restore_signals, start_new_session)
  File "E:\TOOL\Annaconda\envs\pycharm\lib\subprocess.py", line 1207, in _execute_child
    startupinfo)
FileNotFoundError: [WinError 2] The system cannot find the specified file.

a strange problem

Sorry to disturb you! I was stuck at the pre-processing step (process_gda.sh);
it tells me:
FileNotFoundError: [Errno 2] No such file or directory: 'temp_file.split.txt'
I have no idea what to do about this 'temp_file.split.txt'.

Question about processed data

Hi,
Thanks for the great work. I have a couple of questions regarding the output of the BC5CDR processed training data document 6287825 as follows:

  1. The gold-standard relations are denoted by 1:CID:2 R2L, right? The only relation that is CID and cross is 1:CID:2 R2L CROSS 129-130 114-118 D007538 Isoniazid Chemical 129 130 7 D010523 Diseases of peripheral nerves|peripheral nerve disease|Sensori - motor neuropathy|motor neuropathy|Peripheral neuropathy|sensori - motor neuropathy Disease 0:17:51:87:90:114 4:20:55:89:92:118 0:1:3:4:5:5, which means this is an inter-sentence relation between Isoniazid in sentence 7 and sensori - motor neuropathy in sentence 5, based on the given node offsets. So technically the annotator inferred this from the context of the two sentences, am I right? I believe you call this a positive relation in your code, true?
  2. Relations that aren't part of the gold standard are denoted by 1:NR:2 L2R CROSS when inter-sentence and 1:NR:2 R2L NON-CROSS when intra-sentence, right? I believe you call these negative relations in your code, true?
  3. Is there a way to see how you labelled inter-sentence relations? In other words, to tell whether an edge is mention-mention, mention-sentence, mention-entity, or sentence-sentence? For example, I am not sure whether 1:NR:2 L2R CROSS 97-98 152-154 D013831 thiamine Chemical 97 98 5 D003389 cranial neuropathy Disease 152 154 8 is sentence-sentence. Can you point me to the part where you implemented the different edge construction approaches?
  4. Have you considered entity-entity relations? To be more precise: if, for example, C1 has a relation with C2 and C2 has a relation with C3, we can infer that C1 has an inter-sentence relation with C3. I understand that in BC5CDR you focus on chemical-induced-disease relations; however, if I were interested in, say, chemical-chemical relations, would it be possible to add these to the processed data?

Best,
Ghadeer

"NotImplementedError "

When I try to run the following commands one by one in Colab:

!python3 eog.py --config ../configs/parameters_cdr.yaml --train --gpu -1 --early_stop
!python3 eog.py --config ../configs/parameters_cdr.yaml --train --gpu -1 --nepoch 15 --early_stop

I got the following error:

/content/edge-oriented-graph/src/converter.py:51: VisibleDeprecationWarning: Creating an ndarray from ragged nested sequences (which is a list-or-tuple of lists-or-tuples-or ndarrays with different lengths or shapes) is deprecated. If you meant to do this, you must specify 'dtype=object' when creating the ndarray.
  arrays = numpy.asarray(arrays)
Traceback (most recent call last):
  File "eog.py", line 87, in <module>
    main()
  File "eog.py", line 80, in main
    train(parameters)
  File "eog.py", line 50, in train
    trainer.run()
  File "/content/edge-oriented-graph/src/nnet/trainer.py", line 140, in run
    self.train_epoch(epoch)
  File "/content/edge-oriented-graph/src/nnet/trainer.py", line 178, in train_epoch
    loss, stats, predictions, select = self.model(batch)
  File "/usr/local/lib/python3.7/dist-packages/torch/nn/modules/module.py", line 493, in __call__
    result = self.forward(*input, **kwargs)
  File "/content/edge-oriented-graph/src/nnet/network.py", line 305, in forward
    batch['section'], batch['distances'])
  File "/content/edge-oriented-graph/src/nnet/network.py", line 90, in graph_layer
    m_cntx = self.attention(mentions, encoded_seq[info[:, 4]], info, word_sec)
  File "/usr/local/lib/python3.7/dist-packages/torch/nn/modules/module.py", line 493, in __call__
    result = self.forward(*input, **kwargs)
  File "/usr/local/lib/python3.7/dist-packages/torch/nn/modules/module.py", line 88, in forward
    raise NotImplementedError
NotImplementedError

I couldn't figure out what kind of error this is. How do I solve this issue?
Thank you.
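In the PyTorch version shown in this traceback, the base nn.Module.forward raises NotImplementedError, so the error means a module is being *called* whose forward was never properly defined (typically an indentation slip, a renamed method, or a version mismatch in a custom layer such as the attention module). A stdlib-only sketch of the mechanism (toy classes of my own, not the repo's code):

```python
class Module:
    """Minimal stand-in for torch.nn.Module's calling convention."""
    def __call__(self, *args, **kwargs):
        return self.forward(*args, **kwargs)

    def forward(self, *args, **kwargs):
        raise NotImplementedError  # what the base class does

class Attention(Module):
    # Oops: the method is defined under a different name (or at the wrong
    # indentation level), so the base-class forward is the one that runs.
    def fwd(self, x):
        return x

try:
    Attention()(1.0)
except NotImplementedError:
    print("base forward reached: the subclass never overrode forward()")
```

So the place to look is the attention module on the line named in the traceback: check that its forward method exists with that exact name and indentation under the installed PyTorch version.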

filter_hypernyms.py

Hello!
Thanks for your excellent work!
I have a question about the hyponym filtering in filter_hypernyms.py:

#102 ne[0] = 'not_include'  # just don't include the negative pairs, but keep the entities

Why do we need this step?
Looking forward to your reply!

another paper's source code.

Hi @fenchri,
would it be convenient for you to share the source code of the paper "Inter-sentence Relation Extraction with Document-level Graph Convolutional Neural Network"?
If you can, this is my mail address: [email protected]
If not, that's OK.
Hope you have a nice day!

Question about train error

Hello, dear fenchri. When I run this command:
python3 eog.py --config ../configs/parameters_cdr.yaml --train --gpu -0 --early_stop

something goes wrong, as follows:
(error screenshot attached)

help

When I run sh process_cdr.sh, the following problems occur:
czr@DESKTOP-QR7Q65P MINGW64 /d/edge-oriented-graph-reproduceEMNLP/data_processing
$ sh process_cdr.sh
mv: cannot stat '../data/CDR/processed/Training.data': No such file or directory
mv: cannot stat '../data/CDR/processed/Development.data': No such file or directory
mv: cannot stat '../data/CDR/processed/Test.data': No such file or directory
mv: cannot stat '../data/CDR/processed/Training_filter.data': No such file or directory
mv: cannot stat '../data/CDR/processed/Development_filter.data': No such file or directory
mv: cannot stat '../data/CDR/processed/Test_filter.data': No such file or directory
cat: ../data/CDR/processed/train_filter.data: No such file or directory
cat: ../data/CDR/processed/dev_filter.data: No such file or directory
(error screenshot attached)
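These mv and cat failures look like downstream symptoms: if process.py crashed earlier (e.g. the temp_file.split.txt error reported in other issues), the ../data/CDR/processed/*.data files are never created, and every later step of process_cdr.sh then fails on missing files. A small diagnostic sketch (my own helper; the expected file names are taken from the log above):

```python
import os
import tempfile

def missing_outputs(outdir):
    """Return the expected process.py output files that do not exist yet."""
    expected = [os.path.join(outdir, name + ".data")
                for name in ("Training", "Development", "Test")]
    return [p for p in expected if not os.path.isfile(p)]

# On an empty directory everything is reported missing, mirroring the log:
with tempfile.TemporaryDirectory() as d:
    print(missing_outputs(d))  # three paths, all missing
```

Running such a check right after process.py makes the real failure point obvious instead of letting the mv/cat cascade bury it.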

Can the processed versions of these two datasets be used in the same way as the DocRED dataset?

Hello, sorry to bother you. I would like to ask whether the CDR and GDA datasets processed by your data_processing program can be used with other document-level relation extraction models, like the publicly available DocRED dataset. After processing, I noticed that the files are in .data format, while the DocRED dataset is in .json format; the formats seem to be different. Any help would be greatly appreciated.

Question about result evaluation

Why does the gold data you generated for the CDR dataset contain 1:NR:2? I think it should only consider the 1:CID:2 type.
For example, you generated the following lines as gold data:

8701013|D015738|D003693|NON-CROSS|1:CID:2
8701013|D015738|D014456|NON-CROSS|1:NR:2
439781|D007213|D007022|NON-CROSS|1:CID:2
439781|D012964|D007022|NON-CROSS|1:NR:2
439781|D011453|D007022|CROSS|1:NR:2
439781|D000809|D007022|CROSS|1:NR:2

why not just:

8701013|D015738|D003693|NON-CROSS|1:CID:2
439781|D007213|D007022|NON-CROSS|1:CID:2
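For reference, restricting a gold file like this to the positive pairs is straightforward; a sketch assuming the pipe-separated layout shown above (the variable names are mine):

```python
gold_lines = [
    "8701013|D015738|D003693|NON-CROSS|1:CID:2",
    "8701013|D015738|D014456|NON-CROSS|1:NR:2",
    "439781|D007213|D007022|NON-CROSS|1:CID:2",
    "439781|D012964|D007022|NON-CROSS|1:NR:2",
]

# Keep only the positive (CID) pairs; the relation label is the last field.
positives = [ln for ln in gold_lines if ln.rsplit("|", 1)[1] == "1:CID:2"]
print(positives)
```

Whether the NR lines belong in the gold file, though, depends on how the evaluation script counts predictions, which is exactly the question here.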

Error in Data Processing

In data processing, after running sh process_cdr.sh, I got this error: 'FileNotFoundError: [Errno 2] No such file or directory: 'temp_file.split.txt''

The full error is:

11923it [00:00, 45980.01it/s]
Processing Doc_ID 227508:   0%| | 0/500 [00:00<?, ?it/s]
Traceback (most recent call last):
  File "process.py", line 123, in <module>
    main()
  File "process.py", line 62, in main
    split_sents = sentence_split_genia(orig_sentences)
  File "/disk/apouranb/RAMS/RAMS/classification_code/OT/EoG/edge-oriented-graph/data_processing/tools.py", line 248, in sentence_split_genia
    with open('temp_file.split.txt', 'r') as ifile:
FileNotFoundError: [Errno 2] No such file or directory: 'temp_file.split.txt'
Loading examples from ../data/CDR/processed/Training.data
Mesh entities: 28470
Positive Docs: 0
Negative Docs: 0
Positive Count: 0
Initial Negative Count: 0
Final Negative Count: 0
Hyponyms: 0 0
12103it [00:00, 45581.38it/s]
Processing Doc_ID 6794356:   0%| | 0/500 [00:00<?, ?it/s]
Traceback (most recent call last):
  File "process.py", line 123, in <module>
    main()
  File "process.py", line 62, in main
    split_sents = sentence_split_genia(orig_sentences)
  File "/disk/apouranb/RAMS/RAMS/classification_code/OT/EoG/edge-oriented-graph/data_processing/tools.py", line 248, in sentence_split_genia
    with open('temp_file.split.txt', 'r') as ifile:
FileNotFoundError: [Errno 2] No such file or directory: 'temp_file.split.txt'
Loading examples from ../data/CDR/processed/Development.data
Mesh entities: 28470
Positive Docs: 0
Negative Docs: 0
Positive Count: 0
Initial Negative Count: 0
Final Negative Count: 0
Hyponyms: 0 0

Server error

When I was compiling genia-tagger-py, I received '503 Service Temporarily Unavailable'. Please tell me how to solve it, thanks.
