cgraywang / deepex
Code repo for EMNLP21 paper "Zero-Shot Information Extraction as a Unified Text-to-Triple Translation"
Home Page: https://arxiv.org/pdf/2109.11171.pdf
License: Apache License 2.0
Hi,
did you release a script for "Magolor/deepex-ranking-model"?
I don't know how to use it.
Hi,
I am unable to figure out how to run OIE inference using the deepex models. Can you please help answering:
Hi,
I am unable to figure out how to use the deepex model. I want to input a sentence and get the triplets generated from it. Is there any code snippet/script in the repo that can help me do that?
Thanks
When downloading with `git clone --recursive git@github.com:cgraywang/deepex.git`, I got the following error:
Cloning into 'deepex'...
Permission denied (publickey).
fatal: Could not read from remote repository.
Please make sure you have the correct access rights
and the repository exists.
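This "Permission denied (publickey)" error means git is attempting an SSH clone without a matching public key registered on GitHub. As a sketch of a workaround (assuming you only need read access to the public repository), cloning over HTTPS avoids SSH keys entirely:

```shell
# Clone over HTTPS instead of SSH; anonymous read access works for
# public repositories, so no SSH keypair needs to be registered.
git clone --recursive https://github.com/cgraywang/deepex.git
```

Alternatively, add an SSH key to your GitHub account and keep the original SSH URL.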
I followed the README and successfully ran the OIE2016 benchmark. Then I modified the OIE_2016.json
file to point to my own directory, with test.txt
containing just one line: Julia owns two cats and one dog.
The output, however, is really poor: the expected triples [Julia, owns, two cats]
and [Julia, owns, one dog]
have low scores, and there are many other (ill-formed) triples, some with even higher scores (complete output attached below).
Is this the normal behavior of the model? Why is the performance so poor? Is there a systematic issue with how I am doing this?
['$input_txt:$ Julia owns two cats and one dog',
{'deduplicated:': {'Julia [SEP] owns two [SEP] One Dog': [2,
0.15572701767086983,
[[0, 5], [24, 31]],
8,
0],
'Julia [SEP] owns two cats [SEP] One Dog': [5,
0.38142501655966043,
[[0, 5], [24, 31]],
21,
0],
'Julia [SEP] cats [SEP] One Dog': [2,
0.09584893216378987,
[[0, 5], [24, 31]],
6,
0],
'Two Cats [SEP] cats [SEP] One Dog': [2,
0.09491016250103712,
[[11, 19], [24, 31]],
6,
0],
'Julia [SEP] owns two cats and [SEP] One Dog': [6,
0.4070159122347832,
[[0, 5], [24, 31]],
26,
0],
'Julia [SEP] cats and [SEP] Two Cats': [2,
0.14055378548800945,
[[0, 5], [11, 19]],
9,
0],
'Julia [SEP] two cats [SEP] One Dog': [1,
0.06196947582066059,
[[0, 5], [24, 31]],
4,
0],
'Julia [SEP] one [SEP] Two Cats': [1,
0.06188515014946461,
[[0, 5], [11, 19]],
4,
0],
'Julia [SEP] owns [SEP] Two Cats': [6,
0.2982198027893901,
[[0, 5], [11, 19]],
20,
0],
'Julia [SEP] owns [SEP] One Dog': [4,
0.17877793312072754,
[[0, 5], [24, 31]],
12,
0],
'Julia [SEP] two [SEP] One Dog': [3,
0.1447404371574521,
[[0, 5], [24, 31]],
10,
0],
'Julia [SEP] and one [SEP] Two Cats': [5,
0.32297115167602897,
[[0, 5], [11, 19]],
23,
0],
'Two Cats [SEP] owns [SEP] One Dog': [8,
0.44091942673549056,
[[11, 19], [24, 31]],
32,
0],
'Julia [SEP] and [SEP] One Dog': [1,
0.04122000187635422,
[[0, 5], [24, 31]],
3,
0],
'Julia [SEP] cats and one [SEP] Two Cats': [2,
0.10967723815701902,
[[0, 5], [11, 19]],
8,
0],
'Two Cats [SEP] one [SEP] One Dog': [2,
0.08130411058664322,
[[11, 19], [24, 31]],
6,
0],
'Two Cats [SEP] owns two [SEP] One Dog': [8,
0.4609282175078988,
[[11, 19], [24, 31]],
36,
0],
'Julia [SEP] and [SEP] Two Cats': [3,
0.14550211280584335,
[[0, 5], [11, 19]],
12,
0],
'Two Cats [SEP] two [SEP] One Dog': [4,
0.19086267473176122,
[[11, 19], [24, 31]],
16,
0],
'Two Cats [SEP] and [SEP] One Dog': [6,
0.1381131475791335,
[[11, 19], [24, 31]],
21,
0]}}]
The file of P0_result.json is empty.
An evaluation on BenchIE would be great, since it is probably the best available dataset.
Link to dataset: https://paperswithcode.com/dataset/benchie
Hi,
Could you describe the steps to reproduce the results in your paper for OIE2016 and the other systems? You used the supervised-oie repo and the benchmark included there. There is also a related repo from the same author, oie-benchmark, with the same evaluation. However, the two are slightly different, have had updates, and neither seems to work out of the box when running the eval.sh script, which should produce the PR-curve plots (part of what I was interested in comparing).
Looking at them, it's unclear which version of the benchmark corpus is being used, and whether you picked the test (or dev) split or used the whole dataset. The default script seems to use all the data, but the tasks you provide seem to use the test split.
Thanks
Tony
Hi, thank you for your good work on OIE2016. I used your default parameters with 1 V100 GPU, but I got the following results:
Did I make some mistake that prevents me from reaching the 0.72 F1?
Also, I see that the contrastively pretrained deepex-ranking-model has not been released, and the released code is inference-only; it takes about 3 hours to test. Am I right?
Hope for your reply!
Thank you for your great work on zero-shot IE. I am very impressed with the results, but I am curious about the 'task-agnostic corpus': is it an open dataset?
PyTorch 1.7.1 with CUDA 10.1, single GPU.
After running
git clone --recursive git@github.com:cgraywang/deepex.git
cd ./deepex
conda create --name deepex python=3.7 -y
conda activate deepex
pip install -r requirements.txt
pip install -e .
bash tasks/OIE_2016.sh
I am getting
FileNotFoundError: [Errno 2] No such file or directory: 'result/OIE_2016.bert-large-cased.np.d2048.b6.sorted/P0_result.json'
I suspect this is because the output file was not generated, so digging into the trace log I found:
RuntimeError: NCCL error in: /opt/conda/conda-bld/pytorch_1607370141920/work/torch/lib/c10d/ProcessGroupNCCL.cpp:784, invalid usage, NCCL version 2.7.8
(the same RuntimeError is raised by each of the worker processes, so the messages appear interleaved)
And some other NCCL errors.
I am not sure where to start debugging.
Thank you in advance.
I am trying to do a simple text-to-triple inference task (openIE) and I thought that this would be the way to start.
Hello, thanks for your contribution to Open Information Extraction research.
I'm currently working on using your repo to create triples from raw Wikipedia text.
bash tasks/OIE_2016.sh
I changed the data_dir to my own texts, then executed the above command.
The code works fine and gives me an output file called search_res.json.
The problem is that I can't find a description of the format of this output,
i.e. what the keys for each value are.
{"deduplicated:": {"Another Temporary Gallery [SEP] gallery [SEP] The Museum": [8, 0.7081716619431973, [[0, 25], [65, 75]], 27, 0] ...
For example, what I want to know about is this list:
[8, 0.7081716619431973, [[0, 25], [65, 75]], 27, 0]
Could you briefly explain what each of these values means?
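The field layout is not documented in the repo as far as I can tell; the following is a minimal sketch for working with the file, assuming (inferred from inspecting sample outputs, not confirmed by the authors) that each key is "head [SEP] relation [SEP] tail", index 1 of the value is a ranking score, and index 2 holds the character spans of the head and tail in the input sentence. The meanings of the remaining integer fields are unclear.

```python
import json  # used when loading a real search_res.json (see usage comment below)

def top_triples(dedup, k=3):
    """Rank the entries of a 'deduplicated:' dict by score, descending.

    Assumed layout (inferred from sample output, not documented):
    each key is 'head [SEP] relation [SEP] tail'; each value is a
    5-element list where index 1 looks like a ranking score and
    index 2 like [head_span, tail_span] character offsets.
    """
    ranked = sorted(dedup.items(), key=lambda kv: kv[1][1], reverse=True)
    out = []
    for key, value in ranked[:k]:
        head, rel, tail = key.split(" [SEP] ")
        out.append({"head": head, "relation": rel, "tail": tail,
                    "score": value[1], "spans": value[2]})
    return out

# Usage (path is an example): each record in search_res.json wraps
# such a dict under the key "deduplicated:" (note the trailing colon).
# with open("search_res.json") as f:
#     record = json.load(f)
# triples = top_triples(record["deduplicated:"])
```

In the sample above, the spans [[0, 25], [65, 75]] do match the head and tail surface positions in the input sentence, which is what motivates the span interpretation.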
Using the example from the paper, "Born in Glasgow, Fisher is a graduate of the London Opera Centre.", the triples extracted by the first-step generator are rather messy, and I am not sure whether this is normal (reproducing the OIE2016 results was fine: top-3 F1 = 72.6). The model is the default bert-large-cased.
search_res.json.txt
Thank you for your great work on zero-shot IE. Where can I find the script for the pre-training of Magolor/deepex-ranking-model?
When I tried to run OIE on a new sentence, I got ValueError: y_true takes value in {} and pos_label is not specified: either make y_true take value in {0, 1} or {-1, 1} or pass pos_label explicitly.
The only modification was changing the content of supervised-oie/supervised-oie-benchmark/raw_sentences/test.txt
to just
Julia owns two cats and one dog.
As in #17
Please help @cgraywang @jesseLiu2000 @filip-cermak
(deepex_new) root@c605ffdb427a:/home/zhanwen/deepex_new# bash tasks/OIE_2016.sh
Process rank: 2, device: cuda:2, n_gpu: 1, distributed training: True, 16-bits training: False
Process rank: 1, device: cuda:1, n_gpu: 1, distributed training: True, 16-bits training: False
08/14/2023 22:24:07 - WARNING - __main__ - Process rank: 2, device: cuda:2, n_gpu: 1, distributed training: True, 16-bits training: False
08/14/2023 22:24:07 - WARNING - __main__ - Process rank: 1, device: cuda:1, n_gpu: 1, distributed training: True, 16-bits training: False
Process rank: 0, device: cuda:0, n_gpu: 1, distributed training: True, 16-bits training: False
08/14/2023 22:24:07 - WARNING - __main__ - Process rank: 0, device: cuda:0, n_gpu: 1, distributed training: True, 16-bits training: False
Process rank: 3, device: cuda:3, n_gpu: 1, distributed training: True, 16-bits training: False
08/14/2023 22:24:07 - WARNING - __main__ - Process rank: 3, device: cuda:3, n_gpu: 1, distributed training: True, 16-bits training: False
Some weights of the model checkpoint at bert-large-cased were not used when initializing BertForMaskedLM: ['cls.seq_relationship.bias', 'cls.seq_relationship.weight']
- This IS expected if you are initializing BertForMaskedLM from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForMaskedLM from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of the model checkpoint at bert-large-cased were not used when initializing BertForMaskedLM: ['cls.seq_relationship.weight', 'cls.seq_relationship.bias']
- This IS expected if you are initializing BertForMaskedLM from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForMaskedLM from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of the model checkpoint at bert-large-cased were not used when initializing BertForMaskedLM: ['cls.seq_relationship.bias', 'cls.seq_relationship.weight']
- This IS expected if you are initializing BertForMaskedLM from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForMaskedLM from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of the model checkpoint at bert-large-cased were not used when initializing BertForMaskedLM: ['cls.seq_relationship.weight', 'cls.seq_relationship.bias']
- This IS expected if you are initializing BertForMaskedLM from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForMaskedLM from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
time spent on loading data augmentation: 0.8261523246765137s
08/14/2023 22:24:13 - INFO - __main__ - time spent on loading data augmentation: 0.8261523246765137s
create batch examples...: 1it [00:00, 2337.96it/s] | 0/1 [00:00<?, ?it/s]
time spent on loading data augmentation: 0.8587894439697266s
08/14/2023 22:24:14 - INFO - __main__ - time spent on loading data augmentation: 0.8587894439697266s
Generate dataset and results: 0%| | 0/4 [00:00<?, ?it/s]08/14/2023 22:24:14 - INFO - deepex.model.kgm - ***** Running Generate_triplets *****
08/14/2023 22:24:14 - INFO - deepex.model.kgm - Num examples = 1
08/14/2023 22:24:14 - INFO - deepex.model.kgm - Batch size = 16
time spent on loading data augmentation: 0.851825475692749s
08/14/2023 22:24:14 - INFO - __main__ - time spent on loading data augmentation: 0.851825475692749s
create batch examples...: 1it [00:00, 1945.41it/s] | 0/4 [00:00<?, ?it/s]
time spent on loading data augmentation: 0.84853196144104s | 0/1 [00:00<?, ?it/s]
08/14/2023 22:24:14 - INFO - __main__ - time spent on loading data augmentation: 0.84853196144104s
create batch examples...: 1it [00:00, 2023.30it/s] | 0/4 [00:00<?, ?it/s]
create batch examples...: 1it [00:00, 2041.02it/s]t/s]
Generate batch dataset and results: 0it [00:00, ?it/s] 08/14/2023 22:24:14 - INFO - deepex.model.kgm - forward time cost 0.2758486270904541s
08/14/2023 22:24:15 - INFO - deepex.model.kgm - search time cost 0.8128805160522461s
Generate_triplets: 100%|█████████████████████████████████████████████████████████████████████████████████| 1/1 [00:01<00:00, 1.31s/it]
total producing triplets time: 1.6015987396240234s | 0/1 [00:00<?, ?it/s]
08/14/2023 22:24:15 - INFO - __main__ - total producing triplets time: 1.6015987396240234s
total dump triplets time: 0.0033626556396484375s
08/14/2023 22:24:15 - INFO - __main__ - total dump triplets time: 0.0033626556396484375s███████████████| 1/1 [00:01<00:00, 1.31s/it]
convert batch examples to features...: 1it [00:01, 1.62s/it]
process feature files...: 1it [00:01, 1.62s/it] 1.62s/it]
Generate batch dataset and results: 1it [00:01, 1.62s/it]
Generate dataset and results: 100%|██████████████████████████████████████████████████████████████████████| 1/1 [00:01<00:00, 1.70s/it]
total time: 2.525900363922119s
08/14/2023 22:24:15 - INFO - __main__ - total time: 2.525900363922119s
Generate_triplets: 100%|█████████████████████████████████████████████████████████████████████████████████| 1/1 [00:01<00:00, 1.34s/it]
total producing triplets time: 1.468618631362915s
08/14/2023 22:24:15 - INFO - __main__ - total producing triplets time: 1.468618631362915s
total dump triplets time: 0.0034074783325195312s
08/14/2023 22:24:15 - INFO - __main__ - total dump triplets time: 0.0034074783325195312s███████████████| 1/1 [00:01<00:00, 1.34s/it]
convert batch examples to features...: 1it [00:01, 1.48s/it]
process feature files...: 1it [00:01, 1.48s/it] 1.48s/it]
Generate batch dataset and results: 1it [00:01, 1.48s/it]
Generate dataset and results: 100%|██████████████████████████████████████████████████████████████████████| 4/4 [00:01<00:00, 2.57it/s]
total time: 2.4193410873413086s
08/14/2023 22:24:15 - INFO - __main__ - total time: 2.4193410873413086s
Generate_triplets: 100%|█████████████████████████████████████████████████████████████████████████████████| 1/1 [00:01<00:00, 1.34s/it]
total producing triplets time: 1.5090875625610352s
08/14/2023 22:24:15 - INFO - __main__ - total producing triplets time: 1.5090875625610352s
total dump triplets time: 0.004492044448852539s
08/14/2023 22:24:15 - INFO - __main__ - total dump triplets time: 0.004492044448852539s████████████████| 1/1 [00:01<00:00, 1.34s/it]
convert batch examples to features...: 1it [00:01, 1.52s/it]
process feature files...: 1it [00:01, 1.52s/it] 1.52s/it]
Generate batch dataset and results: 1it [00:01, 1.52s/it]
Generate dataset and results: 100%|██████████████████████████████████████████████████████████████████████| 4/4 [00:01<00:00, 2.50it/s]
total time: 2.4521644115448s
08/14/2023 22:24:15 - INFO - __main__ - total time: 2.4521644115448s
Generate_triplets: 100%|█████████████████████████████████████████████████████████████████████████████████| 1/1 [00:01<00:00, 1.48s/it]
total producing triplets time: 1.6742017269134521s
08/14/2023 22:24:16 - INFO - __main__ - total producing triplets time: 1.6742017269134521s
total dump triplets time: 0.0035593509674072266s
08/14/2023 22:24:16 - INFO - __main__ - total dump triplets time: 0.0035593509674072266s███████████████| 1/1 [00:01<00:00, 1.47s/it]
convert batch examples to features...: 1it [00:01, 1.68s/it]
process feature files...: 1it [00:01, 1.68s/it] 1.68s/it]
Generate batch dataset and results: 1it [00:01, 1.68s/it]
Generate dataset and results: 100%|██████████████████████████████████████████████████████████████████████| 4/4 [00:01<00:00, 2.27it/s]
total time: 2.614199638366699s
08/14/2023 22:24:16 - INFO - __main__ - total time: 2.614199638366699s
deduplicating batch: 0%| | 0/4 [00:00<?, ?it/s]output/classified/OIE_2016/P0/0_BertTokenizerFast_NPMentionGenerator_256_0_1/search_res.json
deduplicating doc: 100%|████████████████████████████████████████████████████████████████████████████████| 1/1 [00:00<00:00, 201.03it/s]
output/classified/OIE_2016/P0/0_BertTokenizerFast_NPMentionGenerator_256_0_2/search_res.json | 0/1 [00:00<?, ?it/s]
deduplicating doc: 100%|████████████████████████████████████████████████████████████████████████████████| 1/1 [00:00<00:00, 207.29it/s]
output/classified/OIE_2016/P0/0_BertTokenizerFast_NPMentionGenerator_256_0_3/search_res.json | 0/1 [00:00<?, ?it/s]
deduplicating doc: 100%|████████████████████████████████████████████████████████████████████████████████| 1/1 [00:00<00:00, 207.98it/s]
output/classified/OIE_2016/P0/0_BertTokenizerFast_NPMentionGenerator_256_0_0/search_res.json | 0/1 [00:00<?, ?it/s]
deduplicating doc: 100%|████████████████████████████████████████████████████████████████████████████████| 1/1 [00:00<00:00, 206.81it/s]
deduplicating batch: 100%|██████████████████████████████████████████████████████████████████████████████| 4/4 [00:00<00:00, 177.91it/s]
sorting: 100%|█████████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:00<00:00, 4064.25it/s]
merging doc: 100%|█████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:00<00:00, 9822.73it/s]
total triplets: 2560
100%|████████████████████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:01<00:00, 1.81s/it]
/root/miniconda3/envs/deepex_new/lib/python3.7/site-packages/sklearn/utils/deprecation.py:144: FutureWarning: The sklearn.preprocessing.data module is deprecated in version 0.22 and will be removed in version 0.24. The corresponding classes / functions should instead be imported from sklearn.preprocessing. Anything that cannot be imported from sklearn.preprocessing is now part of the private API.
warnings.warn(message, FutureWarning)
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data] Package stopwords is already up-to-date!
INFO:root:Writing PR curve of DeepEx to eval_data/OIE_2016/deepex.oie_2016.3.dat
Traceback (most recent call last):
File "benchmark.py", line 231, in <module>
error_file = args["--error-file"])
File "benchmark.py", line 101, in compare
recallMultiplier = ((correctTotal - unmatchedCount)/float(correctTotal)))
File "benchmark.py", line 125, in prCurve
precision_ls, recall_ls, thresholds = precision_recall_curve(y_true, y_scores)
File "/root/miniconda3/envs/deepex_new/lib/python3.7/site-packages/sklearn/metrics/_ranking.py", line 653, in precision_recall_curve
sample_weight=sample_weight)
File "/root/miniconda3/envs/deepex_new/lib/python3.7/site-packages/sklearn/metrics/_ranking.py", line 544, in _binary_clf_curve
classes_repr=classes_repr))
ValueError: y_true takes value in {} and pos_label is not specified: either make y_true take value in {0, 1} or {-1, 1} or pass pos_label explicitly.
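The traceback suggests sklearn's precision_recall_curve is being handed an empty y_true: with a single custom sentence, no predicted triple matches any gold extraction, so the benchmark has no 0/1 labels to score. A minimal guard sketch (the function name is made up; this is not the repo's own code) reproduces the failure mode:

```python
def pr_curve_or_none(y_true, y_scores):
    """Guarded wrapper around sklearn's precision_recall_curve.

    sklearn raises 'y_true takes value in {} and pos_label is not
    specified' when y_true is empty, which is what happens when the
    custom test.txt sentence has no matched gold extractions.
    """
    if not set(y_true):
        return None  # nothing to evaluate against; skip the PR curve
    from sklearn.metrics import precision_recall_curve  # lazy import
    return precision_recall_curve(y_true, y_scores, pos_label=1)
```

The underlying fix is presumably to supply gold annotations for the custom sentence, since the benchmark scores predictions against gold extraction files rather than against raw_sentences/test.txt alone.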
OIE_2016 (top 3)
In ./scripts/deepex/model/kgm.py, line 444, you wrote inputs['head_entity_ids'][b][i].name=="$NIL", but there is no such context; other places use $NIL$. Was this a mistake?
Which file implements the loss in Equation (1) of the paper? The Reranking function only computes the first-order norm of the sentence embedding and the triple embedding.