cgraywang / deepex
Code repo for EMNLP21 paper "Zero-Shot Information Extraction as a Unified Text-to-Triple Translation"
Home Page: https://arxiv.org/pdf/2109.11171.pdf
License: Apache License 2.0
Hi,
did you release a script for "Magolor/deepex-ranking-model"?
I don't know how to use it.
Hi,
I am unable to figure out how to run OIE inference using the deepex models. Can you please help answering:
Hi,
I am unable to figure out how to use the deepex model. I want to input a sentence and get the triplets generated from it. Is there any code snippet/script in the repo that can help me do that?
Thanks
When downloading with `git clone --recursive git@github.com:cgraywang/deepex.git`, I got the following error:
Cloning into 'deepex'...
Permission denied (publickey).
fatal: Could not read from remote repository.
Please make sure you have the correct access rights
and the repository exists.
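This "Permission denied (publickey)" error means git is attempting an SSH clone without a matching public key registered on GitHub. As a sketch of a workaround (assuming you only need read access to the public repository), cloning over HTTPS avoids SSH keys entirely:

```shell
# Clone over HTTPS instead of SSH; anonymous read access works for
# public repositories, so no SSH keypair needs to be registered.
git clone --recursive https://github.com/cgraywang/deepex.git
```

Alternatively, add an SSH key to your GitHub account and keep the original SSH URL.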
I followed the README and successfully ran the OIE2016 benchmark. Then I modified the OIE_2016.json
file to point to my own directory, with test.txt
containing just one line: Julia owns two cats and one dog.
The output, however, is really poor: the expected triples [Julia, owns, two cats]
and [Julia, owns, one dog]
have low scores, and there are many other (ill-formed) triples, some with even higher scores (complete output attached below).
Is this the normal behavior of the model? Why is the performance so poor? Is there a systematic issue with how I am doing this?
['$input_txt:$ Julia owns two cats and one dog',
{'deduplicated:': {'Julia [SEP] owns two [SEP] One Dog': [2,
0.15572701767086983,
[[0, 5], [24, 31]],
8,
0],
'Julia [SEP] owns two cats [SEP] One Dog': [5,
0.38142501655966043,
[[0, 5], [24, 31]],
21,
0],
'Julia [SEP] cats [SEP] One Dog': [2,
0.09584893216378987,
[[0, 5], [24, 31]],
6,
0],
'Two Cats [SEP] cats [SEP] One Dog': [2,
0.09491016250103712,
[[11, 19], [24, 31]],
6,
0],
'Julia [SEP] owns two cats and [SEP] One Dog': [6,
0.4070159122347832,
[[0, 5], [24, 31]],
26,
0],
'Julia [SEP] cats and [SEP] Two Cats': [2,
0.14055378548800945,
[[0, 5], [11, 19]],
9,
0],
'Julia [SEP] two cats [SEP] One Dog': [1,
0.06196947582066059,
[[0, 5], [24, 31]],
4,
0],
'Julia [SEP] one [SEP] Two Cats': [1,
0.06188515014946461,
[[0, 5], [11, 19]],
4,
0],
'Julia [SEP] owns [SEP] Two Cats': [6,
0.2982198027893901,
[[0, 5], [11, 19]],
20,
0],
'Julia [SEP] owns [SEP] One Dog': [4,
0.17877793312072754,
[[0, 5], [24, 31]],
12,
0],
'Julia [SEP] two [SEP] One Dog': [3,
0.1447404371574521,
[[0, 5], [24, 31]],
10,
0],
'Julia [SEP] and one [SEP] Two Cats': [5,
0.32297115167602897,
[[0, 5], [11, 19]],
23,
0],
'Two Cats [SEP] owns [SEP] One Dog': [8,
0.44091942673549056,
[[11, 19], [24, 31]],
32,
0],
'Julia [SEP] and [SEP] One Dog': [1,
0.04122000187635422,
[[0, 5], [24, 31]],
3,
0],
'Julia [SEP] cats and one [SEP] Two Cats': [2,
0.10967723815701902,
[[0, 5], [11, 19]],
8,
0],
'Two Cats [SEP] one [SEP] One Dog': [2,
0.08130411058664322,
[[11, 19], [24, 31]],
6,
0],
'Two Cats [SEP] owns two [SEP] One Dog': [8,
0.4609282175078988,
[[11, 19], [24, 31]],
36,
0],
'Julia [SEP] and [SEP] Two Cats': [3,
0.14550211280584335,
[[0, 5], [11, 19]],
12,
0],
'Two Cats [SEP] two [SEP] One Dog': [4,
0.19086267473176122,
[[11, 19], [24, 31]],
16,
0],
'Two Cats [SEP] and [SEP] One Dog': [6,
0.1381131475791335,
[[11, 19], [24, 31]],
21,
0]}}]
The file of P0_result.json is empty.
An evaluation on BenchIE would be great, since it is probably the best available dataset.
Link to dataset: https://paperswithcode.com/dataset/benchie
Hi,
Could you describe the steps to reproduce the results in your paper for OIE2016 and the other systems? You used the supervised-oie repo and the benchmark included there. There is also a related repo from the same author, oie-benchmark, with the same evaluation. However, the two are slightly different, have had updates, and neither seems to work out of the box when running the eval.sh script, which should produce the PR-curve plots (part of what I was interested in comparing).
Looking at them, it's unclear which version of the benchmark corpus is being used, and whether you picked the test (or dev) split or used the whole dataset. The default script seems to use all the data, but the tasks you provide seem to use the test split.
Thanks
Tony
Hi, thank you for your good work on OIE2016. I used your default parameters with 1 V100 GPU, but I got the following results:
Did I make some mistake that prevents me from reaching the 0.72 F1?
Also, I see that the contrastively pretrained deepex-ranking-model has not been released, and the released code is inference-only; it takes about 3 hours to test. Am I right?
Hope for your reply!
Thank you for your great work on zero-shot IE. I am very impressed with the results, but I am curious about the 'task-agnostic corpus': is it an open dataset?
PyTorch 1.7.1 with CUDA 10.1, single GPU.
After running
git clone --recursive git@github.com:cgraywang/deepex.git
cd ./deepex
conda create --name deepex python=3.7 -y
conda activate deepex
pip install -r requirements.txt
pip install -e .
bash tasks/OIE_2016.sh
I am getting
FileNotFoundError: [Errno 2] No such file or directory: 'result/OIE_2016.bert-large-cased.np.d2048.b6.sorted/P0_result.json'
I suspect this is because the output file was not generated, so digging into the trace log I found:
RuntimeError: NCCL error in: /opt/conda/conda-bld/pytorch_1607370141920/work/torch/lib/c10d/ProcessGroupNCCL.cpp:784, invalid usage, NCCL version 2.7.8
(the same RuntimeError is raised by each of the worker processes, so the messages appear interleaved)
And some other NCCL errors.
I am not sure where to start debugging.
Thank you in advance.
I am trying to do a simple text-to-triple inference task (openIE) and I thought that this would be the way to start.
Hello, thanks for your contribution to Open Information Extraction research.
I'm currently working on using your repo to create triples from raw Wikipedia text.
bash tasks/OIE_2016.sh
I changed the data_dir to my own texts, then executed the above command.
The code works fine and gives me an output file called search_res.json.
The problem is that I can't find a description of the format of this output,
i.e. what the keys for each value are.
{"deduplicated:": {"Another Temporary Gallery [SEP] gallery [SEP] The Museum": [8, 0.7081716619431973, [[0, 25], [65, 75]], 27, 0] ...
For example, what I want to know about is this list:
[8, 0.7081716619431973, [[0, 25], [65, 75]], 27, 0]
Could you briefly explain what each of these values means?
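The field layout is not documented in the repo as far as I can tell; the following is a minimal sketch for working with the file, assuming (inferred from inspecting sample outputs, not confirmed by the authors) that each key is "head [SEP] relation [SEP] tail", index 1 of the value is a ranking score, and index 2 holds the character spans of the head and tail in the input sentence. The meanings of the remaining integer fields are unclear.

```python
import json  # used when loading a real search_res.json (see usage comment below)

def top_triples(dedup, k=3):
    """Rank the entries of a 'deduplicated:' dict by score, descending.

    Assumed layout (inferred from sample output, not documented):
    each key is 'head [SEP] relation [SEP] tail'; each value is a
    5-element list where index 1 looks like a ranking score and
    index 2 like [head_span, tail_span] character offsets.
    """
    ranked = sorted(dedup.items(), key=lambda kv: kv[1][1], reverse=True)
    out = []
    for key, value in ranked[:k]:
        head, rel, tail = key.split(" [SEP] ")
        out.append({"head": head, "relation": rel, "tail": tail,
                    "score": value[1], "spans": value[2]})
    return out

# Usage (path is an example): each record in search_res.json wraps
# such a dict under the key "deduplicated:" (note the trailing colon).
# with open("search_res.json") as f:
#     record = json.load(f)
# triples = top_triples(record["deduplicated:"])
```

In the sample above, the spans [[0, 25], [65, 75]] do match the head and tail surface positions in the input sentence, which is what motivates the span interpretation.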
Using the example from the paper, "Born in Glasgow, Fisher is a graduate of the London Opera Centre.", the triples extracted by the first-step generator are rather messy, and I am not sure whether this is normal (reproducing the OIE2016 results was fine: top-3 F1 = 72.6). The model is the default bert-large-cased.
search_res.json.txt
Thank you for your great work on zero-shot IE. Where can I find the script for the pre-training of Magolor/deepex-ranking-model?
When I tried to run OIE on a new sentence, I got ValueError: y_true takes value in {} and pos_label is not specified: either make y_true take value in {0, 1} or {-1, 1} or pass pos_label explicitly.
The only modification was changing the content of supervised-oie/supervised-oie-benchmark/raw_sentences/test.txt
to just
Julia owns two cats and one dog.
As in #17
Please help @cgraywang @jesseLiu2000 @filip-cermak
(deepex_new) root@c605ffdb427a:/home/zhanwen/deepex_new# bash tasks/OIE_2016.sh
Process rank: 2, device: cuda:2, n_gpu: 1, distributed training: True, 16-bits training: False
Process rank: 1, device: cuda:1, n_gpu: 1, distributed training: True, 16-bits training: False
08/14/2023 22:24:07 - WARNING - __main__ - Process rank: 2, device: cuda:2, n_gpu: 1, distributed training: True, 16-bits training: False
08/14/2023 22:24:07 - WARNING - __main__ - Process rank: 1, device: cuda:1, n_gpu: 1, distributed training: True, 16-bits training: False
Process rank: 0, device: cuda:0, n_gpu: 1, distributed training: True, 16-bits training: False
08/14/2023 22:24:07 - WARNING - __main__ - Process rank: 0, device: cuda:0, n_gpu: 1, distributed training: True, 16-bits training: False
Process rank: 3, device: cuda:3, n_gpu: 1, distributed training: True, 16-bits training: False
08/14/2023 22:24:07 - WARNING - __main__ - Process rank: 3, device: cuda:3, n_gpu: 1, distributed training: True, 16-bits training: False
Some weights of the model checkpoint at bert-large-cased were not used when initializing BertForMaskedLM: ['cls.seq_relationship.bias', 'cls.seq_relationship.weight']
- This IS expected if you are initializing BertForMaskedLM from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForMaskedLM from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of the model checkpoint at bert-large-cased were not used when initializing BertForMaskedLM: ['cls.seq_relationship.weight', 'cls.seq_relationship.bias']
- This IS expected if you are initializing BertForMaskedLM from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForMaskedLM from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of the model checkpoint at bert-large-cased were not used when initializing BertForMaskedLM: ['cls.seq_relationship.bias', 'cls.seq_relationship.weight']
- This IS expected if you are initializing BertForMaskedLM from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForMaskedLM from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of the model checkpoint at bert-large-cased were not used when initializing BertForMaskedLM: ['cls.seq_relationship.weight', 'cls.seq_relationship.bias']
- This IS expected if you are initializing BertForMaskedLM from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForMaskedLM from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
time spent on loading data augmentation: 0.8261523246765137s
08/14/2023 22:24:13 - INFO - __main__ - time spent on loading data augmentation: 0.8261523246765137s
create batch examples...: 1it [00:00, 2337.96it/s] | 0/1 [00:00<?, ?it/s]
time spent on loading data augmentation: 0.8587894439697266s
08/14/2023 22:24:14 - INFO - __main__ - time spent on loading data augmentation: 0.8587894439697266s
Generate dataset and results: 0%| | 0/4 [00:00<?, ?it/s]08/14/2023 22:24:14 - INFO - deepex.model.kgm - ***** Running Generate_triplets *****
08/14/2023 22:24:14 - INFO - deepex.model.kgm - Num examples = 1
08/14/2023 22:24:14 - INFO - deepex.model.kgm - Batch size = 16
time spent on loading data augmentation: 0.851825475692749s
08/14/2023 22:24:14 - INFO - __main__ - time spent on loading data augmentation: 0.851825475692749s
create batch examples...: 1it [00:00, 1945.41it/s] | 0/4 [00:00<?, ?it/s]
time spent on loading data augmentation: 0.84853196144104s | 0/1 [00:00<?, ?it/s]
08/14/2023 22:24:14 - INFO - __main__ - time spent on loading data augmentation: 0.84853196144104s
create batch examples...: 1it [00:00, 2023.30it/s] | 0/4 [00:00<?, ?it/s]
create batch examples...: 1it [00:00, 2041.02it/s]t/s]
Generate batch dataset and results: 0it [00:00, ?it/s] 08/14/2023 22:24:14 - INFO - deepex.model.kgm - forward time cost 0.2758486270904541s
08/14/2023 22:24:15 - INFO - deepex.model.kgm - search time cost 0.8128805160522461s
Generate_triplets: 100%|█████████████████████████████████████████████████████████████████████████████████| 1/1 [00:01<00:00, 1.31s/it]
total producing triplets time: 1.6015987396240234s | 0/1 [00:00<?, ?it/s]
08/14/2023 22:24:15 - INFO - __main__ - total producing triplets time: 1.6015987396240234s
total dump triplets time: 0.0033626556396484375s
08/14/2023 22:24:15 - INFO - __main__ - total dump triplets time: 0.0033626556396484375s███████████████| 1/1 [00:01<00:00, 1.31s/it]
convert batch examples to features...: 1it [00:01, 1.62s/it]
process feature files...: 1it [00:01, 1.62s/it] 1.62s/it]
Generate batch dataset and results: 1it [00:01, 1.62s/it]
Generate dataset and results: 100%|██████████████████████████████████████████████████████████████████████| 1/1 [00:01<00:00, 1.70s/it]
total time: 2.525900363922119s
08/14/2023 22:24:15 - INFO - __main__ - total time: 2.525900363922119s
Generate_triplets: 100%|█████████████████████████████████████████████████████████████████████████████████| 1/1 [00:01<00:00, 1.34s/it]
total producing triplets time: 1.468618631362915s
08/14/2023 22:24:15 - INFO - __main__ - total producing triplets time: 1.468618631362915s
total dump triplets time: 0.0034074783325195312s
08/14/2023 22:24:15 - INFO - __main__ - total dump triplets time: 0.0034074783325195312s███████████████| 1/1 [00:01<00:00, 1.34s/it]
convert batch examples to features...: 1it [00:01, 1.48s/it]
process feature files...: 1it [00:01, 1.48s/it] 1.48s/it]
Generate batch dataset and results: 1it [00:01, 1.48s/it]
Generate dataset and results: 100%|██████████████████████████████████████████████████████████████████████| 4/4 [00:01<00:00, 2.57it/s]
total time: 2.4193410873413086s
08/14/2023 22:24:15 - INFO - __main__ - total time: 2.4193410873413086s
Generate_triplets: 100%|█████████████████████████████████████████████████████████████████████████████████| 1/1 [00:01<00:00, 1.34s/it]
total producing triplets time: 1.5090875625610352s
08/14/2023 22:24:15 - INFO - __main__ - total producing triplets time: 1.5090875625610352s
total dump triplets time: 0.004492044448852539s
08/14/2023 22:24:15 - INFO - __main__ - total dump triplets time: 0.004492044448852539s████████████████| 1/1 [00:01<00:00, 1.34s/it]
convert batch examples to features...: 1it [00:01, 1.52s/it]
process feature files...: 1it [00:01, 1.52s/it] 1.52s/it]
Generate batch dataset and results: 1it [00:01, 1.52s/it]
Generate dataset and results: 100%|██████████████████████████████████████████████████████████████████████| 4/4 [00:01<00:00, 2.50it/s]
total time: 2.4521644115448s
08/14/2023 22:24:15 - INFO - __main__ - total time: 2.4521644115448s
Generate_triplets: 100%|█████████████████████████████████████████████████████████████████████████████████| 1/1 [00:01<00:00, 1.48s/it]
total producing triplets time: 1.6742017269134521s
08/14/2023 22:24:16 - INFO - __main__ - total producing triplets time: 1.6742017269134521s
total dump triplets time: 0.0035593509674072266s
08/14/2023 22:24:16 - INFO - __main__ - total dump triplets time: 0.0035593509674072266s███████████████| 1/1 [00:01<00:00, 1.47s/it]
convert batch examples to features...: 1it [00:01, 1.68s/it]
process feature files...: 1it [00:01, 1.68s/it] 1.68s/it]
Generate batch dataset and results: 1it [00:01, 1.68s/it]
Generate dataset and results: 100%|██████████████████████████████████████████████████████████████████████| 4/4 [00:01<00:00, 2.27it/s]
total time: 2.614199638366699s
08/14/2023 22:24:16 - INFO - __main__ - total time: 2.614199638366699s
deduplicating batch: 0%| | 0/4 [00:00<?, ?it/s]output/classified/OIE_2016/P0/0_BertTokenizerFast_NPMentionGenerator_256_0_1/search_res.json
deduplicating doc: 100%|████████████████████████████████████████████████████████████████████████████████| 1/1 [00:00<00:00, 201.03it/s]
output/classified/OIE_2016/P0/0_BertTokenizerFast_NPMentionGenerator_256_0_2/search_res.json | 0/1 [00:00<?, ?it/s]
deduplicating doc: 100%|████████████████████████████████████████████████████████████████████████████████| 1/1 [00:00<00:00, 207.29it/s]
output/classified/OIE_2016/P0/0_BertTokenizerFast_NPMentionGenerator_256_0_3/search_res.json | 0/1 [00:00<?, ?it/s]
deduplicating doc: 100%|████████████████████████████████████████████████████████████████████████████████| 1/1 [00:00<00:00, 207.98it/s]
output/classified/OIE_2016/P0/0_BertTokenizerFast_NPMentionGenerator_256_0_0/search_res.json | 0/1 [00:00<?, ?it/s]
deduplicating doc: 100%|████████████████████████████████████████████████████████████████████████████████| 1/1 [00:00<00:00, 206.81it/s]
deduplicating batch: 100%|██████████████████████████████████████████████████████████████████████████████| 4/4 [00:00<00:00, 177.91it/s]
sorting: 100%|█████████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:00<00:00, 4064.25it/s]
merging doc: 100%|█████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:00<00:00, 9822.73it/s]
total triplets: 2560
100%|████████████████████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:01<00:00, 1.81s/it]
/root/miniconda3/envs/deepex_new/lib/python3.7/site-packages/sklearn/utils/deprecation.py:144: FutureWarning: The sklearn.preprocessing.data module is deprecated in version 0.22 and will be removed in version 0.24. The corresponding classes / functions should instead be imported from sklearn.preprocessing. Anything that cannot be imported from sklearn.preprocessing is now part of the private API.
warnings.warn(message, FutureWarning)
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data] Package stopwords is already up-to-date!
INFO:root:Writing PR curve of DeepEx to eval_data/OIE_2016/deepex.oie_2016.3.dat
Traceback (most recent call last):
File "benchmark.py", line 231, in <module>
error_file = args["--error-file"])
File "benchmark.py", line 101, in compare
recallMultiplier = ((correctTotal - unmatchedCount)/float(correctTotal)))
File "benchmark.py", line 125, in prCurve
precision_ls, recall_ls, thresholds = precision_recall_curve(y_true, y_scores)
File "/root/miniconda3/envs/deepex_new/lib/python3.7/site-packages/sklearn/metrics/_ranking.py", line 653, in precision_recall_curve
sample_weight=sample_weight)
File "/root/miniconda3/envs/deepex_new/lib/python3.7/site-packages/sklearn/metrics/_ranking.py", line 544, in _binary_clf_curve
classes_repr=classes_repr))
ValueError: y_true takes value in {} and pos_label is not specified: either make y_true take value in {0, 1} or {-1, 1} or pass pos_label explicitly.
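The traceback suggests sklearn's precision_recall_curve is being handed an empty y_true: with a single custom sentence, no predicted triple matches any gold extraction, so the benchmark has no 0/1 labels to score. A minimal guard sketch (the function name is made up; this is not the repo's own code) reproduces the failure mode:

```python
def pr_curve_or_none(y_true, y_scores):
    """Guarded wrapper around sklearn's precision_recall_curve.

    sklearn raises 'y_true takes value in {} and pos_label is not
    specified' when y_true is empty, which is what happens when the
    custom test.txt sentence has no matched gold extractions.
    """
    if not set(y_true):
        return None  # nothing to evaluate against; skip the PR curve
    from sklearn.metrics import precision_recall_curve  # lazy import
    return precision_recall_curve(y_true, y_scores, pos_label=1)
```

The underlying fix is presumably to supply gold annotations for the custom sentence, since the benchmark scores predictions against gold extraction files rather than against raw_sentences/test.txt alone.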
OIE_2016 (top 3)
In ./scripts/deepex/model/kgm.py, line 444, you wrote inputs['head_entity_ids'][b][i].name=="$NIL", but there is no such context; other places use $NIL$. Was this a mistake?
Which file implements the loss in Equation (1) of the paper? The Reranking function only computes the first-order norm of the sentence embedding and the triple embedding.