shmsw25 / ambigqa
An original implementation of EMNLP 2020, "AmbigQA: Answering Ambiguous Open-domain Questions"
Home Page: https://arxiv.org/abs/2004.10645
For evaluation on NQ, what exactly is id2answers? I noticed that you set self.data[i]["answer"] += id2answers[d["id"]] for training but self.data[i]["answer"] = id2answers[d["id"]] for evaluation. May I know the distinction?
Thanks.
In README of the codes directory, in Step 1 under DPR Retrieval, the download resource name should be "checkpoint.retriever.multiset.bert-base-encoder" ("multiset" rather than "multi"), to be consistent with that in download_data.py.
Lines 201 to 217 in 74f9ff8
After line 207, I think the attention_mask should also be cropped, as attention_mask[idx] = curr_attention_mask[:end_of_question]. Otherwise, it will not be aligned with input_ids.
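To illustrate the alignment issue, here is a minimal sketch with plain lists (the variable names follow the snippet above; this is not the repository's actual code): cropping both sequences at the same boundary keeps each attention value paired with its token id.

```python
def crop_question(curr_input_ids, curr_attention_mask, end_of_question):
    # Crop both sequences at the same boundary so that each attention
    # value still lines up with its corresponding token id.
    input_ids = curr_input_ids[:end_of_question]
    attention_mask = curr_attention_mask[:end_of_question]
    assert len(input_ids) == len(attention_mask)
    return input_ids, attention_mask
```

If only input_ids were cropped, the two lists would have different lengths and the mask would no longer correspond to the right tokens.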
How do I get the context passages for AmbigNQ? I downloaded the extra resources, but the ids in AmbigNQ don't correspond to the ids in the sqlite3 database. I also downloaded the NQ data, and although the example ids there do correspond, the AmbigNQ data appears to have been normalized to plain text while the original NQ data has not. This raises several questions about matching — and don't the start and end positions of the answer need to be adjusted as well?
Hi @shmsw25, I notice you mentioned two versions of NQ answers in the README, and that the {train|dev|test}_id2answers.json files are the answers released by Google. However, I checked the answers in the original version and found that in some cases they differ from your released files. In some cases, answers in {train|dev|test}_id2answers.json are empty strings but are non-empty in the original version.
Hi @shmsw25, while I was reading the paper as well as the code, I didn't find the disambig-first baseline code in this codebase. Is the disambig-first baseline code included here? Thanks very much!
Hi Sewon,
I found in the table https://github.com/shmsw25/AmbigQA/tree/master/codes#aggregated-results that the performance appears to improve significantly compared to the reported results (e.g., 42.2 -> 45 on NQ-open, 30.8 -> 35.8 on AmbigQA zero-shot). I wonder, besides "re-implemented the models with Huggingface Transformers", what other changes were made to achieve such performance boosts (e.g., did you change parameters like train_batch_size or train_M)? Also, if I understand correctly, SpanSeqGen (this code) uses BART-base, which outperforms DPR BERT-large despite having fewer parameters, right?
Thank you!
Hi,
Thanks for sharing your work!
I'm a little confused about the _take_mml function. Could you tell me why you add - 1e10 * (loss_tensor==0).float() here? Thanks!
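For context, a pure-Python sketch of what such a mask typically accomplishes in maximum marginal likelihood (MML) objectives — assuming, as an illustration, that loss_tensor holds per-answer negative log-likelihoods with 0.0 used as padding for answers that don't exist for a given example:

```python
import math

def take_mml(loss_rows):
    # loss_rows: one row per example; each entry is a per-answer
    # negative log-likelihood, with 0.0 marking padded (absent) answers.
    out = []
    for row in loss_rows:
        # exp(-loss - 1e10) underflows to 0, so padded entries contribute
        # nothing to the marginal likelihood sum over answer candidates.
        marginal = sum(
            math.exp(-loss - (1e10 if loss == 0.0 else 0.0)) for loss in row
        )
        out.append(-math.log(marginal))
    return out
```

Without the mask, each padded entry would contribute exp(0) = 1 to the sum, corrupting the marginal likelihood.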
Hi @shmsw25, in the evaluation script, F1 bleu1~4 are computed:
AmbigQA/ambigqa_evaluate_script.py
Line 28 in f0f17a2
This makes me a little confused, since the paper reports only F1 BLEU. Is F1 BLEU in the paper an average over all n-gram BLEUs? What is the correct way to compare the results from the evaluation script with the scores reported in the paper?
Hi, thanks for the great work and dataset. I am just wondering whether the implementation of SpanSeqGen will be released? If so, when will it be public? Thanks.
Hi Sewon,
For the provided checkpoint named "DPR Reader trained on AmbigNQ (387M)", is it trained just on the AmbigNQ training set or NQ + AmbigNQ training set? I tried to train a DPR Reader with BERT-base span extractor on AmbigNQ training set alone and the result I got seems to be lower than the provided checkpoint.
Best,
Chenglei
Hi,
The README refers to this file in the section about the interactive reader, but I cannot find it.
Has it been released?
Thank you
Hi there: just a suggestion that it may be nice to rename the default branch of this repository to something else (like 'main') to avoid using non-inclusive names?
There is a simple tutorial on how to do this for Github repositories here: https://github.com/github/renaming
In codes/models/span_predictor.py, the classifier layer is defined as:
self.qa_classifier = nn.Linear(config.hidden_size, 1)
But shouldn't it be:
self.qa_classifier = nn.Linear(config.hidden_size, config.num_labels)
(where config.num_labels=2)?
For ambignq/dev.json, should the answer be Cristiano Ronaldo? Or, before 2001, it seems the highest goal scorer in men's world international football was Pelé of Brazil.
"qaPairs": [
{
"question": "Who has the highest goals in men's world international football?",
"answer": [
"Daei",
"Ali Daei"
]
},
{
"question": "Who has the highest goals all-time in men's football?",
"answer": [
"Bican",
"Josef Bican"
]
},
...
]
It seems unanswerable questions are not considered, is that right?
I was browsing through the dataset (dev_light.json). I find it odd that some examples in the json file contain multiple repeated items, for example:
[{'type': 'singleAnswer', 'answer': ['Nick Robinson']}, {'type': 'singleAnswer', 'answer': ['Nick Robinson']}, {'type': 'singleAnswer', 'answer': ['Nick Robinson']}]
[{'type': 'singleAnswer', 'answer': ['21']}, {'type': 'singleAnswer', 'answer': ['21']}, {'type': 'singleAnswer', 'answer': ['twenty-one', '21']}]
while some others are in the form like:
[{'type': 'singleAnswer', 'answer': ['November 16, 2003', '16 November 2003', 'November 16th, 2003']}]
The same dict is repeated three times in the first list. Is this done on purpose as part of the annotation? And for evaluation, can we just treat the set of all unique answer strings as the correct answers?
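A minimal sketch of the "set of all unique answer strings" reading above — this is only one possible interpretation, not the official evaluation logic:

```python
def unique_answers(annotations):
    # Collect the unique answer strings across all (possibly repeated)
    # annotation dicts, preserving first-seen order.
    seen = []
    for ann in annotations:
        for answer in ann.get("answer", []):
            if answer not in seen:
                seen.append(answer)
    return seen
```

Under this reading, the three repeated 'Nick Robinson' dicts collapse to a single gold answer string.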
Hi,
I'm trying to reproduce the SpanPredictor + Thresholding baseline but have some questions about the actual implementation:
In line 79 of span_predictor.py, you set sel_labels = torch.zeros(N, dtype=torch.long).cuda(). My understanding is that the sel module predicts whether the passage contains a correct answer or not. If that's the case, why is sel_labels set to all zeros?
I've browsed through some data and found that some passages contain multiple correct answer spans. In such cases, did you take all the detected spans into account during training? (i.e., each passage can have multiple different correct answer spans and we optimize the likelihood over all of them instead of just one of them.)
How exactly is the threshold selected? I didn't find the function / code snippet for choosing the threshold that determines how many answers to predict. (You used np.log(0.05) in AmbigQAData's evaluate function, but I'm not sure where that number came from.)
Thanks.
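For reference, a sketch of how log-probability thresholding for a variable number of answers might look — the function name, the candidate format, and the "always keep the top candidate" rule are my assumptions, not the repository's implementation; only the math.log(0.05) constant comes from the code discussed above:

```python
import math

def select_answers(candidates, threshold=math.log(0.05)):
    # candidates: (answer_text, log_prob) pairs sorted by score, best first.
    # Keep the top candidate unconditionally, then every further distinct
    # candidate whose log-probability clears the threshold.
    kept = [candidates[0][0]]
    for text, log_prob in candidates[1:]:
        if log_prob > threshold and text not in kept:
            kept.append(text)
    return kept
```

With threshold = log(0.05) ≈ -3.0, a candidate is kept only if its probability exceeds 0.05.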
Hi @shmsw25, while running the command

for i in 0 1 2 3 4 5 6 7 8 9 ; do  # for parallelization
  python3 cli.py --do_predict --bert_name bert-base-uncased --output_dir out/dpr \
    --do_predict --task dpr --predict_batch_size 3200 --db_index $i
done

I get an error from the assertion assert args.bert_name=="bart-large" in PassageData.py.
Can you please help?