shmsw25 / ambigqa
An original implementation of EMNLP 2020, "AmbigQA: Answering Ambiguous Open-domain Questions"
Home Page: https://arxiv.org/abs/2004.10645
For evaluation on NQ, what exactly is id2answers? I noticed that you set self.data[i]["answer"] += id2answers[d["id"]] for training but self.data[i]["answer"] = id2answers[d["id"]] for evaluation. May I know the distinction?
Thanks.
In README of the codes directory, in Step 1 under DPR Retrieval, the download resource name should be "checkpoint.retriever.multiset.bert-base-encoder" ("multiset" rather than "multi"), to be consistent with that in download_data.py.
Lines 201 to 217 in 74f9ff8
After line 207, I think the attention_mask should also be cropped, as attention_mask[idx] = curr_attention_mask[:end_of_question]. Otherwise, it will not be aligned with input_ids.
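To illustrate the alignment issue, here is a minimal sketch with plain lists (the variable names follow the snippet above; this is not the repository's actual code): cropping both sequences at the same boundary keeps each attention value paired with its token id.

```python
def crop_question(curr_input_ids, curr_attention_mask, end_of_question):
    # Crop both sequences at the same boundary so that each attention
    # value still lines up with its corresponding token id.
    input_ids = curr_input_ids[:end_of_question]
    attention_mask = curr_attention_mask[:end_of_question]
    assert len(input_ids) == len(attention_mask)
    return input_ids, attention_mask
```

If only input_ids were cropped, the two lists would have different lengths and the mask would no longer correspond to the right tokens.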
How do I get the context passages for AmbigNQ? I downloaded the extra resources, but the ids in AmbigNQ don't correspond to the ids in the sqlite3 database. I also downloaded the NQ data, and although the example ids there do correspond, the AmbigNQ data appears to have been normalized to plain text while the original NQ data has not. This raises several questions about matching — and don't the start and end positions of the answer need to be adjusted as well?
Hi @shmsw25, I notice you mentioned two versions of NQ answers in the README, and that the {train|dev|test}_id2answers.json files are the answers released by Google. However, I checked the answers in the original version and found that in some cases they differ from your released files. In some cases, answers in {train|dev|test}_id2answers.json are empty strings but are non-empty in the original version.
Hi @shmsw25, while I was reading the paper as well as the code, I didn't find the disambig-first baseline code in this codebase. Is the disambig-first baseline code included here? Thanks very much!
Hi Sewon,
I found in the table https://github.com/shmsw25/AmbigQA/tree/master/codes#aggregated-results that the performance appears to improve significantly compared to the reported results (e.g., 42.2 -> 45 on NQ-open, 30.8 -> 35.8 on AmbigQA zero-shot). I wonder, besides "re-implemented the models with Huggingface Transformers", what other changes were made to achieve such performance boosts (e.g., did you change parameters like train_batch_size or train_M)? Also, if I understand correctly, SpanSeqGen (this code) uses BART-base, which outperforms DPR BERT-large despite having fewer parameters, right?
Thank you!
Hi,
Thanks for sharing your work!
I'm a little confused about the _take_mml function. Could you tell me why you add - 1e10 * (loss_tensor==0).float() here? Thanks!
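For context, a pure-Python sketch of what such a mask typically accomplishes in maximum marginal likelihood (MML) objectives — assuming, as an illustration, that loss_tensor holds per-answer negative log-likelihoods with 0.0 used as padding for answers that don't exist for a given example:

```python
import math

def take_mml(loss_rows):
    # loss_rows: one row per example; each entry is a per-answer
    # negative log-likelihood, with 0.0 marking padded (absent) answers.
    out = []
    for row in loss_rows:
        # exp(-loss - 1e10) underflows to 0, so padded entries contribute
        # nothing to the marginal likelihood sum over answer candidates.
        marginal = sum(
            math.exp(-loss - (1e10 if loss == 0.0 else 0.0)) for loss in row
        )
        out.append(-math.log(marginal))
    return out
```

Without the mask, each padded entry would contribute exp(0) = 1 to the sum, corrupting the marginal likelihood.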
Hi @shmsw25, in the evaluation script, F1 bleu1~4 are computed:
AmbigQA/ambigqa_evaluate_script.py
Line 28 in f0f17a2
This makes me a little confused, since the paper reports only F1 BLEU. Is F1 BLEU in the paper an average over all n-gram BLEUs? What is the correct way to compare the results from the evaluation script with the scores reported in the paper?
Hi, thanks for the great work and dataset. I am just wondering whether the implementation of SpanSeqGen will be released? If so, when will it be public? Thanks.
Hi Sewon,
For the provided checkpoint named "DPR Reader trained on AmbigNQ (387M)", is it trained just on the AmbigNQ training set or NQ + AmbigNQ training set? I tried to train a DPR Reader with BERT-base span extractor on AmbigNQ training set alone and the result I got seems to be lower than the provided checkpoint.
Best,
Chenglei
Hi,
The README refers to this file in the section about the interactive reader, but I cannot find it.
Has it been released?
Thank you
Hi there: just a suggestion that it may be nice to rename the default branch of this repository to something else (like 'main') to avoid using non-inclusive names?
There is a simple tutorial on how to do this for Github repositories here: https://github.com/github/renaming
In codes/models/span_predictor.py, the classifier layer is defined as:
self.qa_classifier = nn.Linear(config.hidden_size, 1)
But shouldn't it be:
self.qa_classifier = nn.Linear(config.hidden_size, config.num_labels)
(where config.num_labels=2)?
For ambignq/dev.json, should the answer be Cristiano Ronaldo? Or, before 2001, it seems the highest goal scorer in men's world international football was Pelé of Brazil.
"qaPairs": [
{
"question": "Who has the highest goals in men's world international football?",
"answer": [
"Daei",
"Ali Daei"
]
},
{
"question": "Who has the highest goals all-time in men's football?",
"answer": [
"Bican",
"Josef Bican"
]
},
...
]
It seems unanswerable questions are not considered, is that right?
I was browsing through the dataset (dev_light.json). I find it odd that some examples in the json file contain multiple repeated items, for example:
[{'type': 'singleAnswer', 'answer': ['Nick Robinson']}, {'type': 'singleAnswer', 'answer': ['Nick Robinson']}, {'type': 'singleAnswer', 'answer': ['Nick Robinson']}]
[{'type': 'singleAnswer', 'answer': ['21']}, {'type': 'singleAnswer', 'answer': ['21']}, {'type': 'singleAnswer', 'answer': ['twenty-one', '21']}]
while some others are in the form like:
[{'type': 'singleAnswer', 'answer': ['November 16, 2003', '16 November 2003', 'November 16th, 2003']}]
The same dict is repeated three times in the first list. Is this done on purpose as part of the annotation? And for evaluation, can we just treat the set of all unique answer strings as the correct answers?
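A minimal sketch of the "set of all unique answer strings" reading above — this is only one possible interpretation, not the official evaluation logic:

```python
def unique_answers(annotations):
    # Collect the unique answer strings across all (possibly repeated)
    # annotation dicts, preserving first-seen order.
    seen = []
    for ann in annotations:
        for answer in ann.get("answer", []):
            if answer not in seen:
                seen.append(answer)
    return seen
```

Under this reading, the three repeated 'Nick Robinson' dicts collapse to a single gold answer string.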
Hi,
I'm trying to reproduce the SpanPredictor + Thresholding baseline but have some questions about the actual implementation:
In line 79 of span_predictor.py, you set sel_labels = torch.zeros(N, dtype=torch.long).cuda(). My understanding is that the sel module predicts whether the passage contains a correct answer or not. If that's the case, why is sel_labels set to all zeros?
I've browsed through some data and found that some passages contain multiple correct answer spans. In such cases, did you take all the detected spans into account during training? (i.e., each passage can have multiple different correct answer spans and we optimize the likelihood over all of them instead of just one of them.)
How exactly is the threshold selected? I didn't find the function / code snippet for choosing the threshold that determines how many answers to predict. (You used np.log(0.05) in AmbigQAData's evaluate function, but I'm not sure where that number came from.)
Thanks.
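For reference, a sketch of how log-probability thresholding for a variable number of answers might look — the function name, the candidate format, and the "always keep the top candidate" rule are my assumptions, not the repository's implementation; only the math.log(0.05) constant comes from the code discussed above:

```python
import math

def select_answers(candidates, threshold=math.log(0.05)):
    # candidates: (answer_text, log_prob) pairs sorted by score, best first.
    # Keep the top candidate unconditionally, then every further distinct
    # candidate whose log-probability clears the threshold.
    kept = [candidates[0][0]]
    for text, log_prob in candidates[1:]:
        if log_prob > threshold and text not in kept:
            kept.append(text)
    return kept
```

With threshold = log(0.05) ≈ -3.0, a candidate is kept only if its probability exceeds 0.05.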
Hi @shmsw25, while running the command

for i in 0 1 2 3 4 5 6 7 8 9 ; do  # for parallelization
  python3 cli.py --do_predict --bert_name bert-base-uncased --output_dir out/dpr \
    --do_predict --task dpr --predict_batch_size 3200 --db_index $i
done

I get an error from the assertion assert args.bert_name=="bart-large" in PassageData.py.
Can you please help?