
ambigqa's People

Contributors

shmsw25, sycoraxx


ambigqa's Issues

Evaluation Question

For evaluation on NQ, what exactly is id2answers? I noticed that you set self.data[i]["answer"] += id2answers[d["id"]] for training but self.data[i]["answer"] = id2answers[d["id"]] for evaluation. May I know what the distinction is?
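
To make my question concrete, here is a toy illustration of the two behaviours as I read them (hypothetical data, not the repo's actual contents):

# toy data, not the repo's actual contents
id2answers = {"q1": ["answer from NQ"]}
example = {"id": "q1", "answer": ["answer from the AmbigQA annotation"]}

# training path: the NQ answers are appended to the existing answer list
train_answers = example["answer"] + id2answers[example["id"]]
# -> ['answer from the AmbigQA annotation', 'answer from NQ']

# evaluation path: the answer list is replaced by the id2answers entry
eval_answers = id2answers[example["id"]]
# -> ['answer from NQ']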

Thanks.

A Minor Typo

In the README of the codes directory, in Step 1 under DPR Retrieval, the download resource name should be "checkpoint.retriever.multiset.bert-base-encoder" ("multiset" rather than "multi"), to be consistent with download_data.py.

In BART generation, attention_mask is not aligned with input_ids

AmbigQA/codes/Data.py

Lines 201 to 217 in 74f9ff8

for idx, (curr_input_ids, curr_attention_mask, curr_metadata, dpr_ids) in enumerate(zip(
        input_ids, attention_mask, metadata, dpr_passages)):
    dpr_input_ids = [self.passages.tokenized_data["input_ids"][_id] for _id in dpr_ids]
    dpr_attention_mask = [self.passages.tokenized_data["attention_mask"][_id] for _id in dpr_ids]
    offset = 0
    end_of_question = curr_input_ids.index(self.tokenizer.eos_token_id)+1
    input_ids[idx] = curr_input_ids[:end_of_question]
    # 7.75 approximately
    while len(input_ids[idx])<1024:
        assert dpr_input_ids[offset][0] == bos_token_id
        assert len(dpr_input_ids[offset])==len(dpr_attention_mask[offset])
        assert np.sum(dpr_attention_mask[offset])==len(dpr_attention_mask[offset])
        input_ids[idx] += dpr_input_ids[offset][1:]
        attention_mask[idx] += dpr_attention_mask[offset][1:]
        offset += 1
    input_ids[idx] = input_ids[idx][:1024]
    attention_mask[idx] = attention_mask[idx][:1024]

After line 207, I think the attention_mask should also be cropped as attention_mask[idx] = curr_attention_mask[:end_of_question]. Otherwise, it will not be aligned with input_ids.
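
In other words, the change I am proposing, sketched against the snippet above (everything else stays the same):

    end_of_question = curr_input_ids.index(self.tokenizer.eos_token_id)+1
    input_ids[idx] = curr_input_ids[:end_of_question]
    # proposed addition: crop the mask to the same prefix so it stays aligned with input_ids
    attention_mask[idx] = curr_attention_mask[:end_of_question]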

How do I get the context in AmbigNQ?

How do I get the context (supporting passages) for AmbigNQ? I downloaded the extra resources, but the ids in AmbigNQ do not correspond to the ids in the sqlite3 database. I also downloaded the original NQ data, and although the example ids there do correspond, the AmbigNQ data appears to have been processed into plain text while the original NQ data has not, so I have a number of questions about how to match them. Also, are the start and end positions of the answer not needed?
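
For reference, this is roughly how I am trying to look the context up; the database path and the documents(id, text) schema are my guesses about the extra resources and may be wrong:

import sqlite3

# my guess at the layout of the extra resources: a DrQA-style sqlite database
# with a documents(id, text) table; the path and schema may well be wrong
conn = sqlite3.connect("data/wikipedia/docs.db")
cursor = conn.cursor()
cursor.execute("SELECT text FROM documents WHERE id = ?", ("example-id",))
row = cursor.fetchone()
print(row[0] if row is not None else "id not found")
conn.close()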

About {train|dev|test}_id2answers.json

Hi @shmsw25, I notice you mention two versions of NQ answers in the README, and that the {train|dev|test}_id2answers.json files are the answers released by Google. However, when I check the answers in the original version, I find that in some cases they differ from your released files; in particular, some answers in {train|dev|test}_id2answers.json are empty strings but are non-empty in the original version.
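
For reference, this is the kind of check I ran to surface the mismatches (the file paths are placeholders, and I am assuming both files map an example id to a list of answer strings):

import json

# placeholder paths; I'm assuming both files map an example id to a list of answer strings
with open("dev_id2answers.json") as f:
    id2answers = json.load(f)
with open("nq_open_dev_original.json") as f:
    original = json.load(f)

for qid, answers in id2answers.items():
    if any(a == "" for a in answers) and original.get(qid):
        print(qid, "empty in id2answers but non-empty in the original:", original[qid])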

DisAmbig-First baseline code

Hi @shmsw25, while reading the paper and the code, I could not find the disambiguate-first baseline in this codebase. Is the disambiguate-first baseline code included here? Thanks very much!

Detailed differences between "reported" and "this code"

Hi Sewon,

I found that in the table at https://github.com/shmsw25/AmbigQA/tree/master/codes#aggregated-results the performance appears to improve significantly over the reported results (e.g., 42.2 -> 45 on NQ-open, 30.8 -> 35.8 on AmbigQA zero-shot). I wonder, besides "re-implemented the models with Huggingface Transformers", what other changes were made to achieve such performance boosts (e.g., did you change parameters like train_batch_size or train_M)? Also, if I understand correctly, SpanSeqGen (this code) uses BART-base, which outperforms DPR BERT-large despite having fewer parameters, right?

Thank you!

About BLEU score in the evaluation script and in the paper

Hi @shmsw25, in the evaluation script, F1 bleu1~4 are computed:

METRICS_QG = ["F1 bleu1", "F1 bleu2", "F1 bleu3", "F1 bleu4", "F1 edit-f1"]

This confuses me a little, since the paper reports only F1 BLEU.
Is the F1 BLEU in the paper an average over all the n-gram BLEUs?
What is the correct way to compare the results from the evaluation script with the scores reported in the paper?

Question About the Results

Hi Sewon,

For the provided checkpoint named "DPR Reader trained on AmbigNQ (387M)", is it trained on just the AmbigNQ training set, or on the NQ + AmbigNQ training sets? I tried to train a DPR Reader with a BERT-base span extractor on the AmbigNQ training set alone, and the result I got is lower than the provided checkpoint's.

Best,
Chenglei

Where is InteractiveDPR.py?

Hi,

The README refers to this file in the section about the interactive reader, but I cannot find it. Has it been released?
Has it been released?

Thank you

Potential Error in SpanPredictor Codes

In codes/models/span_predictor.py, the classifier layer is defined as:
self.qa_classifier = nn.Linear(config.hidden_size, 1)

But shouldn't it be:
self.qa_classifier = nn.Linear(config.hidden_size, config.num_labels) (where config.num_labels=2)?
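
For comparison, a standard extractive QA head (e.g., Hugging Face's BertForQuestionAnswering) projects to two logits per token and splits them into start and end scores, roughly like this (a sketch for illustration, not this repo's code):

import torch
import torch.nn as nn

hidden_size, num_labels = 768, 2
qa_classifier = nn.Linear(hidden_size, num_labels)

sequence_output = torch.randn(1, 128, hidden_size)  # [batch, seq_len, hidden]
logits = qa_classifier(sequence_output)             # [batch, seq_len, 2]
start_logits, end_logits = logits.split(1, dim=-1)  # one score per token for start and for end
start_logits = start_logits.squeeze(-1)             # [batch, seq_len]
end_logits = end_logits.squeeze(-1)                 # [batch, seq_len]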

sample answer in ambignq/dev.json

For this sample from ambignq/dev.json (below), should the answer be Cristiano Ronaldo? Or, if the question refers to before 2001, it seems the highest goal scorer in men's world international football before 2001 was Pelé of Brazil.

"qaPairs": [
                    {
                        "question": "Who has the highest goals in men's world international football?",
                        "answer": [
                            "Daei",
                            "Ali Daei"
                        ]
                    },
                    {
                        "question": "Who has the highest goals all-time in men's football?",
                        "answer": [
                            "Bican",
                            "Josef Bican"
                        ]
                    },
                 ...
                ]

Repeated Items in dev_light.json

I was browsing through the dataset (dev_light.json) and find it odd that some examples in the JSON file contain multiple repeated items, for example:
[{'type': 'singleAnswer', 'answer': ['Nick Robinson']}, {'type': 'singleAnswer', 'answer': ['Nick Robinson']}, {'type': 'singleAnswer', 'answer': ['Nick Robinson']}]
[{'type': 'singleAnswer', 'answer': ['21']}, {'type': 'singleAnswer', 'answer': ['21']}, {'type': 'singleAnswer', 'answer': ['twenty-one', '21']}]

while others look like:
[{'type': 'singleAnswer', 'answer': ['November 16, 2003', '16 November 2003', 'November 16th, 2003']}]

The same dict is repeated three times in the first list. Is this done on purpose as part of the annotation? And for evaluation, can we just treat the set of all unique answer strings as the correct answers?
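
Concretely, here is what I currently do for evaluation (my own assumption, pending your confirmation):

# one annotation list copied from dev_light.json
annotations = [
    {"type": "singleAnswer", "answer": ["Nick Robinson"]},
    {"type": "singleAnswer", "answer": ["Nick Robinson"]},
    {"type": "singleAnswer", "answer": ["Nick Robinson"]},
]

# my assumption: collapse the repeated annotations into one set of unique reference answers
reference_answers = sorted({a for ann in annotations for a in ann["answer"]})
print(reference_answers)  # ['Nick Robinson']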

Questions about SpanPredictor baseline

Hi,

I'm trying to reproduce the SpanPredictor + Thresholding baseline but have some questions about the actual implementation:

  1. In line 79 of span_predictor.py, you set sel_labels = torch.zeros(N, dtype=torch.long).cuda(). My understanding is that the sel module predicts whether the passage contains a correct answer or not. If that's the case, why is sel_labels set to all zeros?

  2. I've browsed through some data and found that some passages contain multiple correct answer spans. In such cases, did you take all the detected spans into account during training? (i.e., each passage can have multiple different correct answer spans, and we optimise the likelihood over all of them instead of just one.)

  3. How exactly is the threshold selected? I could not find the function / code snippet that selects the threshold for deciding how many answers to predict. (You use np.log(0.05) in AmbigQAData's evaluate function, but I'm not sure where that number comes from; my current reading is sketched after this list.)
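
My current reading of how that threshold might be applied, written out as a sketch (this is only my interpretation, not the actual code):

import numpy as np

# hypothetical span predictions as (answer text, log-probability) pairs
predictions = [("Ali Daei", -0.4), ("Josef Bican", -1.2), ("Pele", -4.1)]

threshold = np.log(0.05)  # the constant from AmbigQAData's evaluate; its origin is what I'm asking about
kept = [answer for answer, log_prob in predictions if log_prob > threshold]
print(kept)  # spans whose log-probability exceeds log(0.05) are kept as predicted answers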

Thanks.

Error while creating passage vectors

Hi @shmsw25, while running the command

for i in 0 1 2 3 4 5 6 7 8 9 ; do  # for parallelization
  python3 cli.py --do_predict --bert_name bert-base-uncased --output_dir out/dpr --do_predict --task dpr --predict_batch_size 3200 --db_index $i
done

I get an assertion error, assert args.bert_name=="bart-large", in PassageData.py.
Can you please help?
