
msmarco's People

Contributors

bmitra-msft, chsasank, kiharaito, lintool, lsfischer, msmarco, nasringithub, simra, yongbowin, youonlycompileonce


msmarco's Issues

Collection paragraph metadata

The collection.tsv file contains the paragraph contents. How do we get the metadata? It's in the full documents, but I don't see how the two can easily be linked.

KeyError for converttowellformed.py

  1. When running the converttowellformed.py script from utils, it works on the dev and train files but produces a KeyError for the eval file.
python converttowellformed.py eval_v2.1_public.json eval.json

Traceback (most recent call last):
  File "converttowellformed.py", line 14, in <module>
    makewf(sys.argv[1],sys.argv[2])
  File "converttowellformed.py", line 6, in makewf
    df = df.drop('answers',1)
  File "/home/sudeshna/envs/.env/lib/python3.5/site-packages/pandas/core/frame.py", line 3697, in drop
    errors=errors)
  File "/home/sudeshna/envs/.env/lib/python3.5/site-packages/pandas/core/generic.py", line 3111, in drop
    obj = obj._drop_axis(labels, axis, level=level, errors=errors)
  File "/home/sudeshna/envs/.env/lib/python3.5/site-packages/pandas/core/generic.py", line 3143, in _drop_axis
    new_axis = axis.drop(labels, errors=errors)
  File "/home/sudeshna/envs/.env/lib/python3.5/site-packages/pandas/core/indexes/base.py", line 4404, in drop
    '{} not found in axis'.format(labels[mask]))
KeyError: "['answers'] not found in axis"
  2. The JSON files created by running the converttowellformed.py script on the dev and train files do not contain wellFormedAnswers as a key.
import jsonlines

with jsonlines.open('dev.jsonl') as reader:
    for obj in reader:
        print(obj['query'])
        print(obj['wellFormedAnswers'])

albany mn population
Traceback (most recent call last):
  File "jsonl_reader.py", line 8, in <module>
    if obj['wellFormedAnswers']:
KeyError: 'wellFormedAnswers'

(I have converted the json file to jsonl prior to accessing it for the second example.)
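A minimal defensive version of makewf, sketched from the traceback above (an assumption about the script's intent, not the repository's actual code), would make the drop a no-op when the eval split lacks an 'answers' column:

import sys
import pandas as pd

def makewf(infile, outfile):
    df = pd.read_json(infile)
    # errors='ignore' skips the drop instead of raising KeyError
    # when the eval split has no 'answers' column.
    df = df.drop(columns=['answers'], errors='ignore')
    df.to_json(outfile)

if __name__ == '__main__':
    makewf(sys.argv[1], sys.argv[2])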

Include statistics on ranking dataset in documentation

Hi,

Responding to the request of feedback on the documentation, I have a suggestion.

To me it would have been helpful if the size of each split of the dataset were included in the documentation, as listed in issue #11. Additionally, it would be interesting to include other characteristics of the dataset, such as the average question length, the average passage length, and the number of unique passages included in the top 1000 ranking by BM25 (assuming this is a subset of the 8 million passages in the whole dataset).
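For instance, the question-length statistic could be computed with a rough sketch like the following, assuming the qid<TAB>query layout of the queries files (the file name here is just an example):

lengths = []
with open('queries.dev.tsv', encoding='utf-8') as f:
    for line in f:
        qid, query = line.rstrip('\n').split('\t', 1)
        lengths.append(len(query.split()))

print(f'{len(lengths)} queries, mean length {sum(lengths) / len(lengths):.2f} tokens')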

Thanks in advance!

Invalid line breaks in the top1000 TSV files of the reranking datasets

Describe the bug

The top1000 TSV files of the reranking datasets contain many spurious line breaks.
For example, line 234472 of top1000.dev.tsv does not start with the expected IDs.

To Reproduce

% sed -n 234471,234472p top1000.dev.tsv
1082445 3492590 what does unlock my device mean iOS: Understanding the SIM PIN.
 You can lock your SIM card so that it can't be used without a Personal Identification Number (PIN). You can use a SIM pin to prevent access to cellular data networks.In order to use cellular data, you must enter the PIN whenever you swap SIM cards or restart your iPhone or iPad (Wi-Fi + Cellular models).hen restoring the device, you will need to unlock the SIM card to complete the restore process. The device and iTunes display the following prompts to notify you: To complete the restore process: 1  Disconnect the device from your computer. 2  Tap Unlock on the device.
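A quick way to flag these stray continuation lines, assuming the qid<TAB>pid<TAB>query<TAB>passage layout of the top1000 files:

import re

pattern = re.compile(r'^\d+\t\d+\t')  # valid lines start with two integer IDs
with open('top1000.dev.tsv', encoding='utf-8') as f:
    for lineno, line in enumerate(f, start=1):
        if not pattern.match(line):
            print('stray continuation at line', lineno, repr(line[:60]))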

Passage IDs in Qna Dataset

Is your feature request related to a problem? Please describe.
For the Passage Ranking task, many researchers would like to join the URL data from the QnA dataset with the passage-ranking passages. Currently this is not possible, since the QnA dataset does not contain any of the passage IDs.
Describe the solution you'd like
Passage IDs in the QnA dataset

Describe alternatives you've considered
N/A

Additional context
N/A

[encoding,Â] top1000.dev.tsv

I read the content from top1000.dev.tsv, but ran into many encoding problems.

For example:

Correct text in queries.tsv: what complication is a potential danger associated with continuous iv infusions?

Problem text in top1000.dev.tsv: the same query, but with stray 'Â' characters embedded in the text.

Many passages in top1000.dev.tsv show this problem; the affected text contains 'Â'.
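The stray 'Â' is the usual signature of UTF-8 bytes decoded as Latin-1 (for example, a non-breaking space U+00A0 is the byte pair 0xC2 0xA0, which Latin-1 renders as 'Â' plus a space). A hedged repair sketch, not an official fix:

def fix_mojibake(text):
    # Re-encoding as Latin-1 and decoding as UTF-8 reverses the
    # wrong-codec round trip that produced the stray 'Â'.
    try:
        return text.encode('latin-1').decode('utf-8')
    except (UnicodeEncodeError, UnicodeDecodeError):
        return text  # already clean, or mixed encodings; leave unchanged

print(fix_mojibake('what\u00c2\u00a0complication is a potential danger'))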

How were passage reranking triples generated?

Hi,

It's not clear how triples were generated. Your documentation says:

These triples (available in small and large, 27GB and 270GB respectively) contain a query followed by a positive passage and a negative passage.

But it also says

Then, the existing dataset has an annotation of is_selected:1 if a judge used a passage to generate their answer. We consider these as a ranking signal where all passages that have a value of 1 are a true positive for query relevance for that given query. Any passage that has a value of 0 is not a true negative.

How are negative passages generated if is_selected:0 is not a true negative? Could you please open-source the code used to generate these triples?

I think the documentation for the dataset needs work. Given the usefulness of the dataset, it would be a shame if people were unable to use it because of the documentation.
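For context, here is a hedged sketch of one plausible triple-generation scheme (the actual code is exactly what this issue asks to be released): the positive comes from qrels, and the negative is sampled from the query's BM25 top-1000 minus the qrels passages.

import random

def sample_triple(qid, qrels_pids, top1000_pids):
    # qrels_pids: set of is_selected:1 passage ids for this query
    # top1000_pids: passage ids retrieved by BM25 for this query
    negatives = [pid for pid in top1000_pids if pid not in qrels_pids]
    if not qrels_pids or not negatives:
        return None  # cannot form a triple for this query
    return qid, random.choice(sorted(qrels_pids)), random.choice(negatives)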

BM25 relevance values for top 1000 eval/dev?

In the documents:

We collected all unique passages (without any normalization) to make a pool of ~8.8 million unique passages. Then, for each query from the existing MSMARCO splits (train, dev, and eval) we ran a standard BM25 to produce 1000 relevant passages. These were ordered randomly, so each query now has 1000 corresponding passages.

Why did you order the top-1000 retrieved documents randomly instead of storing the BM25 relevance scores?

Partially duplicated passages extracted

Describe the bug
This is a bug in the ranking passages collection: some passages are partially duplicated. The duplicated copy is also truncated incorrectly, leaving partial words cut off.

To Reproduce
Below are three example passages.

A search index is available at:
http://boston.lti.cs.cmu.edu/Services/treccast19

http://boston.lti.cs.cmu.edu/Services/treccast19/lemur.cgi?d=0&i=7564736
Average physician assistant salary. Physician assistant’s salary is ranging from $68,587 to $117,554 pay per year. The average physician assistant’s salary is $87,749. Generally, a new physician assistant earns an hourly pay ranging from $28.13 to 50.00.Read more about average PA salary in USA.verage physician assistant salary. Physician assistant’s salary is ranging from $68,587 to $117,554 pay per year. The average physician assistant’s salary is $87,749. Generally, a new physician assistant earns an hourly pay ranging from $28.13 to 50.00.

In the above, the duplicated text starts at ".verage physician assistant salary.." and continues.

http://boston.lti.cs.cmu.edu/Services/treccast19/lemur.cgi?d=0&i=107258
This is a list of goat breeds. There are many recognized breeds of domestic goat (Capra aegagrus hircus) . Goat breeds (especially dairy goats) are some of the oldest defined animal breeds for which breed standards and production records have been kept.his is a list of goat breeds. There are many recognized breeds of domestic goat (Capra aegagrus hircus) . Goat breeds (especially dairy goats) are some of the oldest defined animal breeds for which breed standards and productio

In the above passage, the duplicated portion starts at "There are many recognized breeds" and is truncated mid-word.

http://boston.lti.cs.cmu.edu/Services/treccast19/lemur.cgi?d=0&i=15744716
Jamunapari Goat. Jamunapari goat is a very beautiful dairy goat breed which was originated from India. This breed first introduced near a river of Uttar Pradesh named Jamuna. Since this breed is mostly known as Jamunapari goat. They are also known as some other names like Jamnapari, Ram Sagol etc.They become highly meat and milk productive and also very suitable for show. In India this goat breed is considered as the best dairy goat.They are one of the giant goat breeds with a pair of very long ear. The physical characteristics, feeding, housing, breeding and caring of Jamunapari goat are described below.hey are also known as some other names like Jamnapari, Ram Sagol etc. They become highly meat and milk productive and also very suitable for show. In India this goat breed is considered as the best dairy goat. They are one of the giant goat breeds with a pair of very long ear.

The duplicated section starts at ".hey are also known as some other names like Jamnapari, Ram Sagol etc..." and again does not begin on a word boundary.
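A crude heuristic for flagging these partial duplicates: test whether a long chunk from near the start of the passage re-occurs later in the same passage. The probe skips the first character because the duplicated copy often drops it (".verage physician assistant salary" above); the thresholds here are arbitrary assumptions.

def looks_partially_duplicated(passage, probe_len=60):
    probe = passage[1:probe_len]  # skip the often-dropped first character
    return len(probe) > 20 and probe in passage[probe_len:]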


some bugs about the code

Some adaptation of the downloaded code is needed, otherwise it generates errors. (We could change the output files instead, but I simply changed the downloaded code.)

  1. In Evaluation/bleu/bleu.py, line 22, assert(list(gts.keys()) == list(res.keys())) should be changed to
    assert(sorted(gts.keys()) == sorted(res.keys()))
  2. In Evaluation/rouge/rouge.py, line 85, assert(list(gts.keys()) == list(res.keys())) should be changed to
    assert(sorted(gts.keys()) == sorted(res.keys()))
  3. In Evaluation/ms_marco_eval.py, lines 132-133,
    precision = true_positives/(true_positives+false_positives)
    recall = true_positives/(true_positives+false_negatives)
    should be changed to
    precision = float(true_positives)/(true_positives+false_positives) if (true_positives+false_positives)>0 else 1.
    recall = float(true_positives)/(true_positives+false_negatives) if (true_positives+false_negatives)>0 else 1.
    to avoid division by zero.

keyerror

Why is there a KeyError when I run predict on dev_v2.1.json?
Here is the traceback:
"Augmenting with pre-trained embeddings...
Augmenting with random char embeddings...
Traceback (most recent call last):
File "scripts/predict.py", line 189, in
main()
File "scripts/predict.py", line 182, in main
toks = ' '.join(id_to_token[tok] for tok in toks)
File "scripts/predict.py", line 182, in
toks = ' '.join(id_to_token[tok] for tok in toks)
KeyError: tensor(507946)"
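A likely cause, though not confirmed in this thread: tok is a PyTorch tensor while id_to_token is keyed by plain Python ints, so the dictionary lookup fails. Converting the tensor with .item() before indexing would restore the int key:

toks = ' '.join(id_to_token[tok.item()] for tok in toks)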

Uncommon train / dev / test split of ranking dataset

Hi,

I have two questions about the train/dev/test split of the ranking dataset. I noted that:

  • The train set queries.train consists of 502,939 questions, of which all 502,939 have at least 1 answer in qrels.train.
  • The dev set queries.dev consists of 12,665 questions, of which only 6,986 have at least 1 answer in qrels.dev.
  • The test set queries.eval consists of 12,560 questions.

Now, my questions are:

  1. Why was roughly a 40:1:1 split made instead of e.g. a more common 8:1:1 split?
  2. Why do only (roughly) 55% of the queries in dev have an answer whereas 100% of the queries in train have an answer?

Thanks in advance!

Need more explanation about Reranking Dataset

In the relevant passages file, I am a bit confused about what each column value in a row means.

Such as

1185869 0       0       1
1185868 0       16      1
597651  0       49      1
403613  0       60      1
1183785 0       389     1
312651  0       616     1
80385   0       723     1
645590  0       944     1
645337  0       1054    1
186154  0       1160    1

Given your explanation, column 0 is the queryID and column 2 is the passageID; what do columns 1 and 3 mean in this case?

Thanks
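For reference, these four columns are consistent with the standard TREC qrels layout: query ID, iteration (always 0), passage ID, and relevance label. A parsing sketch under that assumption:

with open('qrels.dev.tsv', encoding='utf-8') as f:
    for line in f:
        qid, iteration, pid, label = line.split()
        # MS MARCO qrels list only relevant pairs, so the label is always '1'
        assert iteration == '0' and label == '1'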

Full documents: possibly incorrect tokenization in document_text

I am inspecting the contents of fulldocuments.json and notice possible issues with the "document_text" field. In particular, the text seems to be somewhat incorrect, possibly because breaks were not inserted between page elements.

Examples:
https://answers.yahoo.com/question/index?qid=20071007114826AAwCFvR
example text:
show moreFollow 3 answersAnswersRelevanceRatingNewestOldestBest Answer

http://childparenting.about.com/od/physicalemotionalgrowth/tp/Child-Development-Your-Eight-Year-Old-Child.htm
.4 Cognitive DevelopmentTom Merton/Getty ImagesEight-year-old children are at a stage of intellectual development where they will be able to pay attention for longer periods of time.

Regarding the Test Set for Q&A

Hi,

On your website you have training, dev, and evaluation sets for Q&A. I want to know: what is the difference between the dev set and the evaluation set? And where can I find the test data to submit results for the leaderboard?

encoding, top1000.train, qrels.train

Hey @dfcf93, I have some questions:

  1. What is the encoding of the text? I read it as UTF-8 but see incorrect encoding in collections.tsv, and for some passages in triples.train.small I was not able to find the passageId in collections.tsv.
  2. In top1000.train there are NOT 1000 passages for each query; some queries have only 1-10 passages. In top1000.dev, most queries (but not all) do have 1000 passages.
  3. In qrels.train there are 4 columns. From the documentation, column 0 is the queryID and column 2 is the passageID; what are columns 1 and 3? Which one is is_selected, and what is the other? Here are example rows from the documentation.
1185868 0       16      1
597651  0       49      1
403613  0       60      1

Training data with QID and PID

Thank you for creating such a great dataset for passage re-ranking!

I'm wondering if it is possible to release the top 1000 passages retrieved for each training/dev/test query with the corresponding QID and PID. The current training data is constructed from the raw text of queries and passages, which makes it too large to use conveniently. Also, since the qrels files are already constructed with QID and PID, it would make life much easier if the train/dev/test data were constructed with QID and PID as well.

How to understand the gain score about 'No Answer Present' ?

Hello, I can't understand the following sentences in the Readme:
Given a query q and passages P, if the reference answer is 'No Answer Present' and the system produces an answer, award a score of 0. If the reference answer is not 'No Answer Present' and the system produces 'No Answer Present', award a score of 0. All other options gain a score of 1.
How is this score used in evaluation? Are there any examples?
Also, in the evaluation script ms_marco_eval.py, I found that it filters out queries with no reference answer. Is the answer [''] equivalent to the answer ['No Answer Present']?
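The rule reduces to: the system scores 1 exactly when it agrees with the reference about whether an answer is present. A sketch of that reading (a hypothetical helper, not the actual ms_marco_eval.py code):

def no_answer_gain(reference, prediction):
    ref_no_answer = reference == 'No Answer Present'
    pred_no_answer = prediction == 'No Answer Present'
    return 1 if ref_no_answer == pred_no_answer else 0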
