
msmarco's People

Contributors

bmitra-msft, chsasank, kiharaito, lintool, lsfischer, msmarco, nasringithub, simra, yongbowin, youonlycompileonce


msmarco's Issues

Collection paragraph metadata

The collection.tsv file contains the paragraph contents. How do we get the metadata? It's in the full documents, but I don't see how the two can easily be linked.

KeyError for converttowellformed.py

  1. When running the converttowellformed.py script from utils, it works on the dev and train files but produces a KeyError for the eval file.
python converttowellformed.py eval_v2.1_public.json eval.json

Traceback (most recent call last):
  File "converttowellformed.py", line 14, in <module>
    makewf(sys.argv[1],sys.argv[2])
  File "converttowellformed.py", line 6, in makewf
    df = df.drop('answers',1)
  File "/home/sudeshna/envs/.env/lib/python3.5/site-packages/pandas/core/frame.py", line 3697, in drop
    errors=errors)
  File "/home/sudeshna/envs/.env/lib/python3.5/site-packages/pandas/core/generic.py", line 3111, in drop
    obj = obj._drop_axis(labels, axis, level=level, errors=errors)
  File "/home/sudeshna/envs/.env/lib/python3.5/site-packages/pandas/core/generic.py", line 3143, in _drop_axis
    new_axis = axis.drop(labels, errors=errors)
  File "/home/sudeshna/envs/.env/lib/python3.5/site-packages/pandas/core/indexes/base.py", line 4404, in drop
    '{} not found in axis'.format(labels[mask]))
KeyError: "['answers'] not found in axis"
  2. The JSON files created by running the converttowellformed.py script on the dev and train files do not contain wellFormedAnswers as a key.
import jsonlines

with jsonlines.open('dev.jsonl') as reader:
    for obj in reader:
        print(obj['query'])
        print(obj['wellFormedAnswers'])

albany mn population
Traceback (most recent call last):
  File "jsonl_reader.py", line 8, in <module>
    if obj['wellFormedAnswers']:
KeyError: 'wellFormedAnswers'

(I have converted the json file to jsonl prior to accessing it for the second example.)
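A minimal defensive version of makewf, sketched from the traceback above (an assumption about the script's intent, not the repository's actual code), would make the drop a no-op when the eval split lacks an 'answers' column:

import sys
import pandas as pd

def makewf(infile, outfile):
    df = pd.read_json(infile)
    # errors='ignore' skips the drop instead of raising KeyError
    # when the eval split has no 'answers' column.
    df = df.drop(columns=['answers'], errors='ignore')
    df.to_json(outfile)

if __name__ == '__main__':
    makewf(sys.argv[1], sys.argv[2])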

Include statistics on ranking dataset in documentation

Hi,

Responding to the request of feedback on the documentation, I have a suggestion.

To me it would have been helpful if the size of each split of the dataset were included in the documentation, as listed in issue #11. Additionally, it would be interesting to include other characteristics of the dataset, such as the average question length, the average passage length, and the number of unique passages included in the top 1000 ranking by BM25 (assuming this is a subset of the 8 million passages in the whole dataset).
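For instance, the question-length statistic could be computed with a rough sketch like the following, assuming the qid<TAB>query layout of the queries files (the file name here is just an example):

lengths = []
with open('queries.dev.tsv', encoding='utf-8') as f:
    for line in f:
        qid, query = line.rstrip('\n').split('\t', 1)
        lengths.append(len(query.split()))

print(f'{len(lengths)} queries, mean length {sum(lengths) / len(lengths):.2f} tokens')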

Thanks in advance!

Invalid line breaks in the top1000 TSV files of the reranking datasets

Describe the bug

The top1000 TSV files of the reranking datasets contain many spurious line breaks.
For example, line 234472 of top1000.dev.tsv does not start with the expected IDs.

To Reproduce

% sed -n 234471,234472p top1000.dev.tsv
1082445 3492590 what does unlock my device mean iOS: Understanding the SIM PIN.
 You can lock your SIM card so that it can't be used without a Personal Identification Number (PIN). You can use a SIM pin to prevent access to cellular data networks.In order to use cellular data, you must enter the PIN whenever you swap SIM cards or restart your iPhone or iPad (Wi-Fi + Cellular models).hen restoring the device, you will need to unlock the SIM card to complete the restore process. The device and iTunes display the following prompts to notify you: To complete the restore process: 1  Disconnect the device from your computer. 2  Tap Unlock on the device.
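A quick way to flag these stray continuation lines, assuming the qid<TAB>pid<TAB>query<TAB>passage layout of the top1000 files:

import re

pattern = re.compile(r'^\d+\t\d+\t')  # valid lines start with two integer IDs
with open('top1000.dev.tsv', encoding='utf-8') as f:
    for lineno, line in enumerate(f, start=1):
        if not pattern.match(line):
            print('stray continuation at line', lineno, repr(line[:60]))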

Passage IDs in Qna Dataset

Is your feature request related to a problem? Please describe.
For the Passage Ranking task, many researchers would like to join the URL data from the QnA dataset with the passage-ranking passages. Currently this is not possible, since the QnA dataset does not contain any of the passage IDs.
Describe the solution you'd like
Passage IDs in the QnA dataset

Describe alternatives you've considered
N/A

Additional context
N/A

[encoding,Â] top1000.dev.tsv

I read the content from top1000.dev.tsv, but ran into many encoding problems.

For example:

Correct text in queries.tsv: what complication is a potential danger associated with continuous iv infusions?

Problem text in top1000.dev.tsv: the same query, but with stray 'Â' characters embedded in the text.

Many passages in top1000.dev.tsv show this problem; the affected text contains 'Â'.
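The stray 'Â' is the usual signature of UTF-8 bytes decoded as Latin-1 (for example, a non-breaking space U+00A0 is the byte pair 0xC2 0xA0, which Latin-1 renders as 'Â' plus a space). A hedged repair sketch, not an official fix:

def fix_mojibake(text):
    # Re-encoding as Latin-1 and decoding as UTF-8 reverses the
    # wrong-codec round trip that produced the stray 'Â'.
    try:
        return text.encode('latin-1').decode('utf-8')
    except (UnicodeEncodeError, UnicodeDecodeError):
        return text  # already clean, or mixed encodings; leave unchanged

print(fix_mojibake('what\u00c2\u00a0complication is a potential danger'))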

How were passage reranking triples generated?

Hi,

It's not clear how triples were generated. Your documentation says:

These triples (available in small and large, 27GB and 270GB respectively) contain a query followed by a positive passage and a negative passage.

But it also says

Then, the existing dataset has an annotation of is_selected:1 if a judge used a passage to generate their answer. We consider these as a ranking signal where all passages that have a value of 1 are a true positive for query relevance for that given query. Any passage that has a value of 0 is not a true negative.

How are negative passages generated if is_selected:0 is not a true negative? Could you please open-source the code used to generate these triples?

I think the documentation for the dataset needs work. Given the usefulness of the dataset, it would be a shame if people were unable to use it because of the documentation.
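For context, here is a hedged sketch of one plausible triple-generation scheme (the actual code is exactly what this issue asks to be released): the positive comes from qrels, and the negative is sampled from the query's BM25 top-1000 minus the qrels passages.

import random

def sample_triple(qid, qrels_pids, top1000_pids):
    # qrels_pids: set of is_selected:1 passage ids for this query
    # top1000_pids: passage ids retrieved by BM25 for this query
    negatives = [pid for pid in top1000_pids if pid not in qrels_pids]
    if not qrels_pids or not negatives:
        return None  # cannot form a triple for this query
    return qid, random.choice(sorted(qrels_pids)), random.choice(negatives)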

BM25 relevance values for top 1000 eval/dev?

In the documents:

We collected all unique passages (without any normalization) to make a pool of ~8.8 million unique passages. Then, for each query from the existing MSMARCO splits (train, dev, and eval) we ran a standard BM25 to produce 1000 relevant passages. These were ordered randomly, so each query now has 1000 corresponding passages.

Why did you order the top-1000 retrieved documents randomly instead of storing the BM25 relevance scores?

Partially duplicated passages extracted

Describe the bug
This is a bug in the ranking passages collection: some passages are partially duplicated. The duplicated copy is also truncated incorrectly, leaving partial words cut off.

To Reproduce
Below are three example passages.

A search index is available at:
http://boston.lti.cs.cmu.edu/Services/treccast19

http://boston.lti.cs.cmu.edu/Services/treccast19/lemur.cgi?d=0&i=7564736
Average physician assistant salary. Physician assistant’s salary is ranging from $68,587 to $117,554 pay per year. The average physician assistant’s salary is $87,749. Generally, a new physician assistant earns an hourly pay ranging from $28.13 to 50.00.Read more about average PA salary in USA.verage physician assistant salary. Physician assistant’s salary is ranging from $68,587 to $117,554 pay per year. The average physician assistant’s salary is $87,749. Generally, a new physician assistant earns an hourly pay ranging from $28.13 to 50.00.

In the above, the duplicated text starts at ".verage physician assistant salary.." and continues.

http://boston.lti.cs.cmu.edu/Services/treccast19/lemur.cgi?d=0&i=107258
This is a list of goat breeds. There are many recognized breeds of domestic goat (Capra aegagrus hircus) . Goat breeds (especially dairy goats) are some of the oldest defined animal breeds for which breed standards and production records have been kept.his is a list of goat breeds. There are many recognized breeds of domestic goat (Capra aegagrus hircus) . Goat breeds (especially dairy goats) are some of the oldest defined animal breeds for which breed standards and productio

In the above passage, the duplicated portion starts at "There are many recognized breeds" and is truncated mid-word.

http://boston.lti.cs.cmu.edu/Services/treccast19/lemur.cgi?d=0&i=15744716
Jamunapari Goat. Jamunapari goat is a very beautiful dairy goat breed which was originated from India. This breed first introduced near a river of Uttar Pradesh named Jamuna. Since this breed is mostly known as Jamunapari goat. They are also known as some other names like Jamnapari, Ram Sagol etc.They become highly meat and milk productive and also very suitable for show. In India this goat breed is considered as the best dairy goat.They are one of the giant goat breeds with a pair of very long ear. The physical characteristics, feeding, housing, breeding and caring of Jamunapari goat are described below.hey are also known as some other names like Jamnapari, Ram Sagol etc. They become highly meat and milk productive and also very suitable for show. In India this goat breed is considered as the best dairy goat. They are one of the giant goat breeds with a pair of very long ear.

The duplicated section starts at ".hey are also known as some other names like Jamnapari, Ram Sagol etc..." and again does not begin on a word boundary.
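A crude heuristic for flagging these partial duplicates: test whether a long chunk from near the start of the passage re-occurs later in the same passage. The probe skips the first character because the duplicated copy often drops it (".verage physician assistant salary" above); the thresholds here are arbitrary assumptions.

def looks_partially_duplicated(passage, probe_len=60):
    probe = passage[1:probe_len]  # skip the often-dropped first character
    return len(probe) > 20 and probe in passage[probe_len:]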


some bugs about the code

Some adaptation of the downloaded code is needed, otherwise it generates errors. (We could change the output files instead, but I simply changed the downloaded code.)

  1. In Evaluation/bleu/bleu.py, line 22, assert(list(gts.keys()) == list(res.keys())) should be changed to
    assert(sorted(gts.keys()) == sorted(res.keys()))
  2. In Evaluation/rouge/rouge.py, line 85, assert(list(gts.keys()) == list(res.keys())) should be changed to
    assert(sorted(gts.keys()) == sorted(res.keys()))
  3. In Evaluation/ms_marco_eval.py, lines 132-133,
    precision = true_positives/(true_positives+false_positives)
    recall = true_positives/(true_positives+false_negatives)
    should be changed to
    precision = float(true_positives)/(true_positives+false_positives) if (true_positives+false_positives)>0 else 1.
    recall = float(true_positives)/(true_positives+false_negatives) if (true_positives+false_negatives)>0 else 1.
    to avoid division by zero.

keyerror

Why is there a KeyError when I run predict on dev_v2.1.json?
Here is the traceback:
"Augmenting with pre-trained embeddings...
Augmenting with random char embeddings...
Traceback (most recent call last):
File "scripts/predict.py", line 189, in
main()
File "scripts/predict.py", line 182, in main
toks = ' '.join(id_to_token[tok] for tok in toks)
File "scripts/predict.py", line 182, in
toks = ' '.join(id_to_token[tok] for tok in toks)
KeyError: tensor(507946)"
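A likely cause, though not confirmed in this thread: tok is a PyTorch tensor while id_to_token is keyed by plain Python ints, so the dictionary lookup fails. Converting the tensor with .item() before indexing would restore the int key:

toks = ' '.join(id_to_token[tok.item()] for tok in toks)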

Uncommon train / dev / test split of ranking dataset

Hi,

I have two questions about the train/dev/test split of the ranking dataset. I noted that:

  • The train set queries.train consists of 502,939 questions, of which all 502,939 have at least 1 answer in qrels.train.
  • The dev set queries.dev consists of 12,665 questions, of which only 6,986 have at least 1 answer in qrels.dev.
  • The test set queries.eval consists of 12,560 questions.

Now, my questions are:

  1. Why was roughly a 40:1:1 split made instead of e.g. a more common 8:1:1 split?
  2. Why do only (roughly) 55% of the queries in dev have an answer whereas 100% of the queries in train have an answer?

Thanks in advance!

Need more explanation about Reranking Dataset

In the relevant passages file, I am a bit confused about what each column value in a row means.

Such as

1185869 0       0       1
1185868 0       16      1
597651  0       49      1
403613  0       60      1
1183785 0       389     1
312651  0       616     1
80385   0       723     1
645590  0       944     1
645337  0       1054    1
186154  0       1160    1

Given your explanation, column 0 is the queryID and column 2 is the passageID; what do columns 1 and 3 mean in this case?

Thanks
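For reference, these four columns are consistent with the standard TREC qrels layout: query ID, iteration (always 0), passage ID, and relevance label. A parsing sketch under that assumption:

with open('qrels.dev.tsv', encoding='utf-8') as f:
    for line in f:
        qid, iteration, pid, label = line.split()
        # MS MARCO qrels list only relevant pairs, so the label is always '1'
        assert iteration == '0' and label == '1'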

Full documents: possibly incorrect tokenization in document_text

I am inspecting the contents of fulldocuments.json and notice possible issues with the "document_text" field. In particular, the text seems to be somewhat incorrect, possibly because breaks were not inserted between page elements.

Examples:
https://answers.yahoo.com/question/index?qid=20071007114826AAwCFvR
example text:
show moreFollow 3 answersAnswersRelevanceRatingNewestOldestBest Answer

http://childparenting.about.com/od/physicalemotionalgrowth/tp/Child-Development-Your-Eight-Year-Old-Child.htm
.4 Cognitive DevelopmentTom Merton/Getty ImagesEight-year-old children are at a stage of intellectual development where they will be able to pay attention for longer periods of time.

Regarding the Test Set for Q&A

Hi,

On your website you have training, dev, and evaluation sets for Q&A. I want to know: what is the difference between the dev set and the evaluation set? And where can I find the test data to submit results for the leaderboard?

encoding, top1000.train, qrels.train

Hey @dfcf93, I have some questions:

  1. What is the encoding of the text? I read it as UTF-8 but see incorrect encoding in collections.tsv, and for some passages in triples.train.small I was not able to find the passageId in collections.tsv.
  2. In top1000.train there are NOT 1000 passages for each query; some queries have only 1-10 passages. In top1000.dev, most queries (but not all) do have 1000 passages.
  3. In qrels.train there are 4 columns. From the documentation, column 0 is the queryID and column 2 is the passageID; what are columns 1 and 3? Which one is is_selected, and what is the other? Here are example rows from the documentation.
1185868 0       16      1
597651  0       49      1
403613  0       60      1

Training data with QID and PID

Thank you for creating such a great dataset for passage re-ranking!

I'm wondering if it is possible to release the top 1000 passages retrieved for each training/dev/test query with the corresponding QID and PID. The current training data is constructed from the raw text of queries and passages, which makes it too large to use conveniently. Also, since the qrels files are already constructed with QID and PID, it would make life much easier if the train/dev/test data were constructed with QID and PID as well.

How to understand the gain score about 'No Answer Present' ?

Hello, I can't understand the following sentences in the Readme:
Given a query q and passages P, if the reference answer is 'No Answer Present' and the system produces an answer, award a score of 0. If the reference answer is not 'No Answer Present' and the system produces 'No Answer Present', award a score of 0. All other options gain a score of 1.
How is this score used in evaluation? Are there any examples?
Also, in the evaluation script ms_marco_eval.py, I found that it filters out queries with no reference answer. Is the answer [''] equivalent to the answer ['No Answer Present']?
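The rule reduces to: the system scores 1 exactly when it agrees with the reference about whether an answer is present. A sketch of that reading (a hypothetical helper, not the actual ms_marco_eval.py code):

def no_answer_gain(reference, prediction):
    ref_no_answer = reference == 'No Answer Present'
    pred_no_answer = prediction == 'No Answer Present'
    return 1 if ref_no_answer == pred_no_answer else 0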
