spacemanidol / msmarco
Utilities, Baselines, Statistics and Descriptions Related to the MSMARCO DATASET
License: MIT License
The collection.tsv contains the paragraph contents. How do we get the metadata? It's in the full documents, but I don't see how these can easily be linked.
In the sentence "The link Official evaluation scripts and samples are availible Here", the link https://github.com/dfcf93/MSMARCOV2/tree/master/Ranking/Evaluation is broken.
python converttowellformed.py eval_v2.1_public.json eval.json
Traceback (most recent call last):
File "converttowellformed.py", line 14, in <module>
makewf(sys.argv[1],sys.argv[2])
File "converttowellformed.py", line 6, in makewf
df = df.drop('answers',1)
File "/home/sudeshna/envs/.env/lib/python3.5/site-packages/pandas/core/frame.py", line 3697, in drop
errors=errors)
File "/home/sudeshna/envs/.env/lib/python3.5/site-packages/pandas/core/generic.py", line 3111, in drop
obj = obj._drop_axis(labels, axis, level=level, errors=errors)
File "/home/sudeshna/envs/.env/lib/python3.5/site-packages/pandas/core/generic.py", line 3143, in _drop_axis
new_axis = axis.drop(labels, errors=errors)
File "/home/sudeshna/envs/.env/lib/python3.5/site-packages/pandas/core/indexes/base.py", line 4404, in drop
'{} not found in axis'.format(labels[mask]))
KeyError: "['answers'] not found in axis"
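A possible workaround (a sketch, assuming the eval split simply lacks the answers column) is to tell pandas drop() to ignore missing labels instead of raising:

```python
import pandas as pd

# Possible workaround sketch for makewf(): the eval split appears to lack an
# 'answers' column, so ask drop() to ignore missing labels instead of raising.
# The toy frame below stands in for the real eval data.
df = pd.DataFrame({"query": ["albany mn population"], "query_id": [1]})
df = df.drop(columns=["answers"], errors="ignore")  # no KeyError now
```

Using the `columns=` keyword also avoids the deprecated positional `axis` argument seen in `df.drop('answers', 1)`.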
Not every record has wellFormedAnswers as a key.

with jsonlines.open('dev.jsonl') as reader:
    for obj in reader:
        print(obj['query'])
        print(obj['wellFormedAnswers'])
albany mn population
Traceback (most recent call last):
File "jsonl_reader.py", line 8, in <module>
if obj['wellFormedAnswers']:
KeyError: 'wellFormedAnswers'
(I have converted the json file to jsonl prior to accessing it for the second example.)
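A defensive read that tolerates the missing key could look like this (a sketch using dict.get() with an empty-list fallback; the sample records are invented):

```python
import json

# Defensive read sketch: some records lack 'wellFormedAnswers', so fall back
# to an empty list with dict.get(). The two sample records are invented.
records = [
    '{"query": "albany mn population", "wellFormedAnswers": ["2,662"]}',
    '{"query": "what is a cat"}',
]
answers = [json.loads(r).get("wellFormedAnswers", []) for r in records]
```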
This error is encountered while trying to run scripts/train.py. It seems that mrcqa.modules is not present.
Hi,
Responding to the request of feedback on the documentation, I have a suggestion.
To me it would be helpful if the size of each split of the dataset were included in the documentation, as noted in issue #11. Additionally, it would be interesting to include other characteristics of the dataset, such as the average question length, the average passage length, and the number of unique passages included in the top 1000 ranking by BM25 (assuming this is a subset of the 8 million passages in the whole dataset).
Thanks in advance!
Describe the bug
The top1000 TSV files of the reranking datasets contain many spurious line breaks.
For example, line 234472 of top1000.dev.tsv does not start with the expected IDs.
To Reproduce
% sed -n 234471,234472p top1000.dev.tsv
1082445 3492590 what does unlock my device mean iOS: Understanding the SIM PIN.
You can lock your SIM card so that it can't be used without a Personal Identification Number (PIN). You can use a SIM pin to prevent access to cellular data networks.In order to use cellular data, you must enter the PIN whenever you swap SIM cards or restart your iPhone or iPad (Wi-Fi + Cellular models).hen restoring the device, you will need to unlock the SIM card to complete the restore process. The device and iTunes display the following prompts to notify you: To complete the restore process: 1 Disconnect the device from your computer. 2 Tap Unlock on the device.
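A quick sanity check can flag rows broken by stray line breaks (a sketch; the four-column qid/pid/query/passage layout is my assumption about these files):

```python
# Sanity-check sketch for top1000-style rows: a well-formed line should have
# four tab-separated fields starting with two numeric IDs (qid, pid). Rows
# created by stray line breaks fail this check. The 4-column layout is my
# assumption about these files.
def is_valid_row(line):
    fields = line.rstrip("\n").split("\t")
    return len(fields) == 4 and fields[0].isdigit() and fields[1].isdigit()

good = "1082445\t3492590\twhat does unlock my device mean\tiOS: Understanding the SIM PIN."
bad = "You can lock your SIM card so that it can't be used without a PIN."
```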
Is your feature request related to a problem? Please describe.
For the Passage Ranking task, many researchers would like to join the URL data from the QnA dataset with the Passage Ranking passages. Currently this is not possible, since the QnA dataset does not contain any of the passageIDs.
Describe the solution you'd like
Passage Ids in QnA dataset
Describe alternatives you've considered
N/A
Additional context
N/A
I read the content from top1000.dev.tsv but ran into many encoding problems.
For example:
Correct text in queries.tsv: what complication is a potential danger associated with continuous iv infusions?
Problem text in top1000.dev.tsv: what complication is a potential danger associated with continuous iv infusions?
This problem appears in many places in top1000.dev.tsv where the text contains  characters.
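One way to locate such rows (a sketch, assuming the file is read as UTF-8 with errors="replace", which turns invalid bytes into U+FFFD):

```python
# Sketch: locate mojibake by counting U+FFFD replacement characters, which is
# what invalid bytes become when a file is read with errors="replace".
def count_mojibake(text):
    return text.count("\ufffd")

clean = "what complication is a potential danger associated with continuous iv infusions?"
broken = clean + "\ufffd"
```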
It seems that dev_as_references.json is for V1? Do you have the corresponding file for V2?
Hi,
It's not clear how triples were generated. Your documentation says:
These triples (available in small and large, 27GB and 270GB respectively) contain a query followed by a positive passage and a negative passage.
But it also says
Then, the existing dataset has an annotation of is_selected:1 if a judge used a passage to generate their answer. We consider these a ranking signal where all passages that have a value of 1 are a true positive for query relevance for that given passage. Any passage that has a value of 0 is not a true negative.
How are negative passages generated if is_selected:0 is not a true negative? Can you please open-source the code used to generate these triples?
I think documentation for the dataset needs work. Given the usefulness of the dataset, it's a shame if people are unable to use it because of documentation.
In the documents:
We collected all unique passages(without any normalization) to make a pool of ~8.8 million unique passages. Then, for each query from the existing MSMARCO splits(train,dev, and eval) we ran a standard BM25 to produce 1000 relevant passages. These were ordered by random so each query now has 1000 corresponding passages.
Why did you order the top-1000 retrieved documents randomly instead of storing the BM25 relevance scores?
Thanks for the great work!
Could you let me know how to access or generate the OpenKPAnnotations.tsv
file used in the notebook for key phrase extraction please?
Describe the bug
This is a bug in the ranking passages collection: some passages are partially duplicated. The duplicated text is also truncated incorrectly, leaving partial words cut off.
To Reproduce
Below are three example passages.
A search index is available at:
http://boston.lti.cs.cmu.edu/Services/treccast19
http://boston.lti.cs.cmu.edu/Services/treccast19/lemur.cgi?d=0&i=7564736
Average physician assistant salary. Physician assistant’s salary is ranging from $68,587 to $117,554 pay per year. The average physician assistant’s salary is $87,749. Generally, a new physician assistant earns an hourly pay ranging from $28.13 to 50.00.Read more about average PA salary in USA.verage physician assistant salary. Physician assistant’s salary is ranging from $68,587 to $117,554 pay per year. The average physician assistant’s salary is $87,749. Generally, a new physician assistant earns an hourly pay ranging from $28.13 to 50.00.
In the above the duplicate passage starts at ".verage physician assistant salary.." and continues.
http://boston.lti.cs.cmu.edu/Services/treccast19/lemur.cgi?d=0&i=107258
This is a list of goat breeds. There are many recognized breeds of domestic goat (Capra aegagrus hircus) . Goat breeds (especially dairy goats) are some of the oldest defined animal breeds for which breed standards and production records have been kept.his is a list of goat breeds. There are many recognized breeds of domestic goat (Capra aegagrus hircus) . Goat breeds (especially dairy goats) are some of the oldest defined animal breeds for which breed standards and productio
In the above passage the duplicate portion starts at "There are many recognized breeds" and is truncated not on a word boundary.
http://boston.lti.cs.cmu.edu/Services/treccast19/lemur.cgi?d=0&i=15744716
Jamunapari Goat. Jamunapari goat is a very beautiful dairy goat breed which was originated from India. This breed first introduced near a river of Uttar Pradesh named Jamuna. Since this breed is mostly known as Jamunapari goat. They are also known as some other names like Jamnapari, Ram Sagol etc.They become highly meat and milk productive and also very suitable for show. In India this goat breed is considered as the best dairy goat.They are one of the giant goat breeds with a pair of very long ear. The physical characteristics, feeding, housing, breeding and caring of Jamunapari goat are described below.hey are also known as some other names like Jamnapari, Ram Sagol etc. They become highly meat and milk productive and also very suitable for show. In India this goat breed is considered as the best dairy goat. They are one of the giant goat breeds with a pair of very long ear.
The duplicate section starts at ".hey are also known as some other names like Jamnapari, Ram Sagol etc..." Again, it does not start on a word boundary.
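A rough heuristic for spotting these partially duplicated passages (a sketch; the 40-character probe length is an arbitrary choice, not anything from the dataset tooling):

```python
# Rough heuristic sketch for partially duplicated passages: if the last 40
# characters of a passage occur more than once in it, the passage likely
# contains a truncated repeat of itself. The 40-character probe length is an
# arbitrary choice.
def looks_partially_duplicated(text, chunk=40):
    if len(text) < 2 * chunk:
        return False
    tail = text[-chunk:]
    return text.count(tail) >= 2

passage = ("Average physician assistant salary. The average salary is $87,749. "
           "verage physician assistant salary. The average salary is $87,749.")
clean = "This passage is perfectly fine and contains no duplicated text anywhere at all, promise."
```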
There's some problem with the tojson.py; it seems that it ignores the answers in the jsonl file.
The number of unique dev queries differs between collectionandqueries.tar.gz (101,093) and top1000.dev.tar.gz (6,980).
Could you please update collectionandqueries.tar.gz with the smaller version? Otherwise, 101k queries take too long to evaluate.
The same problem exists in the eval set.
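As a stopgap, the larger query file can be filtered down to the query IDs that actually occur in the top1000 file (a sketch; the tab-separated, qid-first layout of both files is my assumption):

```python
# Stopgap sketch: keep only the dev queries whose query ID appears in the
# top1000 file. The tab-separated, qid-first layout is my assumption.
def filter_queries(queries_lines, top1000_lines):
    wanted = {line.split("\t", 1)[0] for line in top1000_lines if line.strip()}
    return [line for line in queries_lines if line.split("\t", 1)[0] in wanted]

queries = ["1\twhat is a cat", "2\twhat is a dog", "3\twhat is a fox"]
top1000 = ["1\t10\twhat is a cat\tCats are small animals.",
           "3\t77\twhat is a fox\tFoxes are wild canids."]
kept = filter_queries(queries, top1000)
```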
Some adaptation of the downloaded code is needed; otherwise, it generates errors. (Alternatively, the output file could be changed, but I simply changed the downloaded code.)
Why is there a KeyError when I run predict on dev_v2.1.json?
Here is the traceback:
Augmenting with pre-trained embeddings...
Augmenting with random char embeddings...
Traceback (most recent call last):
  File "scripts/predict.py", line 189, in <module>
    main()
  File "scripts/predict.py", line 182, in main
    toks = ' '.join(id_to_token[tok] for tok in toks)
  File "scripts/predict.py", line 182, in <genexpr>
    toks = ' '.join(id_to_token[tok] for tok in toks)
KeyError: tensor(507946)
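A plausible explanation (my guess, not a confirmed fix): id_to_token is keyed by plain Python ints, while tok is a 0-dimensional tensor, and torch tensors hash by identity, so the dict lookup fails. Casting with int() before indexing would restore the lookup; FakeTensor below is a stand-in so the sketch runs without torch:

```python
# Sketch of the likely cause: id_to_token is keyed by plain Python ints while
# tok is a 0-d tensor, which hashes by identity, so the dict lookup raises
# KeyError. Casting with int() before indexing fixes the lookup. FakeTensor
# is a stand-in so this sketch runs without torch installed.
class FakeTensor:
    def __init__(self, value):
        self.value = value
    def __int__(self):
        return self.value

id_to_token = {507946: "albany", 11: "mn"}
toks = [FakeTensor(507946), FakeTensor(11)]
decoded = " ".join(id_to_token[int(tok)] for tok in toks)
```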
Hi,
I have two questions about the train/dev/test split of the ranking dataset. I noted that:

queries.train consists of 502,939 questions, of which all 502,939 have at least 1 answer in qrels.train.
queries.dev consists of 12,665 questions, of which only 6,986 have at least 1 answer in qrels.dev.
queries.eval consists of 12,560 questions.

Now, my questions are:
Thanks in advance!
It cannot even run the sample test cases the author provided.
In the relevant passages, I am a bit confused about what the column value in each row means.
Such as
1185869 0 0 1
1185868 0 16 1
597651 0 49 1
403613 0 60 1
1183785 0 389 1
312651 0 616 1
80385 0 723 1
645590 0 944 1
645337 0 1054 1
186154 0 1160 1
Given your explanation, column 0 is the queryID and column 2 is the passageID. What do column 1 and column 3 mean in this case?
Thanks
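For what it's worth, the rows look like standard TREC-style qrels, in which column 1 is an unused iteration field (always 0) and column 3 is the relevance label; assuming that convention, a minimal parse would be:

```python
# Minimal parse sketch, assuming the rows follow the standard TREC qrels
# convention: query ID, iteration (unused, always 0), document/passage ID,
# relevance label.
def parse_qrels_line(line):
    qid, _iteration, pid, relevance = line.split()
    return int(qid), int(pid), int(relevance)

qid, pid, rel = parse_qrels_line("1185869 0 0 1")
```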
I am inspecting the contents of fulldocuments.json. I notice that the JSON content has possible issues with the "document_text" field. In particular, the text seems somewhat incorrect (possibly due to missing line breaks?).
Examples:
https://answers.yahoo.com/question/index?qid=20071007114826AAwCFvR
example text:
show moreFollow 3 answersAnswersRelevanceRatingNewestOldestBest Answer
http://childparenting.about.com/od/physicalemotionalgrowth/tp/Child-Development-Your-Eight-Year-Old-Child.htm
.4 Cognitive DevelopmentTom Merton/Getty ImagesEight-year-old children are at a stage of intellectual development where they will be able to pay attention for longer periods of time.
Hi,
On your website you have training, dev, and evaluation sets for Q&A. What is the difference between the dev set and the evaluation set? And where can I find the test data for submitting results to the leaderboard?
Hey @dfcf93, I have some questions:
1. The collection is supposed to be utf-8, but I see incorrect encoding in collections.tsv, and for some passages in triples.train.small I was not able to find the passageId in collections.tsv.
2. In top1000.train there are NOT 1000 passages for each query, only 1-10 passages, while in top1000.dev most queries (but not all) do have 1000 passages.
3. In qrels.train there are 4 columns. From the documentation, column 0 is queryID and column 2 is passageID; what are columns 1 and 3? Which one is is_selected? How about the other? Here are examples from the documentation:
1185868 0 16 1
597651 0 49 1
403613 0 60 1
Hi,
Why is there no longer a link to the eval set for the reranking task?
I remember the link used to appear after Top 1000 Dev on the dataset page.
Thanks
Thank you for creating such a great dataset for passage re-ranking!
I'm wondering if it is possible to release the top 1000 passages retrieved for each training/dev/test query with the corresponding QID and PID? The current training data is constructed with the raw text of queries and passages, which is too large to use conveniently. Also, since the qrels files are already constructed with QID and PID, it would make life much easier if the train/dev/test data were also constructed with QID and PID.
The docs say that the 2.1 data is in JSONL format but the training data I downloaded from http://www.msmarco.org/dataset.aspx is not:
$ wc -l train_v2.1.json
0 train_v2.1.json
Similarly:
i = 0
with open('train_v2.1.json') as f:
    for l in f:
        i += 1
print(i)  # 1
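If the download is really one giant JSON object in column orientation rather than JSONL (my assumption about train_v2.1.json, based on the wc -l output above), it can be re-serialized one record per line:

```python
import json

# Sketch: re-serialize a single column-oriented JSON object
# ({"query": {"0": ..., "1": ...}, ...}) as JSONL, one record per line.
# The column-oriented layout is my assumption about train_v2.1.json.
def columns_to_jsonl(columns):
    keys = list(columns)
    for row_id in columns[keys[0]]:
        yield json.dumps({k: columns[k][row_id] for k in keys})

data = {"query": {"0": "albany mn population", "1": "what is a cat"},
        "answers": {"0": ["2,662"], "1": ["a small animal"]}}
lines = list(columns_to_jsonl(data))
```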
Hello, I can't understand the following sentences in the Readme:
Given a query q and passages P, if the reference answer is 'No Answer Present' and the system produces an answer, award a score of 0. If the reference answer is not 'No Answer Present' and the system produces the answer 'No Answer Present', award a score of 0. All other options gain a score of 1.
How is this score used in evaluation? Are there any examples?
Also, in the evaluation script ms_marco_eval.py, I found that it filters out queries with no reference answer. Is the answer [''] equivalent to the answer ['No Answer Present']?
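My reading of that paragraph, sketched as code (this is an interpretation, not the official ms_marco_eval.py logic):

```python
NO_ANSWER = "No Answer Present"

# Sketch of the no-answer rule as I read the quoted paragraph; this is my
# interpretation, not the official ms_marco_eval.py logic.
def no_answer_score(reference, system):
    if reference == NO_ANSWER and system != NO_ANSWER:
        return 0
    if reference != NO_ANSWER and system == NO_ANSWER:
        return 0
    return 1
```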
Section and sub-section titles in the README.MD file are not shown correctly, which makes the document hard to read. I see raw ### and #### markers instead of properly formatted titles.