microsoft / msmarco-conversational-search

Truly Conversational Search is the next logical step in the journey to generate intelligent and useful AI. To understand what this may mean, researchers have voiced a continuous desire to study how people currently converse with search engines. Traditionally, efforts to produce such a comprehensive dataset have been limited because those who hold this data (search engines) have a responsibility to their users to maintain their privacy and cannot share the data publicly in a way that upholds the trust users place in them. Given these two powerful forces, we believe we have a dataset and paradigm that meets both sets of needs: an artificial public dataset that approximates the true data, and an ability to evaluate model performance on real user behavior. Concretely, we released a public dataset generated by creating artificial sessions using embedding similarity, and we will test on the original data. To say this again: we are not releasing any private user data but are releasing what we believe to be a good representation of true user interactions.
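To make the idea of "artificial sessions using embedding similarity" concrete, here is a minimal sketch. It is not the repository's actual generation code: the greedy chaining strategy, the `build_session` function, the `threshold` and `max_len` parameters, and the toy vectors are all assumptions made for illustration.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def build_session(seed, embeddings, max_len=5, threshold=0.8):
    """Greedily chain queries: repeatedly append the unused query most similar
    to the last one, stopping when nothing reaches `threshold`.

    `embeddings` maps query text -> vector. Hypothetical helper, for illustration only.
    """
    session = [seed]
    used = {seed}
    current = seed
    while len(session) < max_len:
        best, best_sim = None, threshold
        for query, vec in embeddings.items():
            if query in used:
                continue
            sim = cosine(embeddings[current], vec)
            if sim >= best_sim:
                best, best_sim = query, sim
        if best is None:
            break  # no remaining query is similar enough
        session.append(best)
        used.add(best)
        current = best
    return session


# Toy 2-d embeddings, invented for this example
toy = {
    "weather in paris": [1.0, 0.0],
    "paris weather tomorrow": [0.9, 0.1],
    "how to bake bread": [0.0, 1.0],
}
session = build_session("weather in paris", toy)
```

With these toy vectors the two Paris queries end up in the same session, while the unrelated bread query falls below the threshold and is left out.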

Home Page: https://microsoft.github.io/MSMARCO-Conversational-Search/

License: MIT License

Shell 6.62% Python 93.38%

msmarco-conversational-search's Issues

No root downloader

Hi, is it possible to provide a download that doesn't require sudo permissions?

Some queries in sessions don't appear in MS-Marco train

Hi, I want to retrieve the query ID for each query in each session.
I tried an exact-matching algorithm, and it fails to retrieve 9M queries, which corresponds to 16k unique queries and 8M sessions.

I looked into the queries that didn't match. Some were truncated and are therefore fairly easy to recover; the rest of those I checked were not present at all. I searched for the missing queries case-insensitively, so letter case is not the problem.

Is this a bug or a feature?

Here is code similar to what I used:

from collections import defaultdict

sessions_path = ""  # sessions TSV: session id, then one query per column
queries_path = ""   # queries TSV: qid, then query text

q_id = defaultdict(lambda: -1)  # tokenized query -> qid; -1 means "not found"
with open(queries_path, "r") as f:
    for line in f:
        qid, query = line.strip().split("\t")
        words = tuple(query.split(" "))
        q_id[words] = qid

sess_ids = {}  # session id -> list of qids (-1 for unmatched queries)
nb_fails, sess_fails = 0, 0
with open(sessions_path, "r") as f, open("output", "w") as wri:
    for line in f:
        sess_id, *queries = line.strip().split("\t")
        qids = []
        sess_fail = False
        for query in queries:
            q = tuple(query.split(" "))
            qid = q_id[q]
            qids.append(qid)
            if qid == -1:
                # Log the unmatched query and count the failure
                wri.write(f"{q} \n")
                nb_fails += 1
                sess_fail = True
        sess_ids[sess_id] = qids
        if sess_fail:
            # Count each session that has at least one unmatched query
            sess_fails += 1
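For the truncated queries, one possible approach is to fall back to a prefix match when the exact match fails. The sketch below is only an illustration under my own assumptions: the `QueryIndex` class, the `min_prefix_len` cutoff, and the rule of accepting a prefix match only when it is unique are all hypothetical choices, not something the dataset or its scripts prescribe.

```python
import bisect


def normalize(query):
    """Lowercase the query and collapse runs of whitespace."""
    return " ".join(query.lower().split())


class QueryIndex:
    """Exact lookup plus a unique-prefix fallback for truncated queries."""

    def __init__(self, queries):
        # `queries` is an iterable of (qid, text) pairs
        self.exact = {normalize(text): qid for qid, text in queries}
        self.sorted_keys = sorted(self.exact)

    def match(self, query, min_prefix_len=10):
        """Return (qid, how), where how is 'exact', 'prefix', or 'missing'."""
        key = normalize(query)
        if key in self.exact:
            return self.exact[key], "exact"
        if len(key) >= min_prefix_len:
            # Keys sharing the prefix are contiguous in sorted order,
            # starting at the insertion point of the prefix itself.
            i = bisect.bisect_left(self.sorted_keys, key)
            hits = []
            while (i < len(self.sorted_keys)
                   and self.sorted_keys[i].startswith(key)
                   and len(hits) < 2):
                hits.append(self.sorted_keys[i])
                i += 1
            if len(hits) == 1:  # accept only an unambiguous prefix match
                return self.exact[hits[0]], "prefix"
        return None, "missing"


# Made-up queries, for illustration only
index = QueryIndex([
    ("1", "What is the capital of France"),
    ("2", "how to bake bread at home"),
    ("3", "how to bake bread rolls"),
])
```

Queries reported as 'missing' by this kind of lookup are the ones that would still need an explanation; the 'prefix' hits are only candidates and should be spot-checked, since a truncated query can be a prefix of an unrelated longer one.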

Scripts provided don't work

Hi, I tried to use the provided scripts to download and generate the data, but the README doesn't match the Python files.

README:

python generateQuerySets.py <msmarco train queries> <msmarco dev queries> <msmarco eval queries> <quoraQueries> <NQFolder>

Python file:

python generateQuerysets.py <msmarco train queries> <msmarco dev queries> <msmarco eval queries> <quoraQueries> <NQFolder> <unused msmarco> <sessions>

I assumed I had to use the "generateArtificialSessions.py" file, but it needs embeddings that are only available once the first script has been run. There seems to be a circular dependency.
