univl-dr's Introduction

OpenMatch v2

An all-in-one toolkit for information retrieval. Under active development.

Install

git clone https://github.com/OpenMatch/OpenMatch.git
cd OpenMatch
pip install -e .

The -e flag installs the package in editable mode, so changes you make to the code in your directory take effect without reinstalling.

The package does not declare all of its requirements. You may need to install torch and tensorboard manually.

You may also need faiss for dense retrieval. Install either faiss-cpu or faiss-gpu, depending on your environment. To search on GPUs, you need a faiss-gpu version that is compatible with your CUDA version. In some cases (usually CUDA >= 11.0), pip installs the wrong version. If you encounter errors during GPU search, try installing faiss from conda.
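A quick way to see which of these optional dependencies are still missing is sketched below. It uses only the standard library; the helper name `missing_packages` is made up for this example and is not part of OpenMatch. It checks only whether a module can be found, not whether the installed faiss build matches your CUDA version.

```python
from importlib.util import find_spec


def missing_packages(names):
    """Return the top-level module names that cannot be found for import."""
    return [name for name in names if find_spec(name) is None]


if __name__ == "__main__":
    # "faiss" is the import name for both faiss-cpu and faiss-gpu.
    missing = missing_packages(["torch", "tensorboard", "faiss"])
    print("Missing:", ", ".join(missing) if missing else "none")
```

Any name reported as missing has to be installed separately with pip or conda.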

Features

  • Human-friendly interface for dense retriever and re-ranker training and testing
  • Various PLMs supported (BERT, RoBERTa, T5...)
  • Native support for common IR & QA Datasets (MS MARCO, NQ, KILT, BEIR, ...)
  • Deep integration with Huggingface Transformers and Datasets
  • Efficient training and inference via stream-style data loading

Docs

We are actively working on the docs.

Project Organizers

  • Zhiyuan Liu
  • Zhenghao Liu
  • Chenyan Xiong
  • Maosong Sun

Acknowledgments

Our implementation uses Tevatron as the starting point. We thank its authors for their contributions.

Contact

Please email [email protected].

univl-dr's People

Contributors

edwardzh

univl-dr's Issues

How to handle different snippet_id but same fact and url for text documents

Hi, I'm a student trying to reproduce your research.

I looked at the dataset and found text documents that have different snippet_ids but exactly the same fact and Wikipedia URL.

For example, in WebQA_train_val.json:

{
    "title": "2008 Summer Olympics",
    "fact": "The theme song of the 2008 Summer Olympics was \"You and Me,\" which was composed by Chen Qigang, the musical director of the opening ceremony.",
    "url": "https://en.wikipedia.org/wiki/2008_Summer_Olympics",
    "snippet_id": "d5bbd0e20dba11ecb1e81171463288e9_7"
}
{
    "title": "2008 Summer Olympics",
    "fact": "The theme song of the 2008 Summer Olympics was \"You and Me,\" which was composed by Chen Qigang, the musical director of the opening ceremony.",
    "url": "https://en.wikipedia.org/wiki/2008_Summer_Olympics",
    "snippet_id": "d5bbd13c0dba11ecb1e81171463288e9_8"
}
{
    "title": "2008 Summer Olympics",
    "fact": "The theme song of the 2008 Summer Olympics was \"You and Me,\" which was composed by Chen Qigang, the musical director of the opening ceremony.",
    "url": "https://en.wikipedia.org/wiki/2008_Summer_Olympics",
    "snippet_id": "d5bcc8440dba11ecb1e81171463288e9_14"
}

All three examples have different snippet_ids,
but the same fact: "The theme song of the 2008 Summer Olympics was "You and Me," which was composed by Chen Qigang, the musical director of the opening ceremony.",
and the same url: "https://en.wikipedia.org/wiki/2008_Summer_Olympics".

It seems to me that different facts should be given different snippet_ids to evaluate accurate search performance.
It looks like you collected all the text and image documents in the train_val.json and test.json files, extracted the embeddings, then trained the model, and I'm curious how you solved it in your study.

If there is something I am missing or misunderstanding, I would appreciate it if you could let me know.
I haven't figured out if there are more examples like this, but I'd like to correct my misconceptions first.
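To check whether there are more examples like this, a minimal sketch is shown below. It assumes the text facts have already been flattened into a list of dicts with the `fact`, `url`, and `snippet_id` fields shown above; the actual nesting inside WebQA_train_val.json may differ, and `find_duplicate_facts` is a hypothetical helper, not code from the repository. The snippet_ids in the usage example are shortened placeholders.

```python
from collections import defaultdict


def find_duplicate_facts(snippets):
    """Group snippets by (fact, url) and keep groups with more than one distinct snippet_id."""
    groups = defaultdict(list)
    for snippet in snippets:
        key = (snippet["fact"], snippet["url"])
        groups[key].append(snippet["snippet_id"])
    return {key: ids for key, ids in groups.items() if len(set(ids)) > 1}


if __name__ == "__main__":
    # Toy data with shortened, made-up ids.
    sample = [
        {"fact": "A", "url": "u1", "snippet_id": "x_7"},
        {"fact": "A", "url": "u1", "snippet_id": "x_8"},
        {"fact": "B", "url": "u2", "snippet_id": "y_1"},
    ]
    for (fact, url), ids in find_duplicate_facts(sample).items():
        print(url, ids)
```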

Thank you for your help.

Question about sum(caption embeddings, image embeddings)

First of all, thank you for the code and paper~
I wonder why caption embeddings and image embeddings can be added directly, element by element? I know that they are aligned by minimizing the contrastive loss in CLIP.
Thanks again!
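One observation that may help frame the question, offered as a sketch and not as the authors' answer: if the caption and image embeddings have the same dimensionality and relevance is scored with an inner product (an assumption here), then scoring a query against the summed vector is the same as summing the two separate scores. The vectors below are made-up toy values.

```python
def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def add(u, v):
    """Element-wise sum of two equal-length vectors."""
    return [a + b for a, b in zip(u, v)]


# Toy 3-dimensional embeddings (illustrative values only).
query = [0.5, -1.0, 2.0]
caption_emb = [1.0, 0.0, 0.5]
image_emb = [0.0, 2.0, 1.0]

score_of_sum = dot(query, add(caption_emb, image_emb))
sum_of_scores = dot(query, caption_emb) + dot(query, image_emb)

# The inner product is linear, so both quantities agree.
assert abs(score_of_sum - sum_of_scores) < 1e-9
```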

Image Verbalization for Expansion

Hello, I'm a student reproducing the paper.
I have a question while reading the paper and looking at the code.
First of all, thank you for sharing the code.

Q1. In Section 4.3 of the paper (IMAGE VERBALIZATION FOR EXPANSION), we need to obtain the image verbalization results $V(I_j)$ to expand the raw captions before passing them through the text encoder.

$C^*_j = C_j; \text{[SEP]}; V(I_j)$  (8)

However, I could not find a part that generates a potentially matching caption or related queries that corresponds to the image verbalization result.
I would appreciate it if you could tell me which part of the code corresponds to Image Verbalization.
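For reference, the expansion step in Eq. (8) itself amounts to a string concatenation, sketched below. This is only an illustration of the formula: `expand_caption` is a hypothetical name, the literal `[SEP]` separator string is an assumption about tokenization, and the sketch does not cover how $V(I_j)$ is generated, which is the part being asked about.

```python
def expand_caption(caption, verbalization, sep="[SEP]"):
    """Build C*_j = C_j ; [SEP] ; V(I_j) as a single string for the text encoder."""
    return f"{caption} {sep} {verbalization}"


if __name__ == "__main__":
    # Both strings are made-up examples.
    print(expand_caption("Opening ceremony fireworks", "fireworks over a stadium at night"))
```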

Once again, thank you for releasing the code. I look forward to your reply.

Question about the `data` folder

I'm confused about the data folder. I had regarded it as the dataset, but the real dataset seems to be the WebQA dataset. So I want to figure out what the files in the data folder are actually used for. Thank you for your answer.

execute BM25 & CLIP-DPR

To reproduce the experimental results of BM25 & CLIP-DPR, which parts of the code need to be executed?
