openmatch / openmatch


An Open-Source Package for Information Retrieval

License: MIT License

Python 92.61% Shell 7.39%

information-retrieval open-domain-question-answering neural-search

openmatch's Introduction

OpenMatch v2

An all-in-one toolkit for information retrieval. Under active development.

Install

git clone https://github.com/OpenMatch/OpenMatch.git
cd OpenMatch
pip install -e .

-e installs the package in editable mode, i.e. changes you make to the code in your directory take effect without reinstalling.

Not all requirements are included in the package. You may need to install torch and tensorboard manually.

You may also need faiss for dense retrieval. You can install either faiss-cpu or faiss-gpu, according to your environment. Note that if you want to perform search on GPUs, you need to install a version of faiss-gpu that is compatible with your CUDA version. In some cases (usually CUDA >= 11.0) pip installs an incompatible version. If you encounter errors during search on GPUs, try installing faiss from conda instead.
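
Before launching a GPU search, it can help to confirm which faiss build is actually importable. The following is a minimal sketch, not part of OpenMatch: `describe_faiss` is a hypothetical helper, and `faiss.get_num_gpus` is looked up defensively on the assumption that a CPU-only build may not expose it.

```python
import importlib


def describe_faiss():
    """Return a short description of the installed faiss build, if any."""
    try:
        faiss = importlib.import_module("faiss")
    except ImportError:
        return "faiss is not installed"
    # Looked up defensively: a CPU-only build may not provide this function.
    get_num_gpus = getattr(faiss, "get_num_gpus", None)
    num_gpus = get_num_gpus() if callable(get_num_gpus) else 0
    version = getattr(faiss, "__version__", "unknown")
    if num_gpus > 0:
        return f"faiss {version}: {num_gpus} GPU(s) visible"
    return f"faiss {version}: no GPU visible, search will run on CPU"


print(describe_faiss())
```

If this reports no visible GPU on a machine that has one, the installed build is probably the CPU package or was built for a different CUDA version.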

Features

Human-friendly interface for dense retriever and re-ranker training and testing
Various PLMs supported (BERT, RoBERTa, T5...)
Native support for common IR & QA Datasets (MS MARCO, NQ, KILT, BEIR, ...)
Deep integration with Huggingface Transformers and Datasets
Efficient training and inference via stream-style data loading

Docs

We are actively working on the docs.

Project Organizers

Zhiyuan Liu
- Tsinghua University
- Homepage
Zhenghao Liu
- Northeastern University
- Homepage
Chenyan Xiong
- Microsoft Research AI
- Homepage
Maosong Sun
- Tsinghua University
- Homepage

Acknowledgments

Our implementation uses Tevatron as the starting point. We thank its authors for their contributions.

Contact

Please email [email protected].


openmatch's Issues

Update package requirements

transformers>=4.21.3
datasets>=2.10.1
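
To see whether a local environment satisfies these minimums, a quick check along the following lines may be useful. It is only a sketch: `version_tuple`, `meets_minimum` and `check_requirements` are hypothetical helpers, and the version parsing is deliberately simplified (it stops at the first non-numeric component, so pre-release tags are ignored). A dedicated version parser would be more robust.

```python
from importlib import metadata

# Minimum versions taken from the list above.
MINIMUMS = {"transformers": "4.21.3", "datasets": "2.10.1"}


def version_tuple(version):
    """Turn '4.27.1' into (4, 27, 1), stopping at the first non-numeric part."""
    parts = []
    for piece in version.split("."):
        if not piece.isdigit():
            break
        parts.append(int(piece))
    return tuple(parts)


def meets_minimum(installed, minimum):
    """Compare two version strings using the simplified numeric tuples."""
    return version_tuple(installed) >= version_tuple(minimum)


def check_requirements(minimums=MINIMUMS):
    """Report the installed version of each package and whether it is new enough."""
    report = {}
    for name, minimum in minimums.items():
        try:
            installed = metadata.version(name)
        except metadata.PackageNotFoundError:
            report[name] = "not installed"
            continue
        status = "ok" if meets_minimum(installed, minimum) else f"too old, need >={minimum}"
        report[name] = f"{installed} ({status})"
    return report


for name, line in check_requirements().items():
    print(f"{name}: {line}")
```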

bug fixes

HuggingFace version

I ran OpenMatch dense retrieval following docs/dr-msmarco-passage.md. After installing the dependencies as described in the README, evaluation on the dev set during training fails when the Hugging Face version is up to date (4.27.1).

Question about similarity functions in (T5-)ANCE.

In the ANCE paper, the ANCE model uses cosine similarity, whereas in this repo the T5-ANCE model seems to use dot-product similarity.

Have you tried using cosine similarity in T5-ANCE? Does it perform worse than dot similarity? Are there any criteria for determining which similarity function should be used in different models?
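
For readers comparing the two functions: cosine similarity is the dot product of L2-normalized vectors, so the two can rank documents differently when embedding norms vary. The toy sketch below uses made-up 2-d vectors and hypothetical helper names; it says nothing about which choice works better for T5-ANCE.

```python
import math


def dot(u, v):
    """Plain dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def l2_normalize(v):
    """Scale a vector to unit length."""
    norm = math.sqrt(dot(v, v))
    return [x / norm for x in v]


def cosine(u, v):
    """Cosine similarity: dot product after L2 normalization."""
    return dot(l2_normalize(u), l2_normalize(v))


query = [1.0, 0.0]
short_doc = [1.0, 0.0]  # same direction as the query, small norm
long_doc = [3.0, 4.0]   # different direction, large norm

# Dot product rewards the larger norm; cosine rewards the closer direction.
print("dot:   ", dot(query, short_doc), dot(query, long_doc))
print("cosine:", cosine(query, short_doc), cosine(query, long_doc))
```

Here the dot product ranks `long_doc` first while cosine ranks `short_doc` first, which is why the choice of similarity function has to match how the encoder was trained.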

reformat changes

Update .gitignore

Add general build_train script

'self.head' saving error when using monot5

If 'self.head' is None, saving the model fails, for example when using monot5.

has_answers Issue

The current has_answers function produces an error when it encounters special characters, such as "café".

We should refer to 'pyserini':

from pyserini.eval.evaluate_dpr_retrieval import has_answers
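
One plausible cause, stated here as an assumption since the issue does not say, is Unicode normalization: "café" can be encoded with a single precomposed "é" or with "e" plus a combining accent, and a naive substring match treats the two as different. The sketch below is not the pyserini implementation; `contains_answer` is a hypothetical helper that normalizes both sides to NFD before matching.

```python
import unicodedata


def normalize(text):
    """Decompose characters so equivalent encodings compare equal."""
    return unicodedata.normalize("NFD", text)


def contains_answer(passage, answer):
    """Hypothetical helper: case-insensitive substring match after NFD normalization."""
    return normalize(answer).lower() in normalize(passage).lower()


composed = "caf\u00e9"     # 'é' as a single code point
decomposed = "cafe\u0301"  # 'e' followed by a combining acute accent

passage = f"We met at the {decomposed} yesterday."

# The raw strings differ, so a naive match misses the answer.
print("naive match:     ", composed in passage)
print("normalized match:", contains_answer(passage, composed))
```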

elif self.pooling == "mean":
    # Note: this is the same tensor as items.attention_mask, so the
    # assignment below also modifies the batch's mask in place.
    soft_prompt_attention_mask = items.attention_mask
    # Zero out the soft-prompt positions so they are excluded from pooling.
    soft_prompt_attention_mask[:, : model.soft_prompt_token_number] = 0
    # Only pool hidden reps of real tokens.
    reps = mean_pooling(hidden, soft_prompt_attention_mask)
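
The masking idea in the fragment above can be tried outside the model. The following is a self-contained sketch using NumPy rather than torch; `masked_mean_pool` is a hypothetical stand-in, not OpenMatch's `mean_pooling`, and it copies the mask instead of editing it in place.

```python
import numpy as np


def masked_mean_pool(hidden, attention_mask, num_prompt_tokens):
    """Average token vectors, ignoring padding and the leading soft-prompt tokens."""
    # np.array copies, so the caller's mask is left untouched.
    mask = np.array(attention_mask, dtype=np.float64)
    mask[:, :num_prompt_tokens] = 0.0
    summed = (hidden * mask[:, :, None]).sum(axis=1)
    # Guard against division by zero when every position is masked.
    counts = np.clip(mask.sum(axis=1, keepdims=True), 1e-9, None)
    return summed / counts


# One sequence of four tokens: soft prompt, two real tokens, padding.
hidden = np.array([[[1.0, 1.0], [3.0, 5.0], [5.0, 7.0], [9.0, 9.0]]])
attention_mask = np.array([[1, 1, 1, 0]])

pooled = masked_mean_pool(hidden, attention_mask, num_prompt_tokens=1)
print(pooled)
```

Only the two real tokens contribute to the average, so neither the soft-prompt vector nor the padding vector affects the pooled representation.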

manually set gpu
remove unnecessary del statements