openmatch / coco-dr

[EMNLP 2022] This is the code repo for our EMNLP'22 paper "COCO-DR: Combating Distribution Shifts in Zero-Shot Dense Retrieval with Contrastive and Distributionally Robust Learning".

Home Page: https://arxiv.org/abs/2210.15212

License: MIT License

Python 98.30% Shell 1.70%
bert dense-retrieval nlp transformer zero-shot zero-shot-retrieval contrastive-learning distributionally information-retrieval pretrained-language-model

coco-dr's Introduction

COCO-DR


This repo provides the code for reproducing the experiments in paper COCO-DR: Combating Distribution Shifts in Zero-Shot Dense Retrieval with Contrastive and Distributionally Robust Learning (EMNLP 2022 Main Conference).

COCO-DR is a domain adaptation method for training zero-shot dense retrievers. It combines continuous contrastive learning (COCO) with implicit distributionally robust optimization (iDRO), and achieves significant improvements over other zero-shot models without relying on billion-scale models, seq2seq models, or cross-encoder distillation.

Quick Links

BEIR Performance

| Model | BM25 | DocT5query | GTR | CPT-text | GPL | COCO-DR Base | COCO-DR Large |
|---|---|---|---|---|---|---|---|
| # of Parameters | --- | --- | 4.8B | 178B | 66M × 18 | 110M | 335M |
| Avg. on BEIR (CPT subset) | 0.484 | 0.495 | 0.516 | 0.528 | 0.516 | 0.520 | 0.540 |
| Avg. on BEIR | 0.428 | 0.440 | 0.458 | --- | 0.459 | 0.461 | 0.484 |

Note:

  • GPL trains a separate model for each task and uses cross-encoders for distillation.
  • CPT-text is evaluated on only 11 selected subsets of the BEIR benchmark.

Experiment Setup

Environment

  • We use this docker image for all our experiments: mmdog/pytorch:pytorch1.9.0-nccl2.9.9-cuda11.3.
  • For additional packages, please run the install commands provided in the corresponding folders.

Datasets

We use the BEIR corpora for the COCO step and the MS MARCO dataset for the iDRO step. The procedures for obtaining the datasets are described below.

MS Marco

  • We use the dataset from the same source as the ANCE paper. The commands for downloading the MS MARCO dataset can be found in commands/data_download.sh.

BEIR

  • We use the datasets released by the original BEIR repo. They can be downloaded at this link (a programmatic download sketch follows this list).
  • Note that, due to copyright restrictions, some BEIR datasets are not publicly available.
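For convenience, below is a minimal sketch of downloading a BEIR dataset programmatically, assuming the third-party beir package is installed; the dataset name ("scifact") and the target directory are placeholders:

from beir import util
from beir.datasets.data_loader import GenericDataLoader

# Hypothetical example: download and unzip one BEIR dataset into a local folder.
dataset = "scifact"
url = "https://public.ukp.informatik.tu-darmstadt.de/thakur/BEIR/datasets/{}.zip".format(dataset)
data_path = util.download_and_unzip(url, "datasets/beir/")

# Load the corpus, queries, and relevance judgments for the test split.
corpus, queries, qrels = GenericDataLoader(data_folder=data_path).load(split="test")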

Experiments

To run the experiments, follow the steps below:

COCO Pretraining

The code for reproducing COCO pretraining is in the COCO folder. Please check out COCO/README.md for detailed instructions. Note that we start COCO pretraining from the condenser checkpoint. We release the condenser checkpoint with BERT-Large as the backbone at this link.
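For the BERT-Large backbone, a condenser checkpoint is also listed as OpenMatch/condenser-large in the checkpoint tables below; assuming that Hugging Face model repo is the checkpoint referred to above, it can be loaded with the transformers library as the starting point for COCO pretraining:

from transformers import AutoModel, AutoTokenizer

# Assumption: OpenMatch/condenser-large is the released BERT-Large condenser checkpoint.
model = AutoModel.from_pretrained("OpenMatch/condenser-large")
tokenizer = AutoTokenizer.from_pretrained("OpenMatch/condenser-large")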

Finetuning with iDRO

  • BM25 Warmup
    • The code for BM25 warmup is in the warmup folder.
  • Training with global hard negative (ANCE):
    • The code for ANCE fine-tuning is in the ANCE folder.

Evaluation on BEIR

The code for evaluation on BEIR is in the evaluate folder.
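The scripts in the evaluate folder are the reference implementation. As a rough cross-check, below is a minimal sketch that evaluates a released checkpoint with the third-party beir package. The wrapper class, the title + text concatenation, and the max_length value are assumptions made for illustration; the essential detail (see Usage below) is that COCO-DR uses the [CLS] embedding of the final layer together with dot-product scoring.

import torch
from transformers import AutoModel, AutoTokenizer
from beir.datasets.data_loader import GenericDataLoader
from beir.retrieval.evaluation import EvaluateRetrieval
from beir.retrieval.search.dense import DenseRetrievalExactSearch as DRES

class CocoDRCLSEncoder:
    # Hypothetical BEIR-compatible wrapper that pools the [CLS] token, matching the Usage section.
    def __init__(self, model_name="OpenMatch/cocodr-base-msmarco", max_length=512):
        self.device = "cuda" if torch.cuda.is_available() else "cpu"
        self.tokenizer = AutoTokenizer.from_pretrained(model_name)
        self.model = AutoModel.from_pretrained(model_name).to(self.device).eval()
        self.max_length = max_length

    @torch.no_grad()
    def _encode(self, texts, batch_size):
        chunks = []
        for i in range(0, len(texts), batch_size):
            batch = self.tokenizer(texts[i:i + batch_size], padding=True, truncation=True,
                                   max_length=self.max_length, return_tensors="pt").to(self.device)
            # [CLS] embedding from the final layer, as in the Usage section.
            chunks.append(self.model(**batch).last_hidden_state[:, 0].cpu())
        return torch.cat(chunks, dim=0).numpy()

    def encode_queries(self, queries, batch_size=16, **kwargs):
        return self._encode(queries, batch_size)

    def encode_corpus(self, corpus, batch_size=16, **kwargs):
        texts = [(doc.get("title", "") + " " + doc["text"]).strip() for doc in corpus]
        return self._encode(texts, batch_size)

# Load a BEIR dataset that has already been downloaded (see the Datasets section above).
corpus, queries, qrels = GenericDataLoader(data_folder="datasets/beir/scifact").load(split="test")

retriever = EvaluateRetrieval(DRES(CocoDRCLSEncoder(), batch_size=16), score_function="dot")
results = retriever.retrieve(corpus, queries)
ndcg, _map, recall, precision = retriever.evaluate(qrels, results, retriever.k_values)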

Checkpoints

Main Experiments

We release the following checkpoints for both COCO-DR Base and COCO-DR Large to facilitate future studies:

  • Pretrained model after COCO step w/o finetuning on MS MARCO.
  • Pretrained model after iDRO step.
  • Pretrained model after iDRO step (but w/o COCO). Note: this model is trained without any BEIR task information.
| Model Name | Avg. on BEIR | Link |
|---|---|---|
| COCO-DR Base | 0.461 | OpenMatch/cocodr-base-msmarco |
| COCO-DR Base (w/o COCO) | 0.447 | OpenMatch/cocodr-base-msmarco-idro-only |
| COCO-DR Base (w/ BM25 Warmup) | 0.435 | OpenMatch/cocodr-base-msmarco-warmup |
| COCO-DR Base (w/o Finetuning on MS MARCO) | 0.288 | OpenMatch/cocodr-base |
| COCO-DR Large | 0.484 | OpenMatch/cocodr-large-msmarco |
| COCO-DR Large (w/o COCO) | 0.462 | OpenMatch/cocodr-large-msmarco-idro-only |
| COCO-DR Large (w/ BM25 Warmup) | 0.456 | OpenMatch/cocodr-large-msmarco-warmup |
| COCO-DR Large (w/o Finetuning on MS MARCO) | 0.316 | OpenMatch/cocodr-large |

Note: We found a mismatch between the version of the HotpotQA dataset we used and the HotpotQA dataset used in BEIR. We have rerun the evaluation and updated the HotpotQA numbers using the latest version in BEIR.

Other Models

In addition, to ensure reproducibility (especially for BERT-Large), we also provide checkpoints for several important baselines that we re-implemented.

| Model Name | Link |
|---|---|
| Condenser Large (w/o Finetuning on MS MARCO) | OpenMatch/condenser-large |
| coCondenser Large (w/o Finetuning on MS MARCO) | OpenMatch/co-condenser-large |
| coCondenser Large (Fine-tuned on MS MARCO) | OpenMatch/co-condenser-large-msmarco |

Usage

Pre-trained models can be loaded through the HuggingFace transformers library:

from transformers import AutoModel, AutoTokenizer

model = AutoModel.from_pretrained("OpenMatch/cocodr-base-msmarco") 
tokenizer = AutoTokenizer.from_pretrained("OpenMatch/cocodr-base-msmarco") 

Embeddings for different sentences can then be obtained as follows:

sentences = [
    "Where was Marie Curie born?",
    "Maria Sklodowska, later known as Marie Curie, was born on November 7, 1867.",
    "Born in Paris on 15 May 1859, Pierre Curie was the son of Eugène Curie, a doctor of French Catholic origin from Alsace."
]

inputs = tokenizer(sentences, padding=True, truncation=True, return_tensors="pt")
embeddings = model(**inputs, output_hidden_states=True, return_dict=True).hidden_states[-1][:, :1].squeeze(1) # the embedding of the [CLS] token after the final layer

Then similarity scores between the different sentences are obtained with a dot product between the embeddings:

score01 = embeddings[0] @ embeddings[1] # 216.9792
score02 = embeddings[0] @ embeddings[2] # 216.6684
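Note that the query (sentence 0) scores higher against the passage describing Marie Curie's birth (score01) than against the passage about Pierre Curie (score02), although the margin between the raw dot-product scores is small.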

Bugs or questions?

If you have any questions related to the code or the paper, feel free to email Yue Yu (yueyu at gatech dot edu) or open an issue. Please describe the problem in detail so we can help you better and faster!

Citation

If you find this repository helpful, feel free to cite our publication COCO-DR: Combating Distribution Shifts in Zero-Shot Dense Retrieval with Contrastive and Distributionally Robust Learning.

@inproceedings{yu2022cocodr,
  title={COCO-DR: Combating Distribution Shifts in Zero-Shot Dense Retrieval with Contrastive and Distributionally Robust Learning},
  author={Yue Yu and Chenyan Xiong and Si Sun and Chao Zhang and Arnold Overwijk},
  booktitle={Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing},
  pages={1462--1479},
  year={2022}
}

Acknowledgement

We would like to thank the authors from ANCE and Condenser for their open-source efforts.

coco-dr's People

Contributors

yueyu1030


coco-dr's Issues

Some questions about the 'Pre-processing' part

Hello, thanks for your interesting work!

I'm trying to reproduce COCO pre-training and I noticed that I need to preprocess the dataset.
This is mentioned in ./COCO-DR/COCO/README.md.

But when I follow the instructions, something goes wrong in pre_processing_coco.sh.
It calls COCO-DR/COCO/helper/create_train_co_short.py, which contains a function called encode_one().

In lines 35 and 36, item is a dict, but it has no 'group' or 'spans' keys, which raises KeyError: 'group'.

I noticed that there are only four keys in each line of the dataset: 'id', 'title', 'text', 'metadata'.
Did I miss some steps before preprocessing?
I'm eagerly looking forward to your reply! Thanks a lot!

Best regards!

Cannot find commands folder for ANCE

Hello,

I find this project very interesting.

I cannot find the commands folder that should contain commands/install.sh and commands/run_ance.sh.

Could you please upload these files?

Best regards,
Markus

Code for inference/evaluation

Thank you for the interesting work!

I am wondering whether the code for inference/evaluation on the BEIR datasets is available or planned to be shared.

I tried using the open-source BEIR evaluation code as follows.
I evaluated COCO-DR (Base) on the SciFact dataset.

from beir import util
from beir.datasets.data_loader import GenericDataLoader
from beir.retrieval import models
from beir.retrieval.evaluation import EvaluateRetrieval
from beir.retrieval.search.dense import DenseRetrievalExactSearch as DRES

#### Download scifact.zip dataset and unzip the dataset
dataset = "scifact"
url = "https://public.ukp.informatik.tu-darmstadt.de/thakur/BEIR/datasets/{}.zip".format(dataset)
dataset_dir = 'experiments/datasets/beir/'
data_path = util.download_and_unzip(url, dataset_dir)
#### Provide the data_path where scifact has been downloaded and unzipped
corpus, queries, qrels = GenericDataLoader(data_folder=data_path).load(split="test")

#### Load the SBERT model and retrieve using cosine-similarity
model = DRES(models.SentenceBERT("OpenMatch/cocodr-base-msmarco"), batch_size=16)
retriever = EvaluateRetrieval(model, score_function="dot")
results = retriever.retrieve(corpus, queries)

#### Evaluate your model with NDCG@k, MAP@K, Recall@K and Precision@K  where k = [1,3,5,10,100,1000] 
ndcg, _map, recall, precision = retriever.evaluate(qrels, results, retriever.k_values)

But I got lower performance than reported in the paper: 0.254 nDCG@10 (the reported number is 0.709).

2022-11-08 11:37:22 - NDCG@1: 0.1467
2022-11-08 11:37:22 - NDCG@3: 0.2189
2022-11-08 11:37:22 - NDCG@5: 0.2352
2022-11-08 11:37:22 - NDCG@10: 0.2535
2022-11-08 11:37:22 - NDCG@100: 0.3066
2022-11-08 11:37:22 - NDCG@1000: 0.3415

Best regards,
Jihyuk

Questions about the results in Table 7 in the paper

Hi, I am very interested in your work and have some small questions.
As the results in Table 7 of the paper are not explained in much detail, I would like to know: is the metric MRR?
Also, did you evaluate on the CodeSearchNet evaluation dataset with 99 queries?
By the way, is there any way to reproduce the results in Table 7 with the code in this repo?

Thanks a lot!
