This repository accompanies our paper, "Unsupervised Domain Adaptation for Sparse Retrieval by Filling Vocabulary and Word Frequency Gaps", which was accepted at AACL-IJCNLP 2022.
$ pip install ./
$ pip install -r requirements.txt
- Get the BEIR datasets and MS MARCO from the BEIR repository.
- BioASQ must be downloaded from its original site.
- We provide a script, prepare_bioask.py, for preparing it.
- Get the domain corpus.
- Process the corpus. We provide the following scripts:
- proc_pubmed.py
- proc_s2orc.py
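
For the publicly hosted BEIR datasets, the download step can be sketched in shell as below. This is a minimal sketch, not part of this repository: the helper names `beir_url` and `download_beir` are hypothetical, the URL pattern is the one published in the BEIR README and may change, and `scifact` is only an example dataset name. BioASQ is not covered, since it must be obtained from its original site.

```shell
# Assumption: base URL as published in the BEIR README; verify before use.
BEIR_BASE_URL="https://public.ukp.informatik.tu-darmstadt.de/thakur/BEIR/datasets"

# Build the download URL for one BEIR dataset name (hypothetical helper).
beir_url() {
  printf '%s/%s.zip\n' "$BEIR_BASE_URL" "$1"
}

# Download and unpack one dataset into the given directory (hypothetical helper).
# Requires curl and unzip.
download_beir() {
  dataset="$1"
  out_dir="$2"
  mkdir -p "$out_dir" &&
    curl -fL -o "$out_dir/$dataset.zip" "$(beir_url "$dataset")" &&
    unzip -q -o "$out_dir/$dataset.zip" -d "$out_dir"
}

# Example call (placeholder path, not executed here):
#   download_beir scifact /path/to/beir_data
```

The resulting directory can then be used wherever a `<beir_data_dir>` argument is expected.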
$ cd adalm_scripts/run
$ bash add_new_vocab_all.sh
$ bash init_add_vocab_model_all.sh
$ bash run_mlm.sh <model_path> <path_to_outdir> <path_to_corpus_file>
- Please edit the paths in the bash files to match your environment.
$ cd /path/to/this/repo
$ cd training/run
$ bash train_splade_distil.sh <model_path> <output_dir>
- Please edit the paths in the bash files to match your environment.
- The same directory contains bash files for training dense retrieval models and GPL. If you want to experiment with them, please check those files.
$ cd /path/to/this/repo
$ cd search/run
$ bash search_splade.sh <beir_data_dir> <result_out_dir> <model_path>
- To remove the IDF weighting, change the mode value from idf to org in the bash file.
- To run SPLADE-doc, set the mode value to d-idf or d-org in the bash file.
- The same directory contains bash files for search with dense retrieval and BM25. If you want to experiment with them, please check those files.
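
The SPLADE training and search steps can be chained in a small wrapper, sketched below. This is an illustrative sketch under assumptions rather than a script shipped with the repository: every path is a hypothetical placeholder, the helpers `run_in` and `run_pipeline` are hypothetical, and it assumes that the training output directory can be passed directly as the `<model_path>` for search. By default it only prints the commands; set `DRY_RUN=0` to execute them.

```shell
#!/usr/bin/env bash

# Hypothetical placeholder paths; replace them with your own.
REPO_DIR="${REPO_DIR:-/path/to/this/repo}"
ADAPTED_MODEL="${ADAPTED_MODEL:-/path/to/adapted_model}"
SPLADE_OUT="${SPLADE_OUT:-/path/to/splade_output}"
BEIR_DATA_DIR="${BEIR_DATA_DIR:-/path/to/beir_data}"
RESULT_DIR="${RESULT_DIR:-/path/to/results}"

# Print a command together with its working directory.
# The command is executed only when DRY_RUN=0.
run_in() {
  dir="$1"
  shift
  echo "(cd $dir && $*)"
  if [ "${DRY_RUN:-1}" = "0" ]; then
    (cd "$dir" && "$@")
  fi
}

# Train SPLADE, then search with the trained model.
# Search is skipped if training fails.
run_pipeline() {
  run_in "$REPO_DIR/training/run" \
    bash train_splade_distil.sh "$ADAPTED_MODEL" "$SPLADE_OUT" &&
  run_in "$REPO_DIR/search/run" \
    bash search_splade.sh "$BEIR_DATA_DIR" "$RESULT_DIR" "$SPLADE_OUT"
}

run_pipeline
```

The paths inside the individual bash files still need to be edited as described above; the wrapper only fixes the order of the steps and the arguments passed to them.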