princeton-nlp / densephrases

[ACL 2021] Learning Dense Representations of Phrases at Scale; EMNLP'2021: Phrase Retrieval Learns Passage Retrieval, Too https://arxiv.org/abs/2012.12624

Home Page: https://arxiv.org/abs/2012.12624

License: Apache License 2.0

Makefile 3.89% Shell 0.61% Python 92.77% CSS 0.17% HTML 2.56%

nlp open-domain-qa slot-filling knowledge-base information-retrieval passage-retrieval

densephrases's Introduction

DensePhrases

DensePhrases is a text retrieval model that can return phrases, sentences, passages, or documents for your natural language inputs. Using billions of dense phrase vectors from the entire Wikipedia, DensePhrases searches for phrase-level answers to your questions in real time or retrieves passages for downstream tasks.

Please see our ACL paper (Learning Dense Representations of Phrases at Scale) for details on how to learn dense representations of phrases and the EMNLP paper (Phrase Retrieval Learns Passage Retrieval, Too) on how to perform multi-granularity retrieval.

***** Try out our online demo of DensePhrases here! *****

Updates

[Jan 18, 2022] DensePhrases v1.1.0 released for transformers==4.13.0 (see notes).
[Nov 22, 2021] Test prediction files of densephrases-multi-query-* added.
[Oct 10, 2021] See our blog post on phrase retrieval to learn more about phrase retrieval!
[Sep 23, 2021] More examples on entity linking, knowledge-grounded dialogue, and slot filling.
[Sep 20, 2021] Pre-trained models are also available on the Huggingface model hub.
[Sep 17, 2021] Check out updates on multi-granularity retrieval, smaller phrase indexes (20~60GB), and more examples!
[Sep 17, 2021] Our new EMNLP paper on phrase-based passage retrieval is out!
[June 14, 2021] Major code updates

Getting Started

After installing DensePhrases and downloading a phrase index, you can easily retrieve phrases, sentences, paragraphs, or documents for your query.

See here for more examples such as using CPU-only mode, creating a custom index, and more.

You can also use DensePhrases to retrieve relevant documents for a dialogue or run entity linking over given texts.

>>> from densephrases import DensePhrases

# Load DensePhrases for dialogue and entity linking
>>> model = DensePhrases(
...     load_dir='princeton-nlp/densephrases-multi-query-kilt-multi',
...     dump_dir='/path/to/densephrases-multi_wiki-20181220/dump',
... )

# Retrieve relevant documents for a dialogue
>>> model.search('I love rap music.', retrieval_unit='document', top_k=5)
['Rapping', 'Rap metal', 'Hip hop', 'Hip hop music', 'Hip hop production']

# Run entity linking for the target phrase denoted as [START_ENT] and [END_ENT]
>>> model.search('[START_ENT] Security Council [END_ENT] members expressed concern on Thursday', retrieval_unit='document', top_k=1)
['United Nations Security Council']
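The entity linking query above marks the target mention with [START_ENT] and [END_ENT]. A minimal sketch of building such a query from a character span is shown here; mark_mention is a hypothetical helper written for illustration and is not part of the densephrases package.

```python
START_ENT = "[START_ENT]"
END_ENT = "[END_ENT]"


def mark_mention(text: str, start: int, end: int) -> str:
    """Wrap text[start:end] in the entity markers used by the query above.

    Hypothetical helper for illustration; not part of densephrases.
    """
    if not (0 <= start < end <= len(text)):
        raise ValueError("mention span must lie inside the text")
    return f"{text[:start]}{START_ENT} {text[start:end]} {END_ENT}{text[end:]}"


sentence = "Security Council members expressed concern on Thursday"
query = mark_mention(sentence, 0, len("Security Council"))
print(query)
```

The resulting string can be passed to model.search with retrieval_unit='document', as in the example above.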

We provide more examples, including training a state-of-the-art open-domain question answering model called Fusion-in-Decoder (Izacard and Grave, 2021).

Quick Link

Installation
Resources: datasets, pre-trained models, phrase indexes
Examples
Playing with a DensePhrases Demo
Training, Indexing and Inference
Pre-processing

Installation

# Install torch with conda (please check your CUDA version)
conda create -n densephrases python=3.7
conda activate densephrases
conda install pytorch=1.9.0 cudatoolkit=11.0 -c pytorch

# Install apex
git clone https://www.github.com/nvidia/apex.git
cd apex
python setup.py install
cd ..

# Install DensePhrases
git clone -b v1.0.0 https://github.com/princeton-nlp/DensePhrases.git
cd DensePhrases
pip install -r requirements.txt
python setup.py develop

The main branch uses python==3.7 and transformers==2.9.0. See below for other versions of DensePhrases.

Release	Note	Description
v1.0.0	link	`transformers==2.9.0`, same as `main`
v1.1.0	link	`transformers==4.13.0`

Resources

Before downloading the required files below, please set the default directories as follows and ensure that you have enough storage to download and unzip the files:

# Running config.sh will set the following three environment variables:
# DATA_DIR: for datasets (including 'kilt', 'open-qa', 'single-qa', 'truecase', 'wikidump')
# SAVE_DIR: for pre-trained models or index; new models and index will also be saved here
# CACHE_DIR: for cache files from Huggingface Transformers
source config.sh
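Several later steps (and some of the issues further down) depend on these variables being set in the current shell. A small sketch for checking them from Python follows; missing_env_vars is a hypothetical helper, and only the three variable names come from the comments above.

```python
import os

# Variable names taken from the config.sh comments above.
REQUIRED_VARS = ("DATA_DIR", "SAVE_DIR", "CACHE_DIR")


def missing_env_vars(env=None):
    """Return the names of required variables that are unset or empty."""
    env = os.environ if env is None else env
    return [name for name in REQUIRED_VARS if not env.get(name)]


missing = missing_env_vars()
if missing:
    print("Not set (did you run `source config.sh`?):", ", ".join(missing))
else:
    print("All DensePhrases directories are configured.")
```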

To download the resources described below, you can use download.sh as follows:

# Use bash script to download data (change data to models or index accordingly)
source download.sh
Choose a resource to download [data/wiki/models/index]: data
data will be downloaded at ...
...
Downloading data done!

1. Datasets

Datasets (1GB) - Pre-processed datasets including reading comprehension, generated questions, open-domain QA and slot filling. Download and unzip it under $DATA_DIR or use download.sh.
Wikipedia dumps (5GB) - Pre-processed Wikipedia dumps in different sizes. See here for more details. Download and unzip it under $DATA_DIR or use download.sh.

# Check if the download is complete
ls $DATA_DIR
kilt  open-qa  single-qa  truecase  wikidump

2. Pre-trained Models

Huggingface Transformers

You can use pre-trained models from the Huggingface model hub. Any model name that starts with princeton-nlp (specified in load_dir) will be automatically resolved to a model in our Huggingface model hub.

>>> from densephrases import DensePhrases

# Load densephrases-multi-query-nq from the Huggingface model hub
>>> model = DensePhrases(
...     load_dir='princeton-nlp/densephrases-multi-query-nq',
...     dump_dir='/path/to/densephrases-multi_wiki-20181220/dump',
... )

Model list

Model	Query-FT.	NQ	WebQ	TREC	TriviaQA	SQuAD	Description
densephrases-multi	None	31.9	25.5	35.7	44.4	29.3	EM before any Query-FT.
densephrases-multi-query-multi	Multiple	40.8	35.0	48.8	53.3	34.2	Used for demo

Model	Query-FT. & Eval	EM	Prediction (Test)	Description
densephrases-multi-query-nq	NQ	41.3	link	-
densephrases-multi-query-wq	WebQ	41.5	link	-
densephrases-multi-query-trec	TREC	52.9	link	`--regex` required
densephrases-multi-query-tqa	TriviaQA	53.5	link	-
densephrases-multi-query-sqd	SQuAD	34.5	link	-

Important: all models except densephrases-multi are query-side fine-tuned on the specified dataset (Query-FT.) using the phrase index densephrases-multi_wiki-20181220. Also note that our pre-trained models are case-sensitive models and the best results are obtained when --truecase is on for any lowercased queries (e.g., NQ).

densephrases-multi: trained on multiple reading comprehension datasets (NQ, WebQ, TREC, TriviaQA, SQuAD).
densephrases-multi-query-multi: densephrases-multi query-side fine-tuned on multiple open-domain QA datasets (NQ, WebQ, TREC, TriviaQA, SQuAD).
densephrases-multi-query-*: densephrases-multi query-side fine-tuned on each open-domain QA dataset.

For pre-trained models in other tasks (e.g., slot filling), see examples. Note that most pre-trained models are the results of query-side fine-tuning densephrases-multi.

Download manually

Pre-trained models (8GB) - All pre-trained DensePhrases models (including cross-encoder teacher models spanbert-base-cased-*). Download and unzip it under $SAVE_DIR or use download.sh.

# Check if the download is complete
ls $SAVE_DIR
densephrases-multi  densephrases-multi-query-nq  ...  spanbert-base-cased-squad

>>> from densephrases import DensePhrases

# Load densephrases-multi-query-nq locally
>>> model = DensePhrases(
...     load_dir='/path/to/densephrases-multi-query-nq',
...     dump_dir='/path/to/densephrases-multi_wiki-20181220/dump',
... )

3. Phrase Index

Please note that you don't need to download this phrase index unless you want to work on the full Wikipedia scale.

densephrases-multi_wiki-20181220 (74GB) - Original phrase index (1048576_flat_OPQ96) + metadata for the entire Wikipedia (2018.12.20). Download and unzip it under $SAVE_DIR or use download.sh.

We also provide smaller phrase indexes based on more aggressive filtering (optional).

1048576_flat_OPQ96_medium (37GB) - Medium-sized phrase index
1048576_flat_OPQ96_small (21GB) - Small-sized phrase index

These smaller indexes should be placed under $SAVE_DIR/densephrases-multi_wiki-20181220/dump/start along with any other indexes you downloaded. If you only use a smaller phrase index and don't want to download the large index (74GB), you need to download metadata (20GB) and place it under $SAVE_DIR/densephrases-multi_wiki-20181220/dump folder as shown below. The structure of the files should look like:

$SAVE_DIR/densephrases-multi_wiki-20181220
└── dump
    ├── meta_compressed.pkl
    └── start
        ├── 1048576_flat_OPQ96
        ├── 1048576_flat_OPQ96_medium
        └── 1048576_flat_OPQ96_small
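Before loading the model, it can help to confirm that the dump folder matches this layout. The sketch below checks for the metadata file and one index folder; check_dump_layout is a hypothetical helper, and the file and folder names are the ones shown in the tree above.

```python
import os
from pathlib import Path


def check_dump_layout(dump_dir, index_name="start/1048576_flat_OPQ96"):
    """Return the expected paths that are missing under dump_dir.

    Hypothetical helper: it only checks that the metadata file and the
    chosen index folder exist, not that their contents are valid.
    """
    dump = Path(dump_dir)
    expected = [dump / "meta_compressed.pkl", dump / index_name]
    return [str(path) for path in expected if not path.exists()]


save_dir = os.environ.get("SAVE_DIR", ".")
dump_dir = Path(save_dir) / "densephrases-multi_wiki-20181220" / "dump"
for path in check_dump_layout(dump_dir, "start/1048576_flat_OPQ96_small"):
    print("missing:", path)
```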

All phrase indexes are created from the same model (densephrases-multi), and you can use any of the pre-trained models above with any of these phrase indexes. To change the index, simply set index_name (or --index_name in densephrases/options.py) as follows:

>>> from densephrases import DensePhrases

# Load DensePhrases with a smaller index
>>> model = DensePhrases(
...     load_dir='princeton-nlp/densephrases-multi-query-multi',
...     dump_dir='/path/to/densephrases-multi_wiki-20181220/dump',
...     index_name='start/1048576_flat_OPQ96_small'
... )

The performance of densephrases-multi-query-nq on Natural Questions (test) with different phrase indexes is shown below.

Phrase Index	Open-Domain QA (EM)	Sentence Retrieval (Acc@1/5)	Passage Retrieval (Acc@1/5)	Size	Description
1048576_flat_OPQ96	41.3	48.7 / 66.4	52.6 / 71.5	60GB	evaluated with `eval-index-psg`
1048576_flat_OPQ96_medium	39.9	48.3 / 65.8	52.2 / 70.9	39GB
1048576_flat_OPQ96_small	38.0	47.2 / 64.0	50.7 / 69.1	20GB

Note that the passage retrieval accuracy (Acc@1/5) is generally higher than the reported numbers in the paper since these phrase indexes return natural paragraphs instead of fixed-sized text blocks (i.e., 100 words).

Playing with a DensePhrases Demo

You can run the Wikipedia-scale demo on your own server. For your own demo, you can change the phrase index (obtained from here) or the query encoder (e.g., to densephrases-multi-query-nq).

The resource requirement for running the full Wikipedia scale demo is:

50 ~ 100GB RAM (depending on the size of a phrase index)
Single 11GB GPU (optional)

Note that you no longer need an SSD to run the demo unlike previous phrase retrieval models (DenSPI, DenSPI+Sparc). The following commands serve exactly the same demo as here on your http://localhost:51997.

# Serve a query encoder on port 1111
nohup python run_demo.py \
    --run_mode q_serve \
    --cache_dir $CACHE_DIR \
    --load_dir princeton-nlp/densephrases-multi-query-multi \
    --cuda \
    --max_query_length 32 \
    --query_port 1111 > $SAVE_DIR/logs/q-serve_1111.log &

# Serve a phrase index on port 51997 (takes several minutes)
nohup python run_demo.py \
    --run_mode p_serve \
    --index_name start/1048576_flat_OPQ96 \
    --cuda \
    --truecase \
    --dump_dir $SAVE_DIR/densephrases-multi_wiki-20181220/dump/ \
    --query_port 1111 \
    --index_port 51997 > $SAVE_DIR/logs/p-serve_51997.log &

# Below are the same but simplified commands using Makefile
make q-serve MODEL_NAME=densephrases-multi-query-multi Q_PORT=1111
make p-serve DUMP_DIR=$SAVE_DIR/densephrases-multi_wiki-20181220/dump/ Q_PORT=1111 I_PORT=51997

Please change --load_dir or --dump_dir if necessary and remove --cuda for CPU-only version. Once you set up the demo, the log files in $SAVE_DIR/logs/ will be automatically updated whenever a new question comes in. You can also send queries to your server using mini-batches of questions for faster inference.

# Test on NQ test set
python run_demo.py \
    --run_mode eval_request \
    --index_port 51997 \
    --test_path $DATA_DIR/open-qa/nq-open/test_preprocessed.json \
    --eval_batch_size 64 \
    --save_pred \
    --truecase

# Same command with Makefile
make eval-demo I_PORT=51997

# Result
(...)
INFO - eval_phrase_retrieval -   {'exact_match_top1': 40.83102493074792, 'f1_score_top1': 48.26451418695196}
INFO - eval_phrase_retrieval -   {'exact_match_top10': 60.11080332409972, 'f1_score_top10': 68.47386731458751}
INFO - eval_phrase_retrieval -   Saving prediction file to $SAVE_DIR/pred/test_preprocessed_3610_top10.pred
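If you want to collect these numbers programmatically, the metric lines can be parsed from the log. This is a sketch that assumes the log format is exactly as shown above (a Python dict literal after the eval_phrase_retrieval prefix); parse_eval_metrics is a hypothetical helper.

```python
import ast
import re

# Matches lines such as: INFO - eval_phrase_retrieval -   {'exact_match_top1': ...}
METRIC_LINE = re.compile(r"eval_phrase_retrieval\s+-\s+(\{.*\})\s*$")


def parse_eval_metrics(log_text):
    """Merge every metrics dict found in eval_phrase_retrieval log lines."""
    metrics = {}
    for line in log_text.splitlines():
        match = METRIC_LINE.search(line)
        if match:
            # literal_eval only accepts Python literals, so it is safe on log text.
            metrics.update(ast.literal_eval(match.group(1)))
    return metrics


sample_log = """\
INFO - eval_phrase_retrieval -   {'exact_match_top1': 40.83102493074792, 'f1_score_top1': 48.26451418695196}
INFO - eval_phrase_retrieval -   {'exact_match_top10': 60.11080332409972, 'f1_score_top10': 68.47386731458751}
INFO - eval_phrase_retrieval -   Saving prediction file to $SAVE_DIR/pred/test_preprocessed_3610_top10.pred
"""
metrics = parse_eval_metrics(sample_log)
print(round(metrics["exact_match_top1"], 2), round(metrics["exact_match_top10"], 2))
```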

For more details (e.g., changing the test set), please see the targets in Makefile (q-serve, p-serve, eval-demo, etc).

DensePhrases: Training, Indexing and Inference

In this section, we introduce a step-by-step procedure to train DensePhrases, create phrase vectors and indexes, and run inferences with the trained model. All of our commands here are simplified as Makefile targets, which include exact dataset paths, hyperparameter settings, etc.

If the following test run completes without an error after the installation and the download, you are good to go!

# Test run for checking installation (takes about 10 mins; ignore the performance)
make draft MODEL_NAME=test

A figure summarizing the overall process below

1. Training phrase and query encoders

To train DensePhrases from scratch, use run-rc-nq in Makefile, which trains DensePhrases on NQ (pre-processed for the reading comprehension task) and evaluates it on reading comprehension as well as on (semi) open-domain QA. You can change the training set by modifying the dependencies of run-rc-nq (e.g., nq-rc-data => sqd-rc-data and nq-param => sqd-param for training on SQuAD). You'll need a single 24GB GPU for training DensePhrases on reading comprehension tasks, but you can use smaller GPUs by setting --gradient_accumulation_steps appropriately.

# Train DensePhrases on NQ with Eq. 9 in Lee et al., ACL'21
make run-rc-nq MODEL_NAME=densephrases-nq

run-rc-nq consists of the following six commands (shown here for training on NQ):

make train-rc ...: Train DensePhrases on NQ with Eq. 9 (L = lambda1 L_single + lambda2 L_distill + lambda3 L_neg) with generated questions.
make train-rc ...: Load trained DensePhrases in the previous step and further train it with Eq. 9 with pre-batch negatives.
make gen-vecs: Generate phrase vectors for D_small (= set of all passages in NQ dev).
make index-vecs: Build a phrase index for D_small.
make compress-meta: Compress metadata for faster inference.
make eval-index ...: Evaluate the phrase index on the development set questions.

At the end of step 2, you will see the performance on the reading comprehension task where a gold passage is given (about 72.0 EM on NQ dev). Step 6 gives the performance in the semi-open-domain setting (denoted as D_small; see Table 6 in the paper), where all passages from the NQ development set are used for indexing (about 62.0 EM with NQ dev questions). The trained model will be saved under $SAVE_DIR/$MODEL_NAME. Note that during single-passage training on NQ, we exclude some questions in the development set, namely those whose annotated answers are found in a list or a table.

2. Creating a phrase index

Let's assume that you have a pre-trained DensePhrases named densephrases-multi, which can also be downloaded from here. Now, you can generate phrase vectors for a large-scale corpus like Wikipedia using gen-vecs-parallel. Note that you can just download the phrase index for the full Wikipedia scale and skip this section.

# Generate phrase vectors in parallel for a large-scale corpus (default = wiki-dev)
make gen-vecs-parallel MODEL_NAME=densephrases-multi START=0 END=8

The default text corpus for creating phrase vectors is wiki-dev located in $DATA_DIR/wikidump. We have three options for larger text corpora:

wiki-dev: 1/100 Wikipedia scale (sampled), 8 files
wiki-dev-noise: 1/10 Wikipedia scale (sampled), 500 files
wiki-20181220: full Wikipedia (20181220) scale, 5621 files

The wiki-dev* corpora also contain passages from the NQ development set, so that you can track the performance of your model as the text corpus grows (performance usually decreases as the corpus gets larger). The phrase vectors will be saved as hdf5 files in $SAVE_DIR/$(MODEL_NAME)_(data_name)/dump (e.g., $SAVE_DIR/densephrases-multi_wiki-dev/dump), which will be referred to as $DUMP_DIR below.

Parallelization

START and END specify the file index in the corpus (e.g., START=0 END=8 for wiki-dev and START=0 END=5621 for wiki-20181220). Each run of gen-vecs-parallel only consumes 2GB in a single GPU, and you can distribute the processes with different START and END using slurm or a shell script (e.g., START=0 END=200, START=200 END=400, ..., START=5400 END=5621). Distributing 28 processes on 4 24GB GPUs (each processing about 200 files) can create phrase vectors for wiki-20181220 in 8 hours. Processing the entire Wikipedia requires up to 500GB of storage for the phrase vectors, and we recommend storing them on an SSD if possible (a smaller corpus can be stored on an HDD).
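The START/END pairs for a given number of processes can be computed rather than written by hand. The sketch below assumes END is exclusive, which is inferred from the examples above (START=0 END=8 for 8 files, then START=200 following END=200); split_ranges is a hypothetical helper, not part of the repository.

```python
def split_ranges(num_files, num_procs):
    """Split [0, num_files) into num_procs contiguous (start, end) ranges.

    END is treated as exclusive, an assumption based on the examples above.
    Sizes differ by at most one file across processes.
    """
    if num_procs < 1 or num_files < num_procs:
        raise ValueError("need at least one file per process")
    base, extra = divmod(num_files, num_procs)
    ranges, start = [], 0
    for i in range(num_procs):
        end = start + base + (1 if i < extra else 0)
        ranges.append((start, end))
        start = end
    return ranges


# 5621 files of wiki-20181220 spread over 28 processes (about 200 files each).
for start, end in split_ranges(5621, 28):
    print(f"START={start} END={end}")
```

Each printed pair is meant to be passed to one gen-vecs-parallel run; how the corpus itself is selected is defined in the Makefile and is not covered by this sketch.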

After generating the phrase vectors, you need to create a phrase index for the sublinear time search of phrases. Here, we use IVFOPQ for the phrase index.

# Create IVFOPQ index for a set of phrase vectors
make index-vecs DUMP_DIR=$SAVE_DIR/densephrases-multi_wiki-dev/dump/

For wiki-dev-noise and wiki-20181220, you need to modify the number of clusters to 101,372 and 1,048,576, respectively (simply change medium1-index in index-vecs to medium2-index or large-index). For wiki-20181220 (full Wikipedia), this takes about 1~2 days depending on the specification of your machine and requires about 100GB RAM. For IVFSQ as described in the paper, you can use index-add and index-merge to distribute the addition of phrase vectors to the index.

You also need to compress the metadata (saved in hdf5 files together with phrase vectors) for faster inference with DensePhrases. This is mandatory for the IVFOPQ index.

# Compress metadata of wiki-dev
make compress-meta DUMP_DIR=$SAVE_DIR/densephrases-multi_wiki-dev/dump

For evaluating the performance of DensePhrases with your phrase indexes, use eval-index.

# Evaluate on the NQ test set questions
make eval-index MODEL_NAME=densephrases-multi DUMP_DIR=$SAVE_DIR/densephrases-multi_wiki-dev/dump/

3. Query-side fine-tuning

Query-side fine-tuning makes DensePhrases a versatile tool for retrieving multi-granularity text for different types of input queries. While query-side fine-tuning can also improve the performance on QA datasets, it can be used to adapt DensePhrases to non-QA style input queries such as "subject [SEP] relation" to retrieve object entities or "I love rap music." to retrieve relevant documents on rapping.

First, you need a phrase index for the full Wikipedia (wiki-20181220), which can be simply downloaded here, or a custom phrase index as described here. Given your query-answer or query-document pairs pre-processed as json files in $DATA_DIR/open-qa or $DATA_DIR/kilt, you can easily query-side fine-tune your model. For instance, the training set of T-REx ($DATA_DIR/kilt/trex/trex-train-kilt_open_10000.json) looks as follows:

{
    "data": [
        {
            "id": "111ed80f-0a68-4541-8652-cb414af315c5",
            "question": "Effie Germon [SEP] occupation",
            "answers": [
                "actors",
                ...
            ]
        },
        ...
    ]
}
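A file in this layout can be produced from your own query-answer pairs with a few lines of Python. This is a sketch under stated assumptions: build_query_file is a hypothetical helper, and generating the id with uuid4 is a guess based on the UUID-like id in the sample above.

```python
import json
import uuid


def build_query_file(pairs):
    """Convert (question, answers) pairs into the {"data": [...]} layout above.

    The random UUID used for "id" is an assumption; any unique string
    should serve the same purpose.
    """
    return {
        "data": [
            {"id": str(uuid.uuid4()), "question": question, "answers": list(answers)}
            for question, answers in pairs
        ]
    }


pairs = [("Effie Germon [SEP] occupation", ["actors"])]
dataset = build_query_file(pairs)
print(json.dumps(dataset, indent=4))
```

Writing the result with json.dump to a file under $DATA_DIR/open-qa or $DATA_DIR/kilt gives the kind of input described above.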

The following command query-side fine-tunes densephrases-multi on T-REx.

# Query-side fine-tune on T-REx (model will be saved as MODEL_NAME)
make train-query MODEL_NAME=densephrases-multi-query-trex DUMP_DIR=$SAVE_DIR/densephrases-multi_wiki-20181220/dump/

Note that the pre-trained query encoder is specified in train-query as --load_dir $(SAVE_DIR)/densephrases-multi and a new model will be saved as densephrases-multi-query-trex as specified in MODEL_NAME. You can also train on different datasets by changing the dependency trex-open-data to *-open-data (e.g., ay2-kilt-data for entity linking).

4. Inference

With any DensePhrases query encoder (e.g., densephrases-multi-query-nq) and a phrase index (e.g., densephrases-multi_wiki-20181220), you can test your queries as follows. The retrieval results will be saved as a json file when the --save_pred option is set:

# Evaluate on Natural Questions
make eval-index MODEL_NAME=densephrases-multi-query-nq DUMP_DIR=$SAVE_DIR/densephrases-multi_wiki-20181220/dump/

# If the demo is being served on http://localhost:51997
make eval-demo I_PORT=51997

For the evaluation on different datasets, simply change the dependency of eval-index (or eval-demo) accordingly (e.g., nq-open-data to trec-open-data for the evaluation on CuratedTREC).

Pre-processing

At the bottom of Makefile, we list commands that we used for pre-processing the datasets and Wikipedia. For training question generation models (T5-large), we used https://github.com/patil-suraj/question_generation (see also here for QG). Note that all datasets are already pre-processed including the generated questions, so you do not need to run most of these scripts. For creating test sets for custom (open-domain) questions, see preprocess-openqa in Makefile.

Questions?

Feel free to email Jinhyuk Lee ([email protected]) for any questions related to the code or the paper. You can also open a Github issue. Please try to specify the details so we can better understand and help you solve the problem.

References

Please cite our paper if you use DensePhrases in your work:

@inproceedings{lee2021learning,
    title={Learning Dense Representations of Phrases at Scale},
    author={Lee, Jinhyuk and Sung, Mujeen and Kang, Jaewoo and Chen, Danqi},
    booktitle={Association for Computational Linguistics (ACL)},
    year={2021}
}

@inproceedings{lee2021phrase,
    title={Phrase Retrieval Learns Passage Retrieval, Too},
    author={Lee, Jinhyuk and Wettig, Alexander and Chen, Danqi},
    booktitle={Conference on Empirical Methods in Natural Language Processing (EMNLP)},
    year={2021},
}

License

Please see LICENSE for details.

densephrases's Issues

phrase vector generation

While running this code :

python generate_phrase_vecs.py --model_type bert --pretrained_name_or_path SpanBERT/spanbert-base-cased --data_dir ./ --cache_dir $CACHE_DIR --predict_file sample/articles.json --do_dump --max_seq_length 512 --doc_stride 500 --fp16 --filter_threshold -2.0 --append_title --load_dir $SAVE_DIR/densephrases-multi --output_dir $SAVE_DIR/densephrases-multi_sample

I am getting below error:

  File "generate_phrase_vecs.py", line 396, in <module>
    main()
  File "generate_phrase_vecs.py", line 392, in main
    dump_phrases(args, model, tokenizer)
  File "generate_phrase_vecs.py", line 85, in dump_phrases
    args, tokenizer, evaluate=True, output_examples=True, context_only=True
  File "/home/neerajku/neerajAvatar/DensePhrases/densephrases/utils/squad_utils.py", line 1199, in load_and_cache_examples
    context_only=context_only, args=args)
  File "/home/neerajku/neerajAvatar/DensePhrases/densephrases/utils/squad_utils.py", line 792, in get_dev_examples
    return self._create_examples(input_data, "dev", draft, context_only=context_only, args=args)
  File "/home/neerajku/neerajAvatar/DensePhrases/densephrases/utils/squad_utils.py", line 805, in _create_examples
    truecase = TrueCaser(os.path.join(os.environ['DATA_DIR'], args.truecase_path))
  File "/home/neerajku/neerajAvatar/DensePhrases/densephrases/utils/squad_utils.py", line 1301, in __init__
    with open(dist_file_path, "rb") as distributions_file:
FileNotFoundError: [Errno 2] No such file or directory: '/mnt/neerajku/DensePhrases//densephrases-data/truecase/english_with_questions.dist'

Could you help me resolve this?

Unable to Reproduce Passage Retrieval Results on NQ

Hi Jinhyuk,

I was trying to reproduce the third row of Table 1 in your paper (https://arxiv.org/pdf/2109.08133.pdf). I'm using the index and the pre-trained checkpoint on NQ you gave me several days ago. Here are my results:

Top-1 = 34.32%
Top-5 = 54.13%
Top-20 = 66.59%
Top-100 = 76.43%
Acc@1 when Acc@100 = 44.91%
MRR@20 = 43.12
P@20 = 14.61

Here's the command I use:

make eval-index-psg MODEL_NAME=densephrases-nq-query-nq DUMP_DIR=densephrases-nq_wiki-20181220-p100/dump/ TEST_DATA=open-qa/nq-open/test_preprocessed.json

Any idea what I might be doing wrong? Thanks in advance.

Minghan

IndexError: index 99 is out of bounds for axis 0 with size 35

I created a custom index and finished all the steps, but when I tested it I got this error:

 File "/DensePhrases/densephrases/index.py", line 507, in search
    outs = self.search_phrase(
  File "/DensePhrases/densephrases/index.py", line 428, in search_phrase
    out = [{
  File "/DensePhrases/densephrases/index.py", line 431, in <listcomp>
    'end_pos': (groups_all[doc_idx]['word2char_end'][groups_all[doc_idx]['f2o_start'][end_idx]].item()
IndexError: index 99 is out of bounds for axis 0 with size 35

How to choose phrase to encode in wikipedia document

Hi there,
In this issue, you point out that you use span answers in the dataset as phrases for training. What about the inference step, when you have a new corpus: how do you choose which phrases in a document to encode? Do you extract entities in the document as phrases, or just take any contiguous segment of text up to L words?

Iterative retrieval in case of non-unique top-k retrieval

Hi! Thanks for this amazing work, and for making your code open-source.

I'm trying to figure out where in the code non-unique passage retrieval is handled so that the final k results are unique. According to this footnote on page 3 of your paper "Phrase Retrieval Learns Passage Retrieval, Too", it seems that you perform iterative retrieval to achieve this. Could you point me to the code where this happens?

DensePhrases for non-answerable questions

Hello!

Thanks for the amazing work on phrase retrieval. I was recently working on a question answering project and stumbled upon this work. Our project, however, involves non-answerable questions as in SQuAD 2.0.

To prevent answering, I tried to threshold the score obtained in the meta data of the results, but it seemed that it was not well calibrated (since it is not trained on non-answerable sets)

Could you please guide me on how I could train this model to perform a good distinction between answerable and non-answerable questions. Otherwise, advice on properly using score to achieve this would also be great.

Significance of line 174 in train_query.py code

Hi,

I was going through the code for query finetuning and I am not able to understand one condition in the code:

Is the highlighted line above redundant? If not, what is its significance? (I feel we could update the encoder directly.) I just want to make sure that I am not missing anything.

Recipe to build dense representations from corpus

Hi,

I'm trying to create dense representations from my corpus and search paragraphs/phrases by keywords or a question. I don't have labeled questions and answers, and for now I don't need to get answers, just to retrieve documents possibly containing the answer.

I built a JSON file with my corpus (pt-br) like this:

{
    "data": [
        {
            "title": "Radicais livres: o que são, efeitos no corpo e como se proteger",
            "paragraphs": [
                {
                    "context": "Os radicais livres ..."
                },
                {
                    "context": "Desta forma, quanto menos radicais livres, ..."
                }, ...

then I ran the following commands:

python generate_phrase_vecs.py \
    --pretrained_name_or_path SpanBERT/spanbert-base-cased \
    --data_dir ./data \
    --cache_dir ./cache \
    --test_file ../tua-saude/all_data.json \
    --do_dump \
    --max_seq_length 512 \
    --fp16 \
    --filter_threshold -2.0 \
    --append_title \
    --output_dir ./data/densephrases-multi_sample \
    --load_dir princeton-nlp/densephrases-multi
    
python build_phrase_index.py \
    --dump_dir ./data/densephrases-multi_sample/dump \
    --stage all \
    --replace \
    --num_clusters 128 \
    --fine_quant OPQ96 \
    --doc_sample_ratio 0.3 \
    --vec_sample_ratio 0.3 \
    --cuda
    
python scripts/preprocess/compress_metadata.py \
    --input_dump_dir ./data/densephrases-multi_sample/dump/phrase \
    --output_dir ./data/densephrases-multi_sample/dump

Those commands seem to work fine. Here are the contents of output_dir:

Now, when I try to use the model:

model = DensePhrases(
     load_dir='princeton-nlp/densephrases-multi',
     dump_dir='./data/densephrases-multi_sample/dump/',
     index_name='start/128_flat_OPQ96'
)

This error is raised:

>>> 
This could take up to 15 mins depending on the file reading speed of HDD/SSD
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/projetos/u4vn/DensePhrases/densephrases/model.py", line 52, in __init__
    self.truecase = TrueCaser(os.path.join(os.environ['DATA_DIR'], self.args.truecase_path))
  File "/projetos/u4vn/DensePhrases/densephrases/utils/data_utils.py", line 366, in __init__
    with open(dist_file_path, "rb") as distributions_file:
FileNotFoundError: [Errno 2] No such file or directory: './data/truecase/english_with_questions.dist'

What am I missing? What file is this?
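The traceback shows the model joining DATA_DIR with a truecase path, and the Datasets section above lists truecase among the folders that come with the pre-processed datasets. A minimal pre-flight check is sketched below; the relative path is inferred from the error message rather than taken from the library's options, and truecase_file is a hypothetical helper.

```python
import os

# Relative path inferred from the FileNotFoundError above; treat it as an assumption.
TRUECASE_REL_PATH = os.path.join("truecase", "english_with_questions.dist")


def truecase_file(data_dir):
    """Return (path, exists) for the truecase distribution file under data_dir."""
    path = os.path.join(data_dir, TRUECASE_REL_PATH)
    return path, os.path.isfile(path)


path, exists = truecase_file(os.environ.get("DATA_DIR", "./data"))
print(path, "found" if exists else "missing")
```

If the file is reported missing, checking that DATA_DIR points at the folder where the datasets were unzipped is a reasonable first step.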

How to extract phrases from Wikipedia?

Hi!

First of all thanks a lot for this solid project!

I just want to figure out how to extract phrases from Wikipedia? Which script is the right one?
I am a little confused when I see so many scripts in the preprocess folder.

Can pre-batch negatives be used to train DPR?

Representations of phrases

Hi,

Thanks for the interesting project!

One question: If I want to get only phrase representations from your pre-trained model, how can I do that? I plan to use them as baselines. Thank you!

Best,
Jiacheng

run_demo.py : IndexError: index out of range in self

Hi,
I'm trying to use DensePhrases for Korean rather than English. The model is bert-base-multilingual-cased (not SpanBERT, because it doesn't support Korean), trained on the KorQuAD dataset. The dataset, sourced from Korean Wikipedia, was manually preprocessed and used to create a custom phrase index.

The error that I encountered when running run_demo.py is as shown below:


02/14/2023 07:08:04 - ERROR - tornado.access -   500 POST /query2vec_api (127.0.0.1) 11.09ms
02/14/2023 07:08:04 - INFO - densephrases.utils.embed_utils -   ---feature : {'input_ids': [101, 48556, 16617, 119202, 9323, 51431, 14040, 34907, 10892, 102, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], 'token_type_ids': [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], 'qas_id': 0, 'question_text': '한국전쟁 발발시점은'} , 
 example['question_text'] : 한국전쟁 발발시점은
02/14/2023 07:08:04 - INFO - densephrases.utils.embed_utils -   ---query_eval_features [{'input_ids': [101, 48556, 16617, 119202, 9323, 51431, 14040, 34907, 10892, 102, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], 'token_type_ids': [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], 'qas_id': 0, 'question_text': '한국전쟁 발발시점은', 'unique_id': 1000000000}]---
02/14/2023 07:08:04 - INFO - densephrases.utils.embed_utils -   ---question_dataloader <torch.utils.data.dataloader.DataLoader object at 0x7f76d1e7da60>---
02/14/2023 07:08:04 - INFO - densephrases.utils.embed_utils -   ---DEVICE cpu---
02/14/2023 07:08:04 - INFO - densephrases.utils.embed_utils -   ---BATCH [tensor([[   101,  48556,  16617, 119202,   9323,  51431,  14040,  34907,  10892,
            102,      0,      0,      0,      0,      0,      0,      0,      0,
              0,      0,      0,      0,      0,      0,      0,      0,      0,
              0,      0,      0,      0,      0]]), tensor([[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
         0, 0, 0, 0, 0, 0, 0, 0]]), tensor([[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
         0, 0, 0, 0, 0, 0, 0, 0]]), tensor([0])]---
02/14/2023 07:08:04 - INFO - densephrases.utils.embed_utils -   ---len(batch) 4---
02/14/2023 07:08:04 - INFO - densephrases.utils.embed_utils -   --tmp 1---
02/14/2023 07:08:04 - ERROR - run_demo -   Exception on /query2vec_api [POST]
Traceback (most recent call last):
  File "/opt/conda/envs/densephrases-v1.1.0/lib/python3.8/site-packages/flask/app.py", line 2073, in wsgi_app
    response = self.full_dispatch_request()
  File "/opt/conda/envs/densephrases-v1.1.0/lib/python3.8/site-packages/flask/app.py", line 1518, in full_dispatch_request
    rv = self.handle_user_exception(e)
  File "/opt/conda/envs/densephrases-v1.1.0/lib/python3.8/site-packages/flask_cors/extension.py", line 165, in wrapped_function
    return cors_after_request(app.make_response(f(*args, **kwargs)))
  File "/opt/conda/envs/densephrases-v1.1.0/lib/python3.8/site-packages/flask/app.py", line 1516, in full_dispatch_request
    rv = self.dispatch_request()
  File "/opt/conda/envs/densephrases-v1.1.0/lib/python3.8/site-packages/flask/app.py", line 1502, in dispatch_request
    return self.ensure_sync(self.view_functions[rule.endpoint])(**req.view_args)
  File "run_demo.py", line 59, in query2vec_api
    outs = list(query2vec(batch_query))
  File "/data/code-server/baseline/DensePhrases/densephrases/utils/embed_utils.py", line 412, in get_question_results
    outputs = model(**inputs)
  File "/opt/conda/envs/densephrases-v1.1.0/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1102, in _call_impl
    return forward_call(*input, **kwargs)
  File "/data/code-server/baseline/DensePhrases/densephrases/encoder.py", line 177, in forward
    query_start, query_end = self.embed_query(input_ids_, attention_mask_, token_type_ids_)
  File "/data/code-server/baseline/DensePhrases/densephrases/encoder.py", line 132, in embed_query
    outputs_s_ = self.query_start_encoder(
  File "/opt/conda/envs/densephrases-v1.1.0/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1102, in _call_impl
    return forward_call(*input, **kwargs)
  File "/opt/conda/envs/densephrases-v1.1.0/lib/python3.8/site-packages/transformers/models/bert/modeling_bert.py", line 992, in forward
    embedding_output = self.embeddings(
  File "/opt/conda/envs/densephrases-v1.1.0/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1102, in _call_impl
    return forward_call(*input, **kwargs)
  File "/opt/conda/envs/densephrases-v1.1.0/lib/python3.8/site-packages/transformers/models/bert/modeling_bert.py", line 214, in forward
    inputs_embeds = self.word_embeddings(input_ids)
  File "/opt/conda/envs/densephrases-v1.1.0/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1102, in _call_impl
    return forward_call(*input, **kwargs)
  File "/opt/conda/envs/densephrases-v1.1.0/lib/python3.8/site-packages/torch/nn/modules/sparse.py", line 158, in forward
    return F.embedding(
  File "/opt/conda/envs/densephrases-v1.1.0/lib/python3.8/site-packages/torch/nn/functional.py", line 2044, in embedding
    return torch.embedding(weight, input, padding_idx, scale_grad_by_freq, sparse)
IndexError: index out of range in self
02/14/2023 07:08:04 - ERROR - tornado.access -   500 POST /query2vec_api (127.0.0.1) 7.52ms

I tried to figure out the input and output shapes, but I still don't know how to get them.
I would appreciate any help. Thanks in advance.
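
One thing the log above does show: the query contains token id 119202, and torch.embedding raises "IndexError: index out of range in self" whenever an id is not smaller than the number of rows in the embedding table. A minimal sketch of that check follows, in plain Python. The vocabulary sizes are assumptions about the public checkpoints (119547 for bert-base-multilingual-cased, 28996 for bert-base-cased), and out_of_range_ids is a hypothetical helper, not part of DensePhrases.

```python
# Token ids copied from the 'input_ids' field in the log above (padding removed).
INPUT_IDS = [101, 48556, 16617, 119202, 9323, 51431, 14040, 34907, 10892, 102]

# Assumed vocabulary sizes of the public checkpoints; verify against your own config.json.
ASSUMED_VOCAB_SIZES = {
    "bert-base-multilingual-cased": 119547,
    "bert-base-cased": 28996,
}


def out_of_range_ids(input_ids, vocab_size):
    """Return the ids that an embedding table with `vocab_size` rows cannot look up."""
    return [i for i in input_ids if i < 0 or i >= vocab_size]


for name, size in ASSUMED_VOCAB_SIZES.items():
    bad = out_of_range_ids(INPUT_IDS, size)
    print(f"{name}: {len(bad)} out-of-range ids {bad}")
```

If the smaller vocabulary matches what the query encoder was built with, the encoder and the tokenizer may be using different configs; comparing len(tokenizer) with the vocab_size of the loaded model config would confirm or rule that out.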

Training fails with "make draft MODEL_NAME=test"

The logs are as follows. Thanks.

convert squad examples to features: 100%|█████████████████████████████████████████████████████████████████████████| 902/902 [00:00<00:00, 2092.37it/s]
add example index and unique id: 100%|██████████████████████████████████████████████████████████████████████████| 902/902 [00:00<00:00, 439863.06it/s]
06/14/2022 22:26:05 - INFO - main - Number of trainable params: 258,127,108
Selected optimization level O1: Insert automatic casts around Pytorch functions and Tensor methods.

Defaults for this optimization level are:
enabled : True
opt_level : O1
cast_model_type : None
patch_torch_functions : True
keep_batchnorm_fp32 : None
master_weights : None
loss_scale : dynamic
Processing user overrides (additional kwargs that are not None)...
After processing overrides, optimization options are:
enabled : True
opt_level : O1
cast_model_type : None
patch_torch_functions : True
keep_batchnorm_fp32 : None
master_weights : None
loss_scale : dynamic
Warning: multi_tensor_applier fused unscale kernel is unavailable, possibly because apex was installed without --cuda_ext --cpp_ext. Using Python fallback. Original ImportError was: ModuleNotFoundError("No module named 'amp_C'")
06/14/2022 22:26:05 - INFO - main - ***** Running training *****
06/14/2022 22:26:05 - INFO - main - Num examples = 1218
06/14/2022 22:26:05 - INFO - main - Num Epochs = 2
06/14/2022 22:26:05 - INFO - main - Instantaneous batch size per GPU = 48
06/14/2022 22:26:05 - INFO - main - Total train batch size (w. parallel, distributed & accumulation) = 384
06/14/2022 22:26:05 - INFO - main - Gradient Accumulation steps = 1
06/14/2022 22:26:05 - INFO - main - Total optimization steps = 8
Epoch: 0%| | 0/2 [00:00<?, ?it/s]06/14/2022 22:26:05 - INFO - main -
[Epoch 1]
06/14/2022 22:26:05 - INFO - main - Initialize pre-batch of size 2 for Epoch 1

Traceback (most recent call last):
File "train_rc.py", line 593, in <module>
main()
File "train_rc.py", line 537, in main
global_step, tr_loss = train(args, train_dataset, model, tokenizer)
File "train_rc.py", line 222, in train
outputs = model(**inputs)
File "/ssd3/wangxiao/anaconda3/envs/py38/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1051, in _call_impl
return forward_call(*input, **kwargs)
File "/ssd3/wangxiao/anaconda3/envs/py38/lib/python3.8/site-packages/torch/nn/parallel/data_parallel.py", line 168, in forward
outputs = self.parallel_apply(replicas, inputs, kwargs)
File "/ssd3/wangxiao/anaconda3/envs/py38/lib/python3.8/site-packages/torch/nn/parallel/data_parallel.py", line 178, in parallel_apply
return parallel_apply(replicas, inputs, kwargs, self.device_ids[:len(replicas)])
File "/ssd3/wangxiao/anaconda3/envs/py38/lib/python3.8/site-packages/torch/nn/parallel/parallel_apply.py", line 86, in parallel_apply
output.reraise()
File "/ssd3/wangxiao/anaconda3/envs/py38/lib/python3.8/site-packages/torch/_utils.py", line 425, in reraise
raise self.exc_type(msg)
StopIteration: Caught StopIteration in replica 0 on device 0.
Original Traceback (most recent call last):
File "/ssd3/wangxiao/anaconda3/envs/py38/lib/python3.8/site-packages/torch/nn/parallel/parallel_apply.py", line 61, in _worker
output = module(*input, **kwargs)
File "/ssd3/wangxiao/anaconda3/envs/py38/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1051, in _call_impl
return forward_call(*input, **kwargs)
File "/ssd1/zhangyiming/densephrase/DensePhrases/densephrases/encoder.py", line 132, in forward
start, end = self.embed_phrase(input_ids, attention_mask, token_type_ids)
File "/ssd1/zhangyiming/densephrase/DensePhrases/densephrases/encoder.py", line 94, in embed_phrase
outputs = self.phrase_encoder(
File "/ssd3/wangxiao/anaconda3/envs/py38/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1051, in _call_impl
return forward_call(*input, **kwargs)
File "/ssd3/wangxiao/anaconda3/envs/py38/lib/python3.8/site-packages/transformers/modeling_bert.py", line 707, in forward
attention_mask, input_shape, self.device
File "/ssd3/wangxiao/anaconda3/envs/py38/lib/python3.8/site-packages/transformers/modeling_utils.py", line 113, in device
return next(self.parameters()).device
StopIteration
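
The last frames show next(self.parameters()) raising StopIteration inside a DataParallel replica. A commonly reported cause, stated here as an assumption about this setup, is that newer PyTorch releases (1.5 and later) give replicas no registered parameters, which older transformers releases do not expect. The sketch below reproduces the failure mode in plain Python; first_device is a hypothetical helper and the dictionaries stand in for real parameters.

```python
def first_device(parameters, default=None):
    """Return the device of the first parameter, or `default` if there are none."""
    # next() with a default value avoids StopIteration on an empty iterator.
    first = next(iter(parameters), None)
    return default if first is None else first["device"]


# A normal module has at least one parameter, so the lookup works.
print("normal module:", first_device([{"device": "cuda:0"}]))

# A replica with no registered parameters: the bare next() pattern raises StopIteration.
try:
    next(iter([]))
    raised = False
except StopIteration:
    raised = True
print("bare next() raised StopIteration:", raised)
print("guarded lookup returns:", first_device([], default="cpu"))
```

Running on a single GPU (for example by setting CUDA_VISIBLE_DEVICES=0) avoids DataParallel entirely and is a quick way to test this hypothesis.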

Implementation of contrastive loss with in-passage negatives

Thanks for sharing this interesting paper and code.

As far as I understand, all the loss functions used in the paper are implemented in encoder.py. However, it is unclear to me which part of the code handles in-passage negatives (defined in Section 4.1 of the EMNLP 2021 paper). Could you point me to the relevant parts of the code? Thank you.

Great idea. My understanding is that this is an inverted index over vectors, right?

Question about faiss parameter

Hi,

Thanks for the amazing work! May I ask how you chose the parameters for the faiss index, such as the number of clusters and the quantization type OPQ96? It seems that the number of clusters varies with the number of phrases to store.

Thanks!
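
For context, the faiss wiki's general guideline for IVF indexes on datasets of up to about a million vectors ties the number of clusters K to the number of vectors N, roughly between 4*sqrt(N) and 16*sqrt(N), with 30*K to 256*K vectors used for training. The sketch below applies that rule of thumb. It is a generic faiss heuristic, not necessarily how the DensePhrases authors chose their values, and both function names are hypothetical.

```python
import math


def suggest_num_clusters(num_vectors):
    """Return (low, high) cluster counts from the 4*sqrt(N) .. 16*sqrt(N) rule of thumb."""
    root = math.sqrt(num_vectors)
    return int(4 * root), int(16 * root)


def training_vectors_needed(num_clusters):
    """Return (min, max) training-set sizes from the 30*K .. 256*K rule of thumb."""
    return 30 * num_clusters, 256 * num_clusters


for n in (10_000, 1_000_000):
    low, high = suggest_num_clusters(n)
    print(f"N={n}: K between {low} and {high}; "
          f"training set for K={low}: {training_vectors_needed(low)}")
```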

Missing files for preprocessing natural questions

Hi,

I am trying to build the dataset for Natural Questions. According to the Makefile, a file seems to be missing.
Is there any plan to include it in a future release?

Missing file scripts/preprocess/create_nq_reader.py
https://github.com/princeton-nlp/DensePhrases/blob/main/Makefile#L384

How to train cross-encoder teacher models?

It seems that I cannot find the code to train the cross-encoder teacher models, such as

spanbert-base-cased-nq
spanbert-base-cased-sqdnq
spanbert-base-cased-squad

Can you tell me how to train these models if I change the training data?

Missing files for preprocessing Wikipedia

Hi!

First of all thanks a lot for this amazing project!!

While going through the repo, I noticed that some scripts are missing from the scripts/preprocess folder, namely:

build_db.py
prep_wikipedia.py
build_wikisquad.py

Thanks in advance :)

Pointers on using this for retrieval-reader problem with growing context

Thanks for the great work! I have read the paper and understood the gist of the approach, but I am out of my depth on a few points, so I wanted to clarify them here and get some suggestions.
My problem statement is straightforward: I have thousands of documents as context, and I need to retrieve an answer to a query in real time. I came across your approach and would like to try it out and possibly adapt it to an industry setting. One other constraint is that the context grows: new documents are added over time, and retrieval needs to support that.

Do you think your approach is suitable for this use-case?

Thanks in Advance

Train custom dataset

Hi Jhyuklee,

Thank you for the good work and support.
I have one question. I want to use my own PDF statements as the dump in place of the Wikipedia dump, so that the model retrieves information from the PDF data rather than from Wikipedia.

Do I need to train from scratch on our whole dump, or is there a way to fine-tune the model starting from the checkpoints you trained?

Please advise.

Issue while creating faiss index, Command is not clear

Hi,

What does "all" mean in this command? I get an unrecognized command error when I remove it.

python build_phrase_index.py \
    $SAVE_DIR/densephrases-multi_sample/dump all \
    --replace \
    --num_clusters 32 \
    --fine_quant OPQ96 \
    --doc_sample_ratio 1.0 \
    --vec_sample_ratio 1.0 \
    --cuda

I worked around that by passing --dump_dir before the path, but the command doesn't create anything. Please see the screenshot below.
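
The error after removing "all" is consistent with the script declaring two positional arguments, the dump directory and a stage name. The sketch below shows that behaviour with a stand-in parser. The argument names (dump_dir, stage) and the meaning of "all" are assumptions for illustration and should be checked against build_phrase_index.py itself.

```python
import argparse


def build_parser():
    """Hypothetical stand-in for the script's parser: two positionals plus a few options."""
    parser = argparse.ArgumentParser(prog="build_phrase_index.py")
    parser.add_argument("dump_dir")   # first positional: where the phrase dump lives
    parser.add_argument("stage")      # second positional: "all" is assumed to run every stage
    parser.add_argument("--replace", action="store_true")
    parser.add_argument("--num_clusters", type=int, default=32)
    parser.add_argument("--fine_quant", default="OPQ96")
    return parser


parser = build_parser()

# With both positionals present, parsing succeeds.
args = parser.parse_args(["/save/dump", "all", "--num_clusters", "32"])
print(args.dump_dir, args.stage, args.num_clusters)

# Dropping "all" leaves `stage` unfilled, so argparse exits with a usage error.
try:
    parser.parse_args(["/save/dump", "--num_clusters", "32"])
    missing_stage_ok = True
except SystemExit:
    missing_stage_ok = False
print("parse without stage succeeded:", missing_stage_ok)
```

Under these assumptions, passing the path through a --dump_dir flag would not help either, because the parser would still expect two positional values.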

Where is the code that scores and ranks phrases for a query?

Nice to meet you. I am a newcomer to NLP and am very interested in your code, but I am not good at coding. I have spent a week trying to find the code that computes the score between the query embedding and the phrase embeddings. I have found many scores but do not know which one I need.
I would appreciate it if you could answer this question. Thanks for your time!
eval_phrase_retrieval.zip
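
As a rough illustration of what such a score is: in the edited demo code further down this page, batch_search concatenates the query's start and end vectors into one vector before calling self.mips.search. Assuming a phrase is likewise represented by a start vector and an end vector and scored by inner product, the score reduces to the sum of two dot products. The function names below are hypothetical and the vectors are toy values.

```python
def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def phrase_score(q_start, q_end, p_start, p_end):
    """Score of one phrase: start-to-start plus end-to-end inner product."""
    return dot(q_start, p_start) + dot(q_end, p_end)


def rank_phrases(q_start, q_end, phrases, top_k=2):
    """Return the top_k (score, text) pairs, highest score first."""
    scored = [(phrase_score(q_start, q_end, p["start"], p["end"]), p["text"])
              for p in phrases]
    return sorted(scored, reverse=True)[:top_k]


q_start, q_end = [1.0, 0.0], [0.0, 1.0]
phrases = [
    {"text": "A", "start": [0.75, 0.25], "end": [0.25, 0.75]},
    {"text": "B", "start": [0.25, 0.75], "end": [0.75, 0.25]},
    {"text": "C", "start": [0.5, 0.5], "end": [0.5, 0.5]},
]
ranked = rank_phrases(q_start, q_end, phrases)
print(ranked)

# Same value as one inner product over the concatenated [start; end] vectors.
concat_score = dot(q_start + q_end, phrases[0]["start"] + phrases[0]["end"])
print(concat_score)
```

In a real index the ranking is done by approximate nearest-neighbour search rather than by scoring every phrase, so this only shows what quantity is being maximized.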

how to evaluate model on SQuAD (non openQA settings)

I noticed that the EM score of DensePhrases on SQuAD is 78.3. I want to reproduce this experiment, but I cannot find a script for it.
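
For reference, EM and F1 on SQuAD are usually computed as in the official SQuAD v1.1 evaluation script: answers are lowercased, stripped of punctuation and articles, and then compared as strings (EM) or as bags of tokens (F1), taking the best score over all ground-truth answers. The sketch below follows that logic; it is not taken from this repository, whose evaluation code may differ in details.

```python
import collections
import re
import string


def normalize_answer(text):
    """Lowercase, drop punctuation and articles, and collapse whitespace."""
    text = "".join(ch for ch in text.lower() if ch not in set(string.punctuation))
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def exact_match(prediction, ground_truth):
    return normalize_answer(prediction) == normalize_answer(ground_truth)


def f1_score(prediction, ground_truth):
    pred_tokens = normalize_answer(prediction).split()
    gold_tokens = normalize_answer(ground_truth).split()
    common = collections.Counter(pred_tokens) & collections.Counter(gold_tokens)
    num_same = sum(common.values())
    if num_same == 0:
        return 0.0
    precision = num_same / len(pred_tokens)
    recall = num_same / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)


def best_over_ground_truths(metric_fn, prediction, ground_truths):
    """Score a prediction against every reference answer and keep the best."""
    return max(metric_fn(prediction, gt) for gt in ground_truths)


print(best_over_ground_truths(exact_match, "The Denver Broncos", ["Denver Broncos"]))
print(best_over_ground_truths(f1_score, "Levi's Stadium in Santa Clara", ["Levi's Stadium"]))
```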

editing the demo file

Hello,

Thank you for sharing the code. I have a question, because the results on NQ-open and SQuAD are wrong when I test them.

I changed the code of run_demo.py to evaluate on the test dataset without using a server, and ran it with this command line:

python run_demo_edit.py --run_mode eval_request --index_port 51997 --test_path $DATA_DIR/open-qa/squad/test_preprocessed.json --eval_batch_size 64 --save_pred --truecase --cuda --index_name start/1048576_flat_OPQ96 --dump_dir $SAVE_DIR/densephrases-multi_wiki-20181220/dump/

The code runs fine, but the results are very bad, and I don't know why.

03/07/2023 10:46:24 - INFO - eval_phrase_retrieval -   EM: 0.22, F1: 0.45
03/07/2023 10:46:24 - INFO - eval_phrase_retrieval -   1) Who got the first Nobel Prize in physics
03/07/2023 10:46:24 - INFO - eval_phrase_retrieval -   => groundtruths: ['Wilhelm Conrad Röntgen'], top 5 prediction: ['G', 'He', 'Chad', 'W', 'Davis']
03/07/2023 10:46:24 - INFO - eval_phrase_retrieval -   2) When is the next Deadpool movie being released
03/07/2023 10:46:24 - INFO - eval_phrase_retrieval -   => groundtruths: ['May 18 , 2018'], top 5 prediction: ['214', '[', '2099', 'Pet', '2010']
03/07/2023 10:46:24 - INFO - eval_phrase_retrieval -   3) Which mode is used for short wave broadcast service
03/07/2023 10:46:24 - INFO - eval_phrase_retrieval -   => groundtruths: ['MFSK', 'Olivia'], top 5 prediction: ['98', 'O', '136', "Schöbel's", '990",']
03/07/2023 10:46:26 - INFO - eval_phrase_retrieval -   {'exact_match_top1': 0.22160664819944598, 'f1_score_top1': 0.4542936288088642}
03/07/2023 10:46:26 - INFO - eval_phrase_retrieval -   {'exact_match_top10': 0.9418282548476454, 'f1_score_top10': 2.4090489381348115}
03/07/2023 10:46:26 - INFO - eval_phrase_retrieval -   {'redundancy of top10': 1.4556786703601108}

I also noticed that each prediction is only one word. Here are some examples of predictions and true labels:

Bond ['lament on various worldwide problems']
mon ['peptide bond']
1943 ['1952']
Shahi ['Akbar the Great', 'Babur']

Sometimes it returns a single character.

Here is my edited code:


import os
os.environ["CUDA_VISIBLE_DEVICES"] = "0,1"
import json
import argparse
import torch
import random
import numpy as np
import requests
import logging

from tqdm import tqdm
from time import time
from flask import Flask, request, jsonify, render_template, redirect
from flask_cors import CORS
from tornado.wsgi import WSGIContainer
from tornado.httpserver import HTTPServer
from tornado.ioloop import IOLoop
from requests_futures.sessions import FuturesSession

from eval_phrase_retrieval import evaluate_results, evaluate_results_kilt
from densephrases.utils.single_utils import load_encoder
from densephrases.utils.data_utils import TrueCaser
from densephrases.utils.open_utils import load_phrase_index, load_cross_encoder, load_qa_pairs, get_query2vec
from densephrases import Options
from concurrent.futures import Future

logging.basicConfig(format='%(asctime)s - %(levelname)s - %(name)s -   %(message)s', datefmt='%m/%d/%Y %H:%M:%S',
                    level=logging.INFO)
logger = logging.getLogger(__name__)


class DensePhrasesDemo(object):
    def __init__(self, args, inmemory=False, batch_size=64, query_encoder=None, tokenizer=None):
        self.args = args
        self.base_ip = args.base_ip
        self.query_port = args.query_port
        self.index_port = args.index_port
        self.truecase = TrueCaser(os.path.join(os.environ['DATA_DIR'], args.truecase_path))
        self.args.examples_path = os.path.join('densephrases/demo/static', args.examples_path)
        self.mips = load_phrase_index(args)
        self.device = 'cuda' if args.cuda else 'cpu'
        if query_encoder is None:
            self.query_encoder, self.tokenizer, _ = load_encoder(self.device, args, query_only=True)
        self.query2vec = get_query2vec(
            query_encoder=self.query_encoder, tokenizer=self.tokenizer, args=args, batch_size=batch_size
        )
   

    def query2vec_api(self,x):
            batch_query = json.loads(x['query'])
            #print(batch_query)
            # start_time = time()
            outs = list(self.query2vec(batch_query))
            #print(outs)
            # logger.info(f'query2vec {time()-start_time:.3f} for {len(batch_query)} queries: {batch_query[0]}')
            #outs = json.dumps(outs)
            return outs #json.loads(outs)#jsonify(outs)

    


        
    
    def api(self, x):
        # Single-query entry point; mirrors the former /api route without Flask.
        query = x['query']
        query = query[:-1] if query.endswith('?') else query
        if self.args.truecase:
            if query[1:].lower() == query[1:]:
                query = self.truecase.get_true_case(query)
        out = self.batch_search(
            [query],
            max_answer_length=self.args.max_answer_length,
            top_k=self.args.top_k,
            nprobe=self.args.nprobe,
        )
        out['ret'] = out['ret'][0]
        return out


    def get_examples(self):
        # Load the example queries, one per line.
        with open(self.args.examples_path, 'r') as fp:
            examples = [line.strip() for line in fp.readlines()]
        return examples


    
    def batch_search(self, batch_query, max_answer_length=20, top_k=10,
                         nprobe=64, return_vecs=False):
            t0 = time()
            outs, _ = self.embed_query(batch_query)
            start = np.concatenate([out[0] for out in outs], 0)
            end = np.concatenate([out[1] for out in outs], 0)
            query_vec = np.concatenate([start, end], 1)

            rets = self.mips.search(
                query_vec, q_texts=batch_query, nprobe=nprobe,
                top_k=top_k, max_answer_length=max_answer_length,
                return_vecs=return_vecs, aggregate=True,
            )
            for ret_idx, ret in enumerate(rets):
                for rr in ret:
                    rr['query_tokens'] = outs[ret_idx][2]
            t1 = time()
            out = {'ret': rets, 'time': int(1000 * (t1 - t0))}
            return out
    def batch_api(self, post_data):
            batch_query = json.loads(post_data['query'])
            max_answer_length = int(post_data['max_answer_length'])
            top_k = int(post_data['top_k'])
            nprobe = int(post_data['nprobe'])
            out = self.batch_search(
                batch_query,
                max_answer_length=max_answer_length,
                top_k=top_k,
                nprobe=nprobe,
            )
            return out #json.loads(json.dumps(out)) #jsonify(out)
    def get_address(self, port):
        assert self.base_ip is not None and len(port) > 0
        return self.base_ip + ':' + port

    """def embed_query(self, batch_query):
        #emb_session = FuturesSession()
        data={'query': json.dumps(batch_query)}
        r = self.query2vec_api(data) #emb_session.post(self.get_address(self.query_port) + '/query2vec_api',data={'query': json.dumps(batch_query)})
        print(type(r))
        def map_():
            result = r.result()
            emb = result.json()
            return emb, result.elapsed.total_seconds() * 1000
        return map_"""
    def embed_query(self, batch_query):
        data = {'query': json.dumps(batch_query)}
        r = self.query2vec_api(data)

        return r,0


    def query(self, query):
        params = {'query': query}
        res = requests.get(self.get_address(self.index_port) + '/api', params=params)

        outs = None  # returned as-is if the response body cannot be parsed
        try:
            outs = json.loads(res.text)
        except Exception as e:
            logger.info(f'no response or error for q {query}')
            logger.info(res.text)
        return outs

    def batch_query(self, batch_query, batch_context=None, max_answer_length=20, top_k=10, nprobe=64):
        post_data = {
            'query': json.dumps(batch_query),
            'context': json.dumps(batch_context) if batch_context is not None else json.dumps(batch_query),
            'max_answer_length': max_answer_length,
            'top_k': top_k,
            'nprobe': nprobe,
        }
        # batch_api is called directly (no HTTP round trip) and already returns a dict,
        # so there is no status code to check and nothing to decode with json.loads.
        res = self.batch_api(post_data)
        return res

    def eval_request(self, args):
        # Load dataset
        qids, questions, answers, _ = load_qa_pairs(args.test_path, args)

        # Run batch_query and evaluate
        step = args.eval_batch_size
        predictions = []
        evidences = []
        titles = []
        scores = []
        start_time = None
        num_q = 0
        for q_idx in tqdm(range(0, len(questions), step)):
            if q_idx >= 5*step: # exclude warmup
                if start_time is None:
                    start_time = time()
                num_q += len(questions[q_idx:q_idx+step])
            result = self.batch_query(
                questions[q_idx:q_idx+step],
                max_answer_length=args.max_answer_length,
                top_k=args.top_k,
                nprobe=args.nprobe,
            )
            prediction = [[ret['answer'] for ret in out] if len(out) > 0 else [''] for out in result['ret']]
            evidence = [[ret['context'] for ret in out] if len(out) > 0 else [''] for out in result['ret']]
            title = [[ret['title'] for ret in out] if len(out) > 0 else [''] for out in result['ret']]
            score = [[ret['score'] for ret in out] if len(out) > 0 else [-1e10] for out in result['ret']]
            predictions += prediction
            evidences += evidence
            titles += title
            scores += score
        logger.info(f'{time()-start_time:.3f} sec for {num_q} questions => {num_q/(time()-start_time):.1f} Q/Sec')
        eval_fn = evaluate_results if not args.is_kilt else evaluate_results_kilt
        eval_fn(
            predictions, qids, questions, answers, args, evidences=evidences, scores=scores, titles=titles,
        )


if __name__ == '__main__':
    # See options in densephrases.options
    options = Options()
    options.add_model_options()
    options.add_index_options()
    options.add_retrieval_options()
    options.add_data_options()
    options.add_demo_options()
    args = options.parse()

    # Seed for reproducibility
    random.seed(args.seed)
    np.random.seed(args.seed)
    torch.manual_seed(args.seed)
    if torch.cuda.is_available():
        torch.cuda.manual_seed_all(args.seed)

    server = DensePhrasesDemo(args)

    if args.run_mode == 'q_serve':
        logger.info(f'Query address: {server.get_address(server.query_port)}')
        server.serve_query_encoder(args.query_port, args)

    elif args.run_mode == 'p_serve':
        logger.info(f'Index address: {server.get_address(server.index_port)}')
        server.serve_phrase_index(args.index_port, args)

    elif args.run_mode == 'query':
        query = 'Name three famous writers'
        result = server.query(query)
        logger.info(f'Answers to a question: {query}')
        logger.info(f'{[r["answer"] for r in result["ret"]]}')

    elif args.run_mode == 'batch_query':
        queries = [
            'Where',
            'when did medicare begin in the united states',
            'who sings don\'t stand so close to me',
            'Name three famous writers',
            'Who was defeated by computer in chess game?'
        ]
        contexts = [
            'Uncle jesse\'s original full name was James Lee. And he was born in South Korea.',
            'Uncle jesse\'s original name was James Lee. And he was born in South Korea in 1333. US medicare started in 1222.',
            'Uncle jesse\'s original name was James Lee. And he sang this song.',
            'Uncle jesse\'s original name was James Lee. And Jens wrote this novel.',
            'Uncle jesse\'s original name was James Lee. The man was defeated by Alphago.'
        ]
        result = server.batch_query(
            queries,
            contexts, # feed context for cross encoders
            max_answer_length=args.max_answer_length,
            top_k=args.top_k,
            nprobe=args.nprobe,
        )
        for query, re in zip(queries, result['ret'].values()):
            logger.info(f'Answers to a question: {query}')
            logger.info(f'{re}')

    elif args.run_mode == 'eval_request':
        print("sadadadada")
        server.eval_request(args)

    else:
        raise NotImplementedError

Modifying num_clusters in index-vecs

I tried to run index-vecs using a custom wikidump, dataset, and model, but got this error:

Changing the num_clusters flag to 96 doesn't seem to help; the k in the error message is still 256.
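
The error text itself is not shown above, so the following is only a guess. In faiss, a product quantizer with 8-bit codes trains 2^8 = 256 centroids per sub-quantizer, and that count does not depend on the number of coarse clusters. If the k in the message comes from there, changing num_clusters would leave it untouched, and the remedy would be to provide more training vectors. The sketch computes the relevant thresholds; 39 is assumed to be faiss's default minimum number of points per centroid, and both function names are hypothetical.

```python
def pq_training_requirements(nbits=8, min_points_per_centroid=39):
    """Centroid count and training-set thresholds for one PQ sub-quantizer."""
    k = 2 ** nbits
    return {
        "k": k,
        "hard_minimum": k,  # fewer training points than centroids is an error
        "recommended_minimum": k * min_points_per_centroid,  # below this, a warning
    }


def check_training_set(num_training_vectors, nbits=8):
    """Classify a training-set size as 'error', 'warning', or 'ok'."""
    req = pq_training_requirements(nbits)
    if num_training_vectors < req["hard_minimum"]:
        return "error"
    if num_training_vectors < req["recommended_minimum"]:
        return "warning"
    return "ok"


print(pq_training_requirements())
for n in (100, 5000, 10000):
    print(n, check_training_set(n))
```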

Train custom teacher model

Last night, when I tried to train with a custom pre-trained model, I found out that DensePhrases also uses another pre-trained model as the cross-encoder teacher, which is SpanBERT by default ($TEACHER_MODEL).

When I tried to replace it with bert-base-cased, training failed to run. Is it possible to change the teacher model? If so, which models from Hugging Face can I use as the teacher?

Question about reproducing RC-SQD results

Hi, thanks a lot for your open-source work.
When I ran your code on the SQuAD dataset with single-passage training, I got 77.3 EM and 85.7 F1. I used this command:
python train_rc.py --model_type bert --pretrained_name_or_path SpanBERT/spanbert-base-cased --data_dir densephrases/densephrases-data/single-qa --cache_dir densephrases/cache --train_file squad/train-v1.1_qg_ents_t5large_3500_filtered.json --predict_file squad/dev-v1.1.json --do_train --do_eval --per_gpu_train_batch_size 24 --learning_rate 3e-5 --num_train_epochs 3.0 --max_seq_length 384 --seed 42 --fp16 --fp16_opt_level O1 --lambda_kl 4.0 --lambda_neg 2.0 --lambda_flt 1.0 --filter_threshold -2.0 --append_title --evaluate_during_training --overwrite_output_dir --teacher_dir densephrases/outputs/spanbert-base-cased-squad
I also trained this model for another 2 epochs, as in your Makefile, using pre-batch negatives and train-v1.1.json (the real SQuAD data), but the results are still below those in the paper.
(1) Should I use different hyperparameters? I found that your paper uses different parameters from your script, such as the batch size (84 vs. 24) and the lambda weights.
(2) Are the results in the paper averaged over random seeds?
(3) Do you use the whole NQ and SQuAD datasets to train the model?
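
On the batch-size point, the total batch size that train_rc.py logs is the per-GPU batch size multiplied by the number of GPUs and the gradient accumulation steps. A training log earlier on this page reports 48 per GPU and a total of 384 with accumulation 1, which implies 8 GPUs. The sketch below does that arithmetic. The step count assumes the last partial batch is kept, and the 4-GPU case is hypothetical, since this issue does not say how many GPUs were used.

```python
import math


def total_train_batch_size(per_gpu_batch_size, num_gpus, grad_accum_steps=1):
    """Effective batch size per optimizer step."""
    return per_gpu_batch_size * num_gpus * grad_accum_steps


def optimization_steps(num_examples, total_batch_size, num_epochs):
    """Optimizer steps over all epochs, keeping the last partial batch."""
    return math.ceil(num_examples / total_batch_size) * num_epochs


# Numbers from the training log earlier on this page.
print(total_train_batch_size(48, 8))      # total batch size reported there: 384
print(optimization_steps(1218, 384, 2))   # total optimization steps reported there: 8

# Hypothetical: --per_gpu_train_batch_size 24 on 4 GPUs.
print(total_train_batch_size(24, 4))
```

A per-GPU value of 24 therefore corresponds to very different effective batch sizes depending on the GPU count, which is worth checking before comparing against the paper's setting.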

Reproduction of DensePhrase (w/ PQ, w/o qft) on SQuAD

I've built the compressed DensePhrase index on SQuAD using OPQ96. I haven't run any query-side finetuning yet but here are the results:

11/22/2021 19:50:57 - INFO - main - no_ans/all: 0, 10570
11/22/2021 19:50:57 - INFO - main - Evaluating 10570 answers
11/22/2021 19:50:58 - INFO - main - EM: 21.63, F1: 27.96
11/22/2021 19:50:58 - INFO - main - 1) Which NFL team represented the AFC at Super Bowl 50
11/22/2021 19:50:58 - INFO - main - => groundtruths: ['Denver Broncos', 'Denver Broncos', 'Denver Broncos'], top 5 prediction: ['Denver Broncos', 'Pittsburgh Steelers', 'Pittsburgh Steelers', 'Pittsburgh Steelers', 'Pittsburgh Steelers']
11/22/2021 19:50:58 - INFO - main - 2) Which NFL team represented the NFC at Super Bowl 50
11/22/2021 19:50:58 - INFO - main - => groundtruths: ['Carolina Panthers', 'Carolina Panthers', 'Carolina Panthers'], top 5 prediction: ['San Francisco 49ers', 'Chicago Bears', 'Seattle Seahawks', 'Tampa Bay Buccaneers', 'Green Bay Packers']
11/22/2021 19:50:58 - INFO - main - 3) Where did Super Bowl 50 take place
11/22/2021 19:50:58 - INFO - main - => groundtruths: ['Santa Clara, California', "Levi's Stadium", "Levi's Stadium in the San Francisco Bay Area at Santa Clara, California."], top 5 prediction: ['Tacoma, Washington, USA', "Levi's Stadium in Santa Clara, California", 'DeVault Vineyards in Concord, Virginia', "Levi's Stadium in Santa Clara", 'Jinan Olympic Sports Center Gymnasium in Jinan, China']
11/22/2021 19:53:44 - INFO - main - {'exact_match_top1': 21.62724692526017, 'f1_score_top1': 27.958255585698414}
11/22/2021 19:53:44 - INFO - main - {'exact_match_top200': 57.48344370860927, 'f1_score_top200': 73.28679644685603}
11/22/2021 19:53:44 - INFO - main - {'redundancy of top200': 5.308987701040681}
11/22/2021 19:53:44 - INFO - main - Saving prediction file to .//outputs/densephrases-squad-ddp/pred/test_preprocessed_10570_top200.pred
10570it [00:23, 448.84it/s]
11/22/2021 19:54:58 - INFO - main - avg psg len=124.84 for 10570 preds
11/22/2021 19:54:58 - INFO - main - dump to .//outputs/densephrases-squad-ddp/pred/test_preprocessed_10570_top200_psg-top100.json
ctx token length: 124.84
unique titles: 98.20

Top-1 = 27.02%
Top-5 = 42.80%
Top-20 = 56.40%
Top-100 = 69.20%
Acc@1 when Acc@100 = 39.05%
MRR@20 = 34.30
P@20 = 8.94

I understand that index compression results in accuracy loss w/o query-side finetuning. However, the score still looks a little bit too low to me. Could @jhyuklee confirm whether this looks alright?
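
For readers comparing these numbers, Top-k accuracy and MRR@k are standard ranking metrics. The sketch below computes them under the assumption that each question comes with a ranked list of booleans marking which retrieved items contain an answer. The exact matching rule used by the evaluation script is not shown here, and the data is a toy example.

```python
def top_k_accuracy(hits_per_question, k):
    """Fraction of questions with at least one hit among the first k results."""
    return sum(any(hits[:k]) for hits in hits_per_question) / len(hits_per_question)


def mrr_at_k(hits_per_question, k):
    """Mean reciprocal rank of the first hit within the top k (0 if there is none)."""
    total = 0.0
    for hits in hits_per_question:
        for rank, hit in enumerate(hits[:k], start=1):
            if hit:
                total += 1.0 / rank
                break
    return total / len(hits_per_question)


# Toy data: four questions, three ranked results each.
toy_hits = [
    [True, False, False],
    [False, True, False],
    [False, False, False],
    [False, False, True],
]
for k in (1, 2, 3):
    print(f"Top-{k} = {top_k_accuracy(toy_hits, k):.2%}")
print(f"MRR@3 = {mrr_at_k(toy_hits, 3):.4f}")
```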

Unable to find phrase folder in wikidump

Hi, first of all, many thanks for your work.

I am trying to test this.
As per the documentation, I downloaded all 4 tar files (datasets, Wikipedia dump, pre-trained models, and phrase index), but I get the error below when running:

It seems to be looking for a phrase folder in the wikidump, which does not exist.

Can you suggest the reason for this?

I have given the correct paths for all folders.