Harvesting and Refining Question-Answer Pairs for Unsupervised QA
This repo contains the data, code, and models for the ACL 2020 paper "Harvesting and Refining Question-Answer Pairs for Unsupervised QA".
In this work, we introduce two approaches to improve unsupervised QA. First, we harvest lexically and syntactically divergent questions from Wikipedia to automatically construct a corpus of question-answer pairs (named RefQA). Second, we take advantage of the QA model to extract more appropriate answers, iteratively refining the data in RefQA. We conduct experiments on SQuAD 1.1 and NewsQA by fine-tuning BERT without access to manually annotated data. Our approach outperforms previous unsupervised approaches by a large margin and is competitive with early supervised models.
Environment
With Docker
The recommended way to run the code is using Docker under Linux. The Dockerfile is in uqa/docker/Dockerfile.
With Pip
First, install PyTorch 1.1.0; please refer to the PyTorch installation page.
Then clone this repo and install the dependencies by running uqa/scripts/install_tools.sh:
git clone -q https://github.com/NVIDIA/apex.git
cd apex ; git reset --hard 1603407bf49c7fc3da74fceb6a6c7b47fece2ef8
python3 setup.py install --user --cuda_ext --cpp_ext
pip install --user cython tensorboardX six numpy tqdm path.py pandas scikit-learn lmdb pyarrow py-lz4framed methodtools py-rouge pyrouge nltk
python3 -c "import nltk; nltk.download('punkt')"
pip install -e git+https://github.com/Maluuba/nlg-eval.git#egg=nlg-eval
pip install --user spacy==2.2.0 pytorch-transformers==1.2.0 tensorflow-gpu==1.13.1
python3 -m spacy download en
pip install --user benepar[gpu]
The mixed-precision training code requires the specific NVIDIA/apex commit checked out above, which only supports PyTorch < 1.2.0.
Data and Models
The format of our generated data is SQuAD-like. The data can be downloaded from here.
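As a rough illustration of what "SQuAD-like" means in practice, the sketch below walks a file laid out like SQuAD 1.1 and yields its question-answer pairs. It assumes the standard SQuAD 1.1 keys (data, paragraphs, context, qas, question, answers, text, answer_start); the helper name iter_qa_pairs and the toy record are hypothetical and not part of this repo.

```python
from typing import Iterator, Tuple


def iter_qa_pairs(squad_like: dict) -> Iterator[Tuple[str, str, str, int]]:
    """Yield (context, question, answer_text, answer_start) tuples.

    Assumes the SQuAD 1.1 layout: data -> paragraphs -> qas -> answers.
    """
    for article in squad_like["data"]:
        for paragraph in article["paragraphs"]:
            context = paragraph["context"]
            for qa in paragraph["qas"]:
                for answer in qa["answers"]:
                    yield context, qa["question"], answer["text"], answer["answer_start"]


# Toy in-memory record (hypothetical content, for illustration only).
toy_context = "RefQA is built from Wikipedia statements and their cited documents."
toy_answer = "Wikipedia statements"
toy = {
    "version": "1.1",
    "data": [{
        "title": "toy",
        "paragraphs": [{
            "context": toy_context,
            "qas": [{
                "id": "toy-0",
                "question": "What is RefQA built from?",
                "answers": [{
                    "text": toy_answer,
                    # In SQuAD 1.1, answer_start is a character offset into the context.
                    "answer_start": toy_context.index(toy_answer),
                }],
            }],
        }],
    }],
}

for context, question, text, start in iter_qa_pairs(toy):
    # Sanity check: the offset must point at the answer text.
    assert context[start:start + len(text)] == text
    print(question, "->", text)
```

To inspect a downloaded file instead of the toy record, load it with json.load and pass the resulting dict to the same function.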
The links to the trained models:
- refqa-main: the model trained on 300k RefQA examples;
- refqa-refine: the model trained with our refining process.
Constructing RefQA
In our released data, the wikiref.json file (our raw data) contains the Wikipedia statements and the corresponding cited documents (the summary and document keys of each item).
You can convert the raw data to RefQA with the following script:
export REFQA_DATA_DIR=/{path_to_refqa_data}/
python3 wikiref_process.py \
--input_file wikiref.json \
--output_file cloze_clause_wikiref_data.json
python3 cloze2natural.py \
--input_file cloze_clause_wikiref_data.json \
--output_file refqa.json
Note: Please make sure that the file wikiref.json is in the directory $REFQA_DATA_DIR.
Then, for the refining process described below, split your generated data into several parts: a main part for training the initial QA model, and the remaining parts for the refining process.
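The split itself is not scripted in this README, so here is a minimal sketch of one way to do it. It assumes the generated data follows the SQuAD 1.1 layout and splits at the paragraph level (so sizes are counted in paragraphs, not QA pairs). The function names, the part count, and the refine file names mentioned afterwards are hypothetical; only uqa_train_main.json appears in the training command of this README.

```python
import random
from typing import List, Tuple


def flatten_paragraphs(squad_like: dict) -> List[dict]:
    """Collect every paragraph entry from a SQuAD-like dict."""
    return [p for article in squad_like["data"] for p in article["paragraphs"]]


def split_paragraphs(paragraphs: List[dict], main_size: int,
                     num_refine_parts: int, seed: int = 42
                     ) -> Tuple[List[dict], List[List[dict]]]:
    """Shuffle, take `main_size` paragraphs as the main part, and deal the
    rest round-robin into `num_refine_parts` parts."""
    shuffled = list(paragraphs)
    random.Random(seed).shuffle(shuffled)
    main, rest = shuffled[:main_size], shuffled[main_size:]
    parts = [rest[i::num_refine_parts] for i in range(num_refine_parts)]
    return main, parts


def to_squad_like(paragraphs: List[dict], title: str) -> dict:
    """Wrap a list of paragraphs back into a SQuAD-like dict."""
    return {"version": "1.1", "data": [{"title": title, "paragraphs": paragraphs}]}


# Toy data: 10 paragraphs with empty QA lists (illustration only).
toy_paragraphs = [{"context": f"context {i}", "qas": []} for i in range(10)]
main_part, refine_parts = split_paragraphs(toy_paragraphs, main_size=4, num_refine_parts=3)
print(len(main_part), [len(part) for part in refine_parts])
```

Each part can then be written with json.dump(to_squad_like(...), f), for example the main part to $REFQA_DATA_DIR/uqa_train_main.json. How multi_turn.py expects the refine parts to be named is not stated here, so check the script before choosing file names.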
Training and Refining
Before training on RefQA, download or move the data and the SQuAD 1.1 dev file dev-v1.1.json into the directory $REFQA_DATA_DIR.
We train our QA model using distributed and mixed-precision training on 4 P100 GPUs.
Training the initial QA model
You can fine-tune BERT-Large (WWM) on 300k RefQA examples and achieve an F1 score above 65 on the SQuAD 1.1 dev set.
export REFQA_DATA_DIR=/{path_to_refqa_data}/
export OUTPUT_DIR=/{path_to_main_output}/
export CUDA_VISIBLE_DEVICES=0,1,2,3
python3 -m torch.distributed.launch --nproc_per_node=4 run_squad.py \
--model_type bert \
--model_name_or_path bert-large-uncased-whole-word-masking \
--do_train \
--do_eval \
--do_lower_case \
--train_file $REFQA_DATA_DIR/uqa_train_main.json \
--predict_file $REFQA_DATA_DIR/dev-v1.1.json \
--learning_rate 3e-5 \
--num_train_epochs 2 \
--max_seq_length 384 \
--doc_stride 128 \
--output_dir $OUTPUT_DIR \
--per_gpu_train_batch_size 6 \
--per_gpu_eval_batch_size 4 \
--seed 42 \
--fp16 \
--overwrite_output_dir \
--logging_steps 1000 \
--save_steps 1000
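For readers unfamiliar with the F1 figure quoted for this run, the sketch below shows the token-overlap F1 used for SQuAD 1.1, following the logic of the official evaluation script (lowercase, strip punctuation and articles, then compare bags of tokens). It is an illustrative reimplementation, not the evaluation code shipped in this repo; the dataset-level score takes the maximum over the reference answers for each question and averages over questions.

```python
import re
import string
from collections import Counter


def normalize_answer(text: str) -> str:
    """Lowercase, drop punctuation and English articles, collapse whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in set(string.punctuation))
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def f1_score(prediction: str, ground_truth: str) -> float:
    """Token-level F1 between one prediction and one reference answer."""
    pred_tokens = normalize_answer(prediction).split()
    gold_tokens = normalize_answer(ground_truth).split()
    # Multiset intersection counts each shared token at most min(count) times.
    num_same = sum((Counter(pred_tokens) & Counter(gold_tokens)).values())
    if num_same == 0:
        return 0.0
    precision = num_same / len(pred_tokens)
    recall = num_same / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)


print(f1_score("the Eiffel Tower", "Eiffel Tower"))       # exact match after normalization
print(f1_score("Eiffel Tower in Paris", "Eiffel Tower"))  # partial overlap
```

In the second call the prediction has four tokens and the reference two, with two in common, so precision is 0.5, recall is 1.0, and F1 is 2/3.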
Refining RefQA data iteratively
We provide a fine-tuned checkpoint (which can be downloaded from here) for use in the refining process. The refining process is run as follows:
export REFQA_DATA_DIR=/{path_to_refqa_data}/
export MAIN_MODEL_DIR=/{path_to_previous_fine-tuned_model}/
export OUTPUT_DIR=/{path_to_refine_output}/
export CUDA_VISIBLE_DEVICES=0,1,2,3
python3 multi_turn.py \
--refine_data_dir $REFQA_DATA_DIR \
--output_dir $OUTPUT_DIR \
--model_dir $MAIN_MODEL_DIR \
--predict_file $REFQA_DATA_DIR/dev-v1.1.json \
--generate_method 2 \
--score_threshold 0.15 \
--threshold_rate 0.9 \
--seed 17 \
--fp16
The multi_turn.py script accepts the following command-line arguments:
required arguments:
--refine_data_dir The directory of RefQA data for refining
--model_dir The directory of the init checkpoint
--output_dir The output directory
--predict_file SQuAD or other json for predictions. E.g., dev-v1.1.json
optional arguments:
--generate_method {1|2} The method for generating data for the next round of training:
1 uses refined data only; 2 merges refined data with filtered data (1:1 ratio)
--score_threshold The threshold for filtering predicted answers
--threshold_rate The decay factor for the above threshold
--seed Random seed for initialization
--fp16 Whether to use 16-bit (mixed) precision (through NVIDIA apex)
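To make the interaction between --score_threshold and --threshold_rate concrete, here is a small sketch under two assumptions that are not confirmed by this README: the threshold is multiplied by the decay factor once per refining round, and a predicted answer is kept when its score is at least the current threshold. The function names and the toy predictions are hypothetical; multi_turn.py is the authority on the actual schedule.

```python
from typing import Dict, List


def threshold_schedule(initial: float, rate: float, num_rounds: int) -> List[float]:
    """Assumed schedule: threshold for round k is initial * rate ** k."""
    return [initial * rate ** k for k in range(num_rounds)]


def filter_predictions(scores: Dict[str, float], threshold: float) -> Dict[str, float]:
    """Assumed rule: keep predictions whose score is at least the threshold."""
    return {qid: score for qid, score in scores.items() if score >= threshold}


# Values from the command above: --score_threshold 0.15 --threshold_rate 0.9
schedule = threshold_schedule(0.15, 0.9, num_rounds=4)

# Hypothetical confidence scores keyed by question id.
toy_scores = {"q1": 0.40, "q2": 0.14, "q3": 0.05}

for round_idx, threshold in enumerate(schedule):
    kept = filter_predictions(toy_scores, threshold)
    print(f"round {round_idx}: threshold={threshold:.4f}, kept={sorted(kept)}")
```

Under this reading, a decay factor below 1 makes the filter more permissive in later rounds, so a borderline prediction such as q2 is rejected at first and accepted once the threshold has dropped below its score.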
Citation
If you find this repo useful in your research, you can cite the following paper:
@misc{li2020refqa,
title={Harvesting and Refining Question-Answer Pairs for Unsupervised QA},
author={Zhongli Li and Wenhui Wang and Li Dong and Furu Wei and Ke Xu},
year={2020},
eprint={2005.02925},
archivePrefix={arXiv},
primaryClass={cs.CL}
}
Acknowledgment
Our code is based on pytorch-transformers 1.2.0. We thank the authors for their wonderful open-source efforts.