
unsuperviseddecomposition's Introduction


Original PyTorch implementation of "Unsupervised Question Decomposition for Question Answering" (EMNLP 2020).

TL;DR: We decompose hard (multi-hop) questions into several, easier (single-hop) questions using unsupervised learning. Our decompositions improve multi-hop QA on HotpotQA without requiring extra supervision to decompose questions.


XLM contains the code to train (Unsupervised) Seq2Seq models, based on the code from XLM. We made the following changes/additions:

  • Unsupervised stopping criterion
  • Tensorboard logging
  • Data preprocessing scripts
  • Minor bug fixes from original XLM code
  • When initializing a smaller Seq2Seq model with XLM_en pretrained weights, automatically initialize the encoder with the first XLM_en layer weights and the decoder with the remaining layer weights.
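
The last item above can be pictured with a short sketch. It assumes, for illustration only, a 12-layer pretrained model split into a 6-layer encoder and a 6-layer decoder (matching the `--n_layers` values used in the commands of this README). The function name `split_pretrained_layers` is hypothetical and is not part of the XLM code.

```python
from typing import List, Tuple


def split_pretrained_layers(n_pretrained: int, n_encoder: int) -> Tuple[List[int], List[int]]:
    """Return (encoder_layers, decoder_layers) as indices into the pretrained layer stack."""
    if not 0 < n_encoder < n_pretrained:
        raise ValueError("the encoder must take some, but not all, pretrained layers")
    layers = list(range(n_pretrained))
    # First layers go to the encoder, the remaining layers go to the decoder.
    return layers[:n_encoder], layers[n_encoder:]


encoder_layers, decoder_layers = split_pretrained_layers(n_pretrained=12, n_encoder=6)
print(encoder_layers)
print(decoder_layers)
```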

pytorch-transformers contains the code to train question answering models (single-hop and multi-hop), based on the code from transformers. We made the following additions:

  • Scripts/notebooks to preprocess data
  • Additions to evaluation to handle/evaluate on HotpotQA (i.e., extend single-paragraph SQuAD implementation to multi-paragraph setting)

Update (10/2020): Added additional data and resources:

  • Simple and multihop mined questions
  • Multihop QA model checkpoints
  • MLM pretraining data
  • Unsupervised MT training data


Create an anaconda3 environment (we used anaconda3 version 5.0.1):

conda create -y -n UnsupervisedDecomposition python=3.7
conda activate UnsupervisedDecomposition
# Install PyTorch 1.0. We used CUDA 10.0 (with NCCL/2.4.7-1); see the PyTorch installation instructions for other CUDA versions:
conda install -y pytorch=1.0 torchvision cudatoolkit=10.0 -c pytorch
conda install faiss-gpu cudatoolkit=10.0 -c pytorch # For CUDA 10.0
pip install -r requirements.txt
python -m spacy download en_core_web_lg  # Download Spacy model for NER

If your hardware supports half-precision (fp16), you can install NVIDIA apex to speed up QA model training. Also, set the MAIN_DIR variable to point to the main directory for this repo, e.g.:

export MAIN_DIR=/path/to/UnsupervisedDecomposition

Downloading and Preprocessing Data

Run once, to download/prepare the necessary files for decomposition and question answering training, e.g.:

bash --main_dir $MAIN_DIR

See below to train a decomposition model, or skip to "QA Model Training" to train a question answering model given our trained decomposition model (XLM/dumped/umt.dev1.pseudo_decomp.replace_entity_by_type/20639223/best-valid_mlm_ppl.pth). You can view our generations from the model in the downloaded files XLM/dumped/umt.dev1.pseudo_decomp.replace_entity_by_type/20639223/{train|valid}

Unsupervised Decomposition Training

Create pseudo-decomposition training data using FastText embeddings and entity replacement, e.g.:

bash --main_dir $MAIN_DIR

Then, train an Unsupervised Seq2Seq model as follows (initializing from our pre-trained MLM model):

# Set the following parameters based on your hardware
export NPROC_PER_NODE=8  # Use 1 for single-GPU training
export N_NODES=1  # Use >1 for multi-node training (where each node has NPROC_PER_NODE GPUs)
BS=32  # Make batch size smaller if GPU goes out-of-memory. Effective batch size is BS*NPROC_PER_NODE*N_NODES

# Select an MLM initialization checkpoint (for now, let's load the MLM we already pre-trained)

# Train USeq2Seq model
if [[ $NPROC_PER_NODE -gt 1 ]]; then DIST_OPTS="-m torch.distributed.launch --nproc_per_node=$NPROC_PER_NODE"; else DIST_OPTS=""; fi
NUM_TRAIN=`wc -l < data/umt/$DATA_FOLDER/processed/`
python $DIST_OPTS --exp_name umt.$DATA_FOLDER --data_path data/umt/$DATA_FOLDER/processed --dump_path ./dumped/ --reload_model "$MLM_INIT,$MLM_INIT" --encoder_only false --emb_dim 2048 --n_layers 6 --n_heads 16 --dropout 0.1 --attention_dropout 0.1 --gelu_activation true --use_lang_emb true --lgs 'mh-sh' --ae_steps 'mh,sh' --bt_steps 'mh-sh-mh,sh-mh-sh' --stopping_criterion 'valid_mh-sh-mh_mt_effective_goods_back_bleu,2' --validation_metrics 'valid_mh-sh-mh_mt_effective_goods_back_bleu' --eval_bleu true --epoch_size $((4*NUM_TRAIN/(NPROC_PER_NODE*N_NODES))) --lambda_ae '0:1,100000:0.1,300000:0' --optimizer 'adam_inverse_sqrt,beta1=0.9,beta2=0.98,lr=0.00003' --tokens_per_batch 1024 --batch_size $BS --word_shuffle 3 --word_dropout 0.1 --word_blank 0.1 --max_len 128 --bptt 128 --save_periodic 0 --split_data true --validation_weight 0.5
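
The `--lambda_ae '0:1,100000:0.1,300000:0'` value above is a schedule for the auto-encoding loss weight. The sketch below assumes the `step:value` pairs are interpolated linearly between breakpoints and held constant after the last one. That reading is an assumption stated for illustration, and the function names `parse_schedule` and `schedule_value` are hypothetical rather than taken from the repo.

```python
from typing import List, Tuple


def parse_schedule(spec: str) -> List[Tuple[int, float]]:
    """Parse 'step:value,step:value,...' into sorted (step, value) pairs."""
    points = []
    for item in spec.split(","):
        step, value = item.split(":")
        points.append((int(step), float(value)))
    return sorted(points)


def schedule_value(points: List[Tuple[int, float]], step: int) -> float:
    """Linearly interpolate between breakpoints; hold the last value afterwards."""
    for (x_a, y_a), (x_b, y_b) in zip(points, points[1:]):
        if x_a <= step < x_b:
            return y_a + (step - x_a) * (y_b - y_a) / (x_b - x_a)
    return points[-1][1]


points = parse_schedule("0:1,100000:0.1,300000:0")
for step in (0, 50000, 100000, 200000, 300000):
    print(step, round(schedule_value(points, step), 4))
```

Under this assumption, the weight falls from 1 to 0.1 over the first 100k steps and reaches 0 at step 300k.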

New: to train UMT with the same data we used, download our splits here

Seq2Seq Decomposition Training (Optional)

Alternatively, you can train a standard Seq2Seq model as follows:

export NPROC_PER_NODE=8  # Use 1 for single-GPU training
export N_NODES=1  # Use >1 for multi-node training (where each node has NPROC_PER_NODE GPUs)
BS=128  # Make batch size smaller if GPU goes out-of-memory. Effective batch size is BS*NPROC_PER_NODE*N_NODES

if [[ $NPROC_PER_NODE -gt 1 ]]; then DIST_OPTS="-m torch.distributed.launch --nproc_per_node=$NPROC_PER_NODE"; else DIST_OPTS=""; fi
mkdir -p $OUTPUT_DIR
python $DIST_OPTS --exp_name mt.$DATA_FOLDER --data_path $DATA_PATH --dump_path ./dumped/ --reload_model "$MLM_INIT,$MLM_INIT" --encoder_only false --emb_dim 2048 --n_layers 6 --n_heads 16 --dropout 0.1 --attention_dropout 0.1 --gelu_activation true --use_lang_emb true --lgs 'mh-sh' --mt_steps 'mh-sh,sh-mh' --stopping_criterion 'valid_mh-sh_mt_bleu,2' --validation_metrics 'valid_mh-sh_mt_bleu' --eval_bleu true --epoch_size $((2*NUM_TRAIN/(NPROC_PER_NODE*N_NODES))) --optimizer 'adam_inverse_sqrt,beta1=0.9,beta2=0.98,lr=0.0001' --tokens_per_batch 1024 --batch_size $BS --max_len 128 --bptt 128 --split_data true
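
The effective batch size noted in the comments and the `--epoch_size` arithmetic in the command above can be checked with a few lines. This is a sketch: `num_train=100000` is a made-up example value, and the helper names are hypothetical.

```python
def effective_batch_size(bs: int, nproc_per_node: int, n_nodes: int) -> int:
    """Effective batch size is BS * NPROC_PER_NODE * N_NODES."""
    return bs * nproc_per_node * n_nodes


def epoch_size(num_train: int, multiplier: int, nproc_per_node: int, n_nodes: int) -> int:
    """Mirror the shell expression $((multiplier*NUM_TRAIN/(NPROC_PER_NODE*N_NODES)))."""
    # Shell arithmetic is integer arithmetic, so use floor division here.
    return (multiplier * num_train) // (nproc_per_node * n_nodes)


print(effective_batch_size(bs=128, nproc_per_node=8, n_nodes=1))
print(epoch_size(num_train=100000, multiplier=2, nproc_per_node=8, n_nodes=1))
```

With a single GPU (`nproc_per_node=1`, `n_nodes=1`) the effective batch size equals `bs`, and the per-process epoch size grows accordingly.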

You can also use the trained Seq2Seq model checkpoint as the pre-trained initialization (MLM_INIT) for USeq2Seq training, as our Curriculum Seq2Seq approach does (see Appendix).

MLM Pre-training (Optional)

New: Download MLM pretraining data here

To pre-train your own MLM initialization (used as MLM_INIT), use the below commands:

# Set the following parameters based on your hardware
export NPROC_PER_NODE=8  # Use 1 for single-GPU training
export N_NODES=8  # Use >1 for multi-node training (where each node has NPROC_PER_NODE GPUs)

# Copy XLM's English pre-trained MLM weights, which we use to initialize our MLM training
mv mlm_en_2048.pth dumped/xlm_en/

# MLM pre-training (on same data as above)
if [[ $NPROC_PER_NODE -gt 1 ]]; then DIST_OPTS="-m torch.distributed.launch --nproc_per_node=$NPROC_PER_NODE"; else DIST_OPTS=""; fi
NUM_TRAIN=`wc -l < data/umt/$DATA_FOLDER/processed/`
# For fp16: Add "--fp16 true --amp 1" below
python $DIST_OPTS --exp_name mlm.$DATA_FOLDER --data_path data/umt/$DATA_FOLDER/processed --dump_path ./dumped/ --reload_model 'dumped/xlm_en/mlm_en_2048.pth' --emb_dim 2048 --n_layers 12 --n_heads 16 --dropout 0.1 --attention_dropout 0.1 --gelu_activation true --use_lang_emb true --lgs 'mh-sh' --clm_steps '' --mlm_steps 'mh,sh' --stopping_criterion '_valid_mlm_ppl,0' --validation_metrics '_valid_mlm_ppl' --epoch_size $EPOCH_SIZE --optimizer "adam_inverse_sqrt,lr=0.00003,beta1=0.9,beta2=0.98,weight_decay=0,warmup_updates=$((EPOCH_SIZE/EFFECTIVE_BS))" --batch_size $BS --max_len 128 --bptt 128 --accumulate_gradients 1 --word_pred 0.15 --sample_alpha 0
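
The `adam_inverse_sqrt` optimizer string above combines a peak learning rate (`lr`) with a warmup length (`warmup_updates`). The sketch below shows the commonly used inverse-square-root schedule: linear warmup, then decay proportional to 1/sqrt(step). Treat it as an assumption about the schedule's shape rather than a copy of the XLM code; `warmup_updates=4000` and `warmup_init_lr=1e-7` are example values chosen for illustration.

```python
import math


def inverse_sqrt_lr(step: int, peak_lr: float, warmup_updates: int,
                    warmup_init_lr: float = 1e-7) -> float:
    """Linear warmup to peak_lr, then decay proportional to 1/sqrt(step)."""
    if warmup_updates <= 0:
        raise ValueError("warmup_updates must be positive in this sketch")
    if step < warmup_updates:
        return warmup_init_lr + step * (peak_lr - warmup_init_lr) / warmup_updates
    return peak_lr * math.sqrt(warmup_updates / step)


for step in (0, 2000, 4000, 16000):
    print(step, inverse_sqrt_lr(step, peak_lr=3e-5, warmup_updates=4000))
```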

QA Model Training

With a trained decomposition model, we can generate decompositions for multi-hop questions (train and valid sets), and train a question answering model to use the decompositions (below we use our pre-trained decomposition model which you downloaded):

# Generate decompositions
# Point to model directory (change the final directory number/string/id below to match the directory string from the previous Unsupervised Seq2Seq training command)
MODEL_NO="$(echo $MODEL_DIR | rev | cut -d/ -f1 | rev)"
for SPLIT in valid train; do
    # Note: Decrease batch size below if GPU goes out of memory
    cat data/umt/all/processed/$ | python --exp_name translate --src_lang mh --tgt_lang sh --model_path $MODEL_DIR/best-valid_mh-sh-mh_mt_effective_goods_back_bleu.pth --output_path $MODEL_DIR/$ --batch_size 48 --beam_size $BEAM --length_penalty $LP --sample_temperature $ST
done

# Convert Sub-Qs to SQUAD format
cd $MAIN_DIR/pytorch-transformers
for SPLIT in valid train; do
    python --model_dir $MODEL_DIR --data_folder all --sample_temperature $ST --beam $BEAM --length_penalty $LP --seed $SEED --split $SPLIT --new_data_format
done

# Answer sub-Qs
for SPLIT in "dev" "train"; do
for NUM_PARAGRAPHS in 1 3; do
    # For fp16: Add "--fp16 --fp16_opt_level O2" below
    python examples/ --model_type roberta --model_name_or_path roberta-large --train_file $DATA_FOLDER/train.json --predict_file $DATA_FOLDER/$SPLIT.json --do_eval --do_lower_case --version_2_with_negative --output_dir checkpoint/roberta_large.hotpot_easy_and_squad.num_paragraphs=$NUM_PARAGRAPHS --per_gpu_train_batch_size 64 --per_gpu_eval_batch_size 32 --learning_rate 1.5e-5 --max_query_length 234 --max_seq_length 512 --doc_stride 50 --num_shards 1 --seed 0 --max_grad_norm inf --adam_epsilon 1e-6 --adam_beta_2 0.98 --weight_decay 0.01 --warmup_proportion 0.06 --num_train_epochs 2 --write_dir $DATA_FOLDER/$NUM_PARAGRAPHS --no_answer_file
done
done

# Ensemble sub-answer predictions
for SPLIT in "dev" "train"; do
    python --seeds_list 1 3 --no_answer_file --split $SPLIT --preds_file1 data/hotpot.umt.all.model=$$ST.beam=$BEAM.lp=$LP.seed=$SEED/{}/nbest_predictions_$SPLIT.json
done

# Add sub-questions and sub-answers to QA input
FLAGS="--atype sentence-1-center --subq_model roberta-large-np=1-3 --use_q --use_suba --use_subq"
python --subqs_dir data/hotpot.umt.all.model=$$ST.beam=$BEAM.lp=$LP.seed=$SEED --splits train dev --num_shards 1 --model_dir $MODEL_DIR --sample_temperature $ST --beam $BEAM --length_penalty $LP --seed $SEED --subsample_data --use_easy --use_squad $FLAGS

# Train QA model
export NGPU=8  # Set based on number of available GPUs
if [ $NGPU -gt 1 ]; then DIST_OPTS="-m torch.distributed.launch --nproc_per_node=$NGPU"; else DIST_OPTS=""; fi
if [ $NGPU -gt 1 ]; then EVAL_OPTS="--do_eval"; else EVAL_OPTS=""; fi
export MASTER_PORT=$(shuf -i 12001-19999 -n 1)
# For fp16: Add "--fp16 --fp16_opt_level O2" below
python $DIST_OPTS examples/ --model_type roberta --model_name_or_path roberta-large --train_file data/$TN/train.json --predict_file data/$TN/dev.json --do_train $EVAL_OPTS --do_lower_case --version_2_with_negative --output_dir $OUTPUT_DIR --per_gpu_train_batch_size $((64/NGPU)) --per_gpu_eval_batch_size 32 --learning_rate 1.5e-5 --master_port $MASTER_PORT --max_query_length 234 --max_seq_length 512 --doc_stride 50 --num_shards 1 --seed $RANDOM_SEED --max_grad_norm inf --adam_epsilon 1e-6 --adam_beta_2 0.98 --weight_decay 0.01 --warmup_proportion 0.06 --num_train_epochs 2 --overwrite_output_dir

New: our trained multihop model checkpoints are available here:

Creating Alternate Pseudo-Decompositions

We can also create pseudo-decompositions using other embedding methods aside from FastText, as described in the Appendix. To do so, use the functions in pytorch-transformers/pseudoalignment/pseudo_decomp_{paired_random|fasttext|tfidf|bert|variable}.py, e.g., by running:

# --split: decompose the HotpotQA training questions
# --min_q_len / --max_q_len: minimum / maximum length of short questions (tokens)
# --beam_size: subset of short questions to search exhaustively over for each complex question
# --data_folder: path to dump the results to
python pseudoalignment/ \
    --split train \
    --min_q_len 4 \
    --max_q_len 20 \
    --beam_size 100 \
    --data_folder data/umt/decomposition_name

The different pseudo-decomposition methods are:

  • Decompose using a bag of FastText vectors
  • Randomly pair short questions (for ablations/comparisons)
  • Decompose using a bag of TF-IDF vectors
  • Decompose using a bag of FastText vectors, but with a variable number of sub-questions (see Appendix)
  • Decompose using BERT embeddings (requires generating the BERT embeddings first)
  • Decompose using BERT NSP embeddings (not in the paper; requires generating the BERT embeddings first)
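
The embedding-based search described above (keep a beam of short questions, then search pairs exhaustively) can be sketched in a few lines. This is a toy illustration, not the repo's implementation. It assumes each question is already a vector, and it assumes a scoring rule that rewards similarity to the complex question and penalizes similarity between the two sub-questions. The names `pseudo_decompose`, `unit`, and `dot`, and the 2-D vectors, are all hypothetical.

```python
from itertools import combinations
from math import sqrt
from typing import Dict, List, Tuple

Vector = List[float]


def unit(v: Vector) -> Vector:
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    norm = sqrt(sum(x * x for x in v))
    return [x / norm for x in v] if norm else v


def dot(a: Vector, b: Vector) -> float:
    return sum(x * y for x, y in zip(a, b))


def pseudo_decompose(hard: Vector, simple: Dict[str, Vector], beam_size: int) -> Tuple[str, str]:
    """Pick the pair of simple questions that best covers the hard question."""
    q = unit(hard)
    units = {text: unit(vec) for text, vec in simple.items()}
    # Keep only the beam_size candidates most similar to the hard question.
    beam = sorted(units, key=lambda t: dot(q, units[t]), reverse=True)[:beam_size]
    if len(beam) < 2:
        raise ValueError("need at least two candidate questions")

    def score(pair: Tuple[str, str]) -> float:
        s1, s2 = units[pair[0]], units[pair[1]]
        # Assumed objective: similar to the hard question, dissimilar to each other.
        return dot(q, s1) + dot(q, s2) - dot(s1, s2)

    # Exhaustively score every pair inside the beam.
    return max(combinations(beam, 2), key=score)


candidates = {
    "Who directed film X?": [1.0, 0.0],
    "When was that director born?": [0.0, 1.0],
    "When was the director of film X born?": [1.0, 1.0],
}
best = pseudo_decompose([1.0, 1.0], candidates, beam_size=3)
print(best)
```

In this toy setup the two complementary questions win over any pair that includes the near-copy of the hard question, because the redundancy penalty cancels the copy's high similarity.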

Variable Number of Sub-Questions

To train a decomposition model to generate a variable number of sub-questions, you'll need to make the following changes:

  • Train on variable-length pseudo-decompositions, created using python pseudoalignment/ (see above).
  • Use a version of the unsupervised stopping criterion which only counts bad decompositions as those with N<2 sub-questions (as opposed to N!=2 sub-questions). Simply add the flag --one_to_variable when training (Unsupervised) Seq2Seq models with XLM/
  • Have the single-hop QA model answer an arbitrary number of sub-questions, instead of a maximum of 2 sub-questions. Simply add --one_to_variable to the FLAGS variable used in the "QA Model Training" section earlier.
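
The difference between the two notions of a "bad" decomposition in the second item above can be sketched as follows. The sketch assumes sub-questions can be counted by splitting the generated text on question marks, which is a simplification chosen for illustration. `count_sub_questions` and `is_bad_decomposition` are hypothetical names, not functions from the repo.

```python
def count_sub_questions(decomposition: str) -> int:
    """Count non-empty segments that are terminated by a question mark."""
    segments = decomposition.split("?")[:-1]  # drop whatever follows the last '?'
    return sum(1 for segment in segments if segment.strip())


def is_bad_decomposition(decomposition: str, one_to_variable: bool = False) -> bool:
    """Default: bad if N != 2 sub-questions. With one_to_variable: bad only if N < 2."""
    n = count_sub_questions(decomposition)
    return n < 2 if one_to_variable else n != 2


three = "Who wrote the book? When was the author born? Where was the author born?"
print(is_bad_decomposition(three))
print(is_bad_decomposition(three, one_to_variable=True))
```

Under these assumptions, a three-question decomposition counts as bad by default but is accepted once `one_to_variable` is enabled, while a single-question output is bad in both modes.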

Data mined from Common Crawl

Data mined from common crawl using our fasttext classifiers can be found here


    title={Unsupervised Question Decomposition for Question Answering},
    author={Ethan Perez and Patrick Lewis and Wen-tau Yih and Kyunghyun Cho and Douwe Kiela},


See the LICENSE file for more details.


unsuperviseddecomposition's Issues

Training with negative data

Thanks for releasing the code of such great work. I have a question in terms of training.

When training, can I add negative pseudo-decompositions to the training set? Normally {Q, q1, q2} is a gold pseudo-decomposition. Can I also add another pseudo-decomposition {Q, q3, q4} when training, given that {Q, q3, q4} is not the right decomposition?

Looking forward to hearing from you.


Is it working on Windows system?

I have been trying to install the system on Windows 10 without success. I got stuck at the step that installs the fastBPE package. It requires the <sys/mman.h> header, which Windows does not have. Is there a workaround?

Processed Common Crawl Questions


I notice that in the paper you augment the data with questions from Common Crawl, but I cannot find those processed questions or raw files in the repo.

Thank you so much!

The folder to put custom training data


I want to use your code to train my own question decomposition model, with my own questions and sub-questions data. Nevertheless, it is quite difficult for me to look up the folder to replace the training data you used with my own training data.

Can you help me specify the folder to put my own training data in?

Thank you.

Training model with custom data

Hi, thanks for releasing the code and amazing work!

I have a question in terms of using the code.
How can I train another question decomposition model with my custom data?
In my case, I have only 1 single question per 1 multi-hop question, so the data could be a bit different.


cannot import name 'RobertaForQuestionAnswering' from 'pytorch_transformers'


I have followed the instruction on your site till the step below:
# Answer sub-Qs
DATA_FOLDER=data/hotpot.umt.all.model=$$ST.beam=$BEAM.lp=$LP.seed=$SEED
for SPLIT in "dev" "train"; do
for NUM_PARAGRAPHS in 1 3; do
    # For fp16: Add "--fp16 --fp16_opt_level O2" below
    python examples/ --model_type roberta --model_name_or_path roberta-large --train_file $DATA_FOLDER/train.json --predict_file $DATA_FOLDER/$SPLIT.json --do_eval --do_lower_case --version_2_with_negative --output_dir checkpoint/roberta_large.hotpot_easy_and_squad.num_paragraphs=$NUM_PARAGRAPHS --per_gpu_train_batch_size 64 --per_gpu_eval_batch_size 32 --learning_rate 1.5e-5 --max_query_length 234 --max_seq_length 512 --doc_stride 50 --num_shards 1 --seed 0 --max_grad_norm inf --adam_epsilon 1e-6 --adam_beta_2 0.98 --weight_decay 0.01 --warmup_proportion 0.06 --num_train_epochs 2 --write_dir $DATA_FOLDER/$NUM_PARAGRAPHS --no_answer_file
done
done

I have installed the Pytorch-transformers by:
pip install pytorch-transformers

However, I found an error:
cannot import name 'RobertaForQuestionAnswering' from 'pytorch_transformers'

Any idea for this?


Mohammad Yani

Can this repo be run using Torch-CPU?

I have a Macbook that does not have a GPU or CUDA installed. Is there a way to run the unsupervised training and subsequent parts in the README without a GPU? I run into errors whenever there are lines like
import apex # CUDA GPU cabability

FileNotFoundError: [Errno 2] No such file or directory: '/UnsupervisedDecomposition/XLM/data/hotpot-orig/qid2passage_entities.json'


When I tried to execute command:
python --subqs_dir data/hotpot.umt.all.model=$$ST.beam=$BEAM.lp=$LP.seed=$SEED --splits train dev --num_shards 1 --model_dir $MODEL_DIR --sample_temperature $ST --beam $BEAM --length_penalty $LP --seed $SEED --subsample_data --use_easy --use_squad $FLAGS

I got the error: FileNotFoundError: [Errno 2] No such file or directory: '/home/ubuntu/son/UnsupervisedDecomposition/XLM/data/hotpot-orig/qid2passage_entities.json'

I checked the previous scripts and cannot find how to generate the qid2passage_entities.json file.
Could you specify how to create that file?

