DuReader BERT

A machine reading comprehension model for DuReader 2019.

Base code: Dureader-Bert

Pre-trained model downloads: BERT-base-chinese, wwm & wwm-ext

DuReader data download: DuReader_v2.0_preprocessed.zip

Summary

ROUGE-L after each cumulative change:

48.89 Baseline
49.65 (+0.76) Hyperparameter optimization
50.01 (+0.36) Paragraph selection – general BERT QP classification model
50.27 (+0.26) Paragraph selection – fine-tuned BERT QP classification model
50.55 (+0.28) Sample selection – use the full range of match scores
50.89 (+0.34) Improved pre-training – wwm-ext
51.46 (+0.57) Model improvement
51.50 (+0.04) Data augmentation – CMRC and DRCD
51.57 (+0.07) Data augmentation – synonym word replacement
51.78 (+0.21) Ensemble

Single: ROUGE-L 51.57, BLEU-4 48.70
Ensemble: ROUGE-L 51.78, BLEU-4 48.37
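
As a quick sanity check on the figures in the summary, the per-step gains can be recomputed from the scores themselves. This is only an illustrative sketch: the scores are copied from the list, and the helper name `gains` is made up for this example.

```python
# ROUGE-L after each step, copied from the summary list.
steps = [
    ("Baseline", 48.89),
    ("Hyperparameter optimization", 49.65),
    ("Paragraph selection (general QP model)", 50.01),
    ("Paragraph selection (fine-tuned QP model)", 50.27),
    ("Sample selection (full range of match scores)", 50.55),
    ("Improved pre-training (wwm-ext)", 50.89),
    ("Model improvement", 51.46),
    ("Data augmentation (CMRC and DRCD)", 51.50),
    ("Data augmentation (synonym replacement)", 51.57),
    ("Ensemble", 51.78),
]


def gains(steps):
    """Return (name, score, delta) rows, delta being relative to the previous step."""
    rows = []
    prev = None
    for name, score in steps:
        delta = 0.0 if prev is None else round(score - prev, 2)
        rows.append((name, score, delta))
        prev = score
    return rows


rows = gains(steps)
total_gain = round(steps[-1][1] - steps[0][1], 2)

for name, score, delta in rows:
    print(f"{score:.2f} ({delta:+.2f}) {name}")
print(f"total gain over baseline: {total_gain:+.2f}")
```

The individual deltas add up to the total gain of the ensemble over the baseline.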

Code

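  • The handle_data folder preprocesses the DuReader data. It is specific to DuReader and has little to do with BERT.
  • The dataset folder contains the code for processing Chinese text. Roughly, it converts text into BERT's input format, (inputs_ids, token_type_ids, input_mask), and then builds a dataloader from it.
  • The predict folder is used for prediction. It is mostly the same as training, with some differences in the details (the output).
  • In short, any input works as long as it matches BERT's input format: (inputs_ids, token_type_ids, input_mask).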
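
To make the input triple concrete, here is a minimal sketch of how a question/paragraph pair could be turned into it. It is not the code from the dataset folder: the function names, the toy character vocabulary, and the truncation rule are assumptions made for illustration, and a real run would use BERT's own tokenizer and vocab.txt.

```python
def build_vocab(texts):
    """Toy character vocabulary (hypothetical); real code would load BERT's vocab.txt."""
    vocab = {"[PAD]": 0, "[UNK]": 1, "[CLS]": 2, "[SEP]": 3}
    for text in texts:
        for ch in text:
            vocab.setdefault(ch, len(vocab))
    return vocab


def build_features(question, paragraph, vocab, max_len=16):
    """Build (input_ids, token_type_ids, input_mask) for one question/paragraph pair.

    Layout: [CLS] question [SEP] paragraph [SEP], followed by zero padding.
    """
    # Character-level split: a simplification of BERT's tokenizer for Chinese.
    q_tokens = list(question)[: max_len - 3]
    # Truncate the paragraph so the pair plus 3 special tokens fits in max_len.
    p_tokens = list(paragraph)[: max_len - len(q_tokens) - 3]

    tokens = ["[CLS]"] + q_tokens + ["[SEP]"] + p_tokens + ["[SEP]"]
    input_ids = [vocab.get(t, vocab["[UNK]"]) for t in tokens]
    # Segment 0 covers [CLS] + question + [SEP]; segment 1 covers paragraph + [SEP].
    token_type_ids = [0] * (len(q_tokens) + 2) + [1] * (len(p_tokens) + 1)
    # 1 marks real tokens, 0 marks padding.
    input_mask = [1] * len(input_ids)

    pad = max_len - len(input_ids)
    input_ids += [vocab["[PAD]"]] * pad
    token_type_ids += [0] * pad
    input_mask += [0] * pad
    return input_ids, token_type_ids, input_mask


vocab = build_vocab(["天气", "今天晴"])
input_ids, token_type_ids, input_mask = build_features("天气", "今天晴", vocab, max_len=10)
print(input_ids)
print(token_type_ids)
print(input_mask)
```

All three lists have length max_len, so they can be stacked into tensors and batched by a dataloader.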

How to Run

Dependencies

  • python3
  • torch 1.0
  • packages: pytorch-pretrained-bert, tqdm, torchtext

Installation with pip

pip install -r requirements.txt

Preprocess the data

Place the downloaded DuReader data in the data folder.

|- data
| |- trainset
| | |- search.train.json
| | |- zhidao.train.json
| |- devset
| | |- search.dev.json
| | |- zhidao.dev.json
| |- testset
| | |- search.test.json
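| | |- zhidao.test.json

# Start each step from the directory that contains handle_data, dataset, and predict.
# Process the data
cd handle_data && sh run.sh
# Build the dataset
cd dataset && python run_squad.py
# Build the prediction dataset
cd predict && python util.py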

Build more datasets

# Build the qp-relevance prediction dataset
cd handle_data && sh run_qp.sh && cd ../predict && python util.py --dev-search-input-file '../../data/extracted/devset/search-qp.dev.json' --dev-zhidao-input-file '../../data/extracted/devset/zhidao-qp.dev.json' --predict-example-files 'predict-qp.data'
# Build the no-match-score training set
cd dataset && python run_squad_no_match_score.py
# Build the synonym-replacement training set
cd dataset && python run_squad_synonym.py
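
The idea behind the synonym training set can be sketched as follows. This is an assumption-laden illustration, not the logic of run_squad_synonym.py: the function name `replace_synonyms`, the tiny synonym dictionary, the replacement probability, and the choice to work on pre-segmented question words are all hypothetical.

```python
import random


def replace_synonyms(tokens, synonyms, prob=0.5, seed=0):
    """Return a copy of tokens where words with known synonyms are swapped with probability prob.

    tokens:   list of already-segmented words (segmentation is assumed to be done elsewhere)
    synonyms: dict mapping a word to a list of replacement candidates (hypothetical data)
    """
    rng = random.Random(seed)  # fixed seed so the augmentation is reproducible
    augmented = []
    for token in tokens:
        candidates = synonyms.get(token)
        if candidates and rng.random() < prob:
            augmented.append(rng.choice(candidates))
        else:
            augmented.append(token)
    return augmented


# Hypothetical synonym dictionary and question, for illustration only.
synonyms = {"怎么": ["如何"], "治疗": ["医治"]}
question = ["怎么", "治疗", "感冒"]
augmented = replace_synonyms(question, synonyms, prob=1.0)
print("".join(augmented))
```

Applying the replacement to the question only would leave answer positions in the paragraph untouched, which matters for span-extraction training; whether the repository does this is not stated here.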

Train

python train.py --model-name 'best_model'

Predict

Predict using the first 5 paragraphs:

cd predict && python predicting.py --model-name 'best_model' --result-file-name 'best_model.json'

Predict using the top 5 paragraphs by qp-relevance score:

cd predict && python predicting.py --model-name 'best_model' --result-file-name 'best_model-qp.json' --source-file-name predict-qp.data
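
The two prediction modes appear to differ in which paragraphs are passed to the reader: the first few in document order, or the ones a question-paragraph (QP) relevance model scores highest. A minimal sketch of both selections, with hypothetical function names and made-up scores (keeping the selected paragraphs in their original order is a choice of this sketch, not something the repository documents):

```python
def select_front_paragraphs(paragraphs, k=5):
    """Keep the first k paragraphs in document order."""
    return paragraphs[:k]


def select_top_paragraphs(paragraphs, scores, k=5):
    """Keep the k paragraphs with the highest relevance scores, in their original order."""
    if len(paragraphs) != len(scores):
        raise ValueError("paragraphs and scores must have the same length")
    # Rank paragraph indices by score, highest first, and keep the best k.
    ranked = sorted(range(len(paragraphs)), key=lambda i: scores[i], reverse=True)[:k]
    return [paragraphs[i] for i in sorted(ranked)]


# Made-up paragraphs and QP relevance scores, for illustration only.
paragraphs = ["p0", "p1", "p2", "p3", "p4", "p5", "p6"]
scores = [0.1, 0.9, 0.2, 0.8, 0.3, 0.7, 0.6]

front = select_front_paragraphs(paragraphs)
top = select_top_paragraphs(paragraphs, scores)
print(front)
print(top)
```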

Ensemble prediction:

cd predict && python ensemble-predicting.py --model-names '["best_model1", "best_model2", "best_model3"]' --model-nums '[6, 6, 6]' --config-names '["bert_config.json", "bert_config.json", "bert_config.json"]' --result-file-name 'ensemble-qp.json' --source-file-name predict-qp.data
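
One common way to ensemble span-extraction models is to average their start and end scores per token and then pick the best span from the averaged scores. The sketch below shows that technique under stated assumptions: it is not taken from ensemble-predicting.py, whose combination rule is not described here, and the names `average_scores`, `best_span`, and `max_answer_len` are hypothetical.

```python
def average_scores(per_model_scores):
    """Element-wise mean of score lists, one list per model."""
    n_models = len(per_model_scores)
    return [sum(column) / n_models for column in zip(*per_model_scores)]


def best_span(start_scores, end_scores, max_answer_len=30):
    """Return ((start, end), score) maximizing start_scores[start] + end_scores[end], start <= end."""
    best = (0, 0)
    best_score = float("-inf")
    for start, s_score in enumerate(start_scores):
        last = min(len(end_scores), start + max_answer_len)
        for end in range(start, last):
            score = s_score + end_scores[end]
            if score > best_score:
                best, best_score = (start, end), score
    return best, best_score


# Made-up per-token probabilities from two models over a 3-token paragraph.
model_starts = [[0.1, 0.7, 0.2], [0.6, 0.3, 0.1]]
model_ends = [[0.1, 0.2, 0.7], [0.2, 0.1, 0.7]]

avg_start = average_scores(model_starts)
avg_end = average_scores(model_ends)
span, span_score = best_span(avg_start, avg_end)
print(span, round(span_score, 2))
```

In this toy case the second model alone would start the answer at token 0, but the averaged scores favour the span starting at token 1.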

Eval

cd metric && python mrc_eval.py best_model.json ref.json v1
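
For reference, ROUGE-L, the main metric in the summary, is an F-measure built on the longest common subsequence (LCS) between a predicted answer and a reference answer. The sketch below is a simplified single-reference version that compares character by character; it is not mrc_eval.py, and the default beta value here is an assumption (the official script may use a different weighting and text normalization).

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of sequences a and b."""
    prev = [0] * (len(b) + 1)
    for x in a:
        cur = [0]
        for j, y in enumerate(b, 1):
            if x == y:
                cur.append(prev[j - 1] + 1)
            else:
                cur.append(max(prev[j], cur[j - 1]))
        prev = cur
    return prev[-1]


def rouge_l(candidate, reference, beta=1.0):
    """ROUGE-L F-measure between a candidate and a single reference.

    P = LCS / len(candidate), R = LCS / len(reference),
    F = (1 + beta^2) * P * R / (R + beta^2 * P)
    """
    lcs = lcs_length(candidate, reference)
    if lcs == 0:
        return 0.0
    precision = lcs / len(candidate)
    recall = lcs / len(reference)
    return (1 + beta ** 2) * precision * recall / (recall + beta ** 2 * precision)


# Character-level comparison of a predicted answer against a reference answer.
score = rouge_l("今天天气晴", "今天晴")
print(round(score, 4))
```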

All-in-one (train, predict, eval)

sh train_and_predict.sh 8 2 512 3e-05 4 6

Reproduce the best result

Train for 1 epoch on train-synonym.data, then for 2 epochs on train-no-match-score.data:

python train.py --epochs 1 --model-name 'best_model_synonym' --trainset-name train-synonym.data --test-lines 186139 --state-dict pytorch_model_wwm_ext.bin --model-num 6 \
&& python train.py --model-name 'best_model_synonym' --state-dict model_dir/best_model_synonym --model-num 6 --trainset-name train-no-match-score.data --test-lines 229345 \
&& cd predict \
&& python predicting.py --model-name 'best_model_synonym' --result-file-name 'best_model_synonym-qp.json' --source-file-name predict-qp.data --model-num 6 \
&& cd ../metric \
&& python mrc_eval.py best_model_synonym-qp.json ref.json v1

Contributors

unbiarirang
