BERT-Dureader

A BERT implementation for the lic2019 Dureader 2.0 competition. The final score was Rouge-L: 49.82, Bleu-4: 51.9, ranking 16th. The main dependencies are:

python==3.6
torch==0.4.1
pytorch_pretrained_bert==0.6.1

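Download the data

Register for the lic2019 reading comprehension track and download the data. Only the preprocessed data is used here. Place it under the data folder in the following layout: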

data:.
├─dev_preprocessed
│  └─devset
├─test1_preprocessed
│  └─test1set
├─test2_preprocessed
│  └─test2set
└─train_preprocessed

The official Baseline repository also provides a data download script. After downloading with that script, arrange the data in this layout.
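A quick way to confirm the layout before running anything is a small check like the one below. It is only a sketch: the directory names come from the tree above, while the helper name `missing_dirs` and the default root `data` are assumptions made for illustration.

```python
from pathlib import Path

# Sub-directories taken from the tree above. File names inside them are not
# specified in this README, so only the directories are checked.
EXPECTED_DIRS = [
    "dev_preprocessed/devset",
    "test1_preprocessed/test1set",
    "test2_preprocessed/test2set",
    "train_preprocessed",
]


def missing_dirs(data_root):
    """Return the expected sub-directories that do not exist under data_root."""
    root = Path(data_root)
    return [d for d in EXPECTED_DIRS if not (root / d).is_dir()]


if __name__ == "__main__":
    missing = missing_dirs("data")  # hypothetical default location
    if missing:
        print("Missing directories:", ", ".join(missing))
    else:
        print("Data layout looks complete.")
```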

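Go to the official Google BERT repository, download the Chinese pretrained model (chinese_L-12_H-768_A-12), and unzip it into the data directory. Then use pytorch_pretrained_bert to convert the TensorFlow model into a PyTorch model: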

export BERT_BASE_DIR=/path/to/bert/chinese_L-12_H-768_A-12

pytorch_pretrained_bert convert_tf_checkpoint_to_pytorch \
  $BERT_BASE_DIR/bert_model.ckpt \
  $BERT_BASE_DIR/bert_config.json \
  $BERT_BASE_DIR/pytorch_model.bin

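Select candidate documents

To select the candidate documents that are fed into BERT for prediction, I followed the approach of the paper Passage Re-ranking with BERT and used BERT to rank paragraphs. Because data preparation and time were somewhat rushed, the documents retrieved this way were not ideal. There may still be some problems in this step, and there is room for improvement. First, a training set is prepared from train_preprocessed. BERT is then trained on it as a binary classifier, and the paragraphs are sorted by the resulting relevance score. The ranking I obtained was not good, so the zhidao and search datasets use different post-processing methods.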

$ cd retriever
$ python prepare.py
$ python run_classifier.py --data_dir ./retriever_data/ --bert_model ../data/chinese_L-12_H-768_A-12/ --task_name MRPC --output_dir ./retriever_output --do_train --do_eval --train_batch_size 8
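Since run_classifier.py is called with `--task_name MRPC`, the training data presumably has to follow the MRPC TSV layout, in which the label sits in the first column and the two texts in the fourth and fifth. What prepare.py actually does is not described here, so the following is only a sketch of one plausible way to build such a file. The field names `question`, `documents`, `paragraphs`, `is_selected` and `most_related_para` are assumptions about the preprocessed DuReader JSON, and both helper functions are hypothetical.

```python
def build_pairs(sample):
    """Turn one DuReader-style sample into (label, question, paragraph) rows.

    Assumption: in a selected document, the paragraph at `most_related_para`
    is the positive example; every other paragraph is a negative example.
    """
    rows = []
    question = sample["question"]
    for doc in sample["documents"]:
        positive_idx = doc.get("most_related_para", -1) if doc.get("is_selected") else -1
        for idx, para in enumerate(doc["paragraphs"]):
            label = "1" if idx == positive_idx else "0"
            rows.append((label, question, para))
    return rows


def to_mrpc_tsv(rows):
    """Render rows in the five-column MRPC layout: label, id1, id2, text1, text2."""
    lines = ["Quality\t#1 ID\t#2 ID\t#1 String\t#2 String"]
    for i, (label, question, para) in enumerate(rows):
        # Tabs and newlines inside the text would break the TSV, so flatten them.
        q = " ".join(question.split())
        p = " ".join(para.split())
        lines.append("\t".join([label, str(i), str(i), q, p]))
    return "\n".join(lines) + "\n"
```

Treating every non-selected paragraph as a negative gives a heavily imbalanced training set, which may be one reason a ranker trained this way underperforms; subsampling negatives is a common alternative.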

After training, the model is saved in the ./retriever_output folder. Next, use the trained model to screen the test set and select the relevant candidate documents:

$ python bert_rank.py --test_file ../data/test1_preprocessed/test1set/zhidao.test1.json --output_path ../zhidao_test_rank_output.json
$ python bert_rank.py --test_file ../data/test1_preprocessed/test1set/search.test1.json --output_path ../search_test_rank_output.json
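The ranking step itself can be pictured as follows: the binary classifier yields two logits per (question, paragraph) pair, the softmax probability of the "relevant" class serves as the score, and paragraphs are sorted by it. This is a minimal sketch of that idea, not the logic of bert_rank.py; the function names and the `top_k` cut-off are hypothetical, and the dataset-specific post-processing for zhidao and search is not reproduced.

```python
import math


def relevance(logits):
    """Softmax probability of the 'relevant' class from [irrelevant, relevant] logits."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    return exps[1] / sum(exps)


def rank_paragraphs(paragraphs, logits_per_paragraph, top_k=3):
    """Return the top_k (score, paragraph) pairs, highest relevance first."""
    scored = [(relevance(l), p) for p, l in zip(paragraphs, logits_per_paragraph)]
    scored.sort(key=lambda item: item[0], reverse=True)
    return scored[:top_k]
```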

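Train the extraction model with BERT

First preprocess the training set into SQuAD format, then fine-tune the BERT model with run_dureader.py. The resulting model is saved in reader_output.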
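For reference, a SQuAD v1.1-style file nests `data` → `paragraphs` → `qas` → `answers`, with each answer carrying its text and character offset. The sketch below builds one such entry from a question, a context paragraph, and an answer string. It does not reproduce prepare_squad.py: the helper names are hypothetical, and locating the answer with `str.find` is a simplifying assumption.

```python
def to_squad_entry(question_id, question, context, answer_text):
    """Build one SQuAD v1.1-style entry, or None if the answer is not in the context."""
    start = context.find(answer_text)
    if start < 0:
        return None  # an extractive model cannot learn from an answer it cannot point to
    return {
        "title": str(question_id),
        "paragraphs": [{
            "context": context,
            "qas": [{
                "id": str(question_id),
                "question": question,
                "answers": [{"text": answer_text, "answer_start": start}],
            }],
        }],
    }


def to_squad_file(entries, version="1.1"):
    """Wrap the usable entries in the top-level SQuAD structure."""
    return {"version": version, "data": [e for e in entries if e is not None]}
```

The result can be written with `json.dump(..., ensure_ascii=False)` so that Chinese text stays readable in the output file.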

$ cd reader
$ python prepare_squad.py
$ python run_dureader.py --bert_model ../data/chinese_L-12_H-768_A-12 --do_train --train_file ./dureader_train.json --train_batch_size 12 --learning_rate 3e-5 --num_train_epochs 3.0 --max_seq_length 384 --doc_stride 128 --output_dir ./reader_output
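The `--max_seq_length 384` and `--doc_stride 128` flags control how long paragraphs are cut into overlapping windows. The sketch below follows the sliding-window scheme used in the BERT SQuAD example scripts; that run_dureader.py behaves the same way is an assumption, and `doc_spans` is a hypothetical helper name.

```python
def doc_spans(num_doc_tokens, num_query_tokens, max_seq_length=384, doc_stride=128):
    """Return (start, length) windows over the document tokens."""
    # The input is [CLS] query [SEP] document [SEP], so three slots go to special tokens.
    max_doc = max_seq_length - num_query_tokens - 3
    if max_doc <= 0:
        raise ValueError("query is too long for max_seq_length")
    spans = []
    start = 0
    while start < num_doc_tokens:
        length = min(num_doc_tokens - start, max_doc)
        spans.append((start, length))
        if start + length == num_doc_tokens:
            break  # the last window reaches the end of the document
        start += min(length, doc_stride)
    return spans
```

For example, a 500-token paragraph with a 21-token question leaves 360 tokens per window, so the windows start at tokens 0, 128 and 256.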

Combine retrieval and extraction to predict on the test set

$ python prepare_test.py
$ python predict_dureader.py --bert_model ./data/chinese_L-12_H-768_A-12/ --bin_path ./reader/reader_output/pytorch_model.bin --predict_file ./dureader_test.json --output ./test1_output
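When several retrieved paragraphs each yield a candidate span, one answer has to be chosen per question. A simple rule, sketched below, keeps the span whose start and end logits sum highest. This selection rule is an assumption and may differ from what predict_dureader.py does; the dictionary keys `text`, `start_logit` and `end_logit` are hypothetical.

```python
def best_answer(candidates):
    """Pick the non-empty candidate span with the highest start + end logit.

    candidates: list of dicts with keys 'text', 'start_logit', 'end_logit',
    one per span extracted from the retrieved paragraphs.
    """
    valid = [c for c in candidates if c["text"].strip()]
    if not valid:
        return ""  # no usable span for this question
    best = max(valid, key=lambda c: c["start_logit"] + c["end_logit"])
    return best["text"]
```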

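yes_no three-way classification

For the additional yes_no_depends three-way classification experiment, you only need to prepare a classification training set from train_set and run three-way classification with run_classifier.py from pytorch_pretrained_bert. My final result was an accuracy of about 0.7, and adding it improved the overall score by roughly 0.4 points. The code is not included here.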
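Since that code is not included, here is a rough sketch of how such a training set could be assembled and how accuracy would be computed. It assumes each YES_NO sample carries parallel `answers` and `yesno_answers` lists with the labels `Yes`, `No` and `Depends`; these field names and both helpers are assumptions, not the author's code.

```python
LABELS = ["Yes", "No", "Depends"]


def build_yesno_examples(sample):
    """Return (question, answer, label_id) triples for one YES_NO sample."""
    if sample.get("question_type") != "YES_NO":
        return []
    pairs = zip(sample.get("answers", []), sample.get("yesno_answers", []))
    # Answers with a label outside LABELS are skipped.
    return [(sample["question"], ans, LABELS.index(lab))
            for ans, lab in pairs if lab in LABELS]


def accuracy(predictions, golds):
    """Fraction of predictions that match the gold label ids."""
    if not golds:
        return 0.0
    return sum(p == g for p, g in zip(predictions, golds)) / len(golds)
```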

bert-dureader's People

Contributors

cooscao

bert-dureader's Issues

Meaning of Dureader test1, test2?

I have the Dureader2 dataset, but I don't know what test1 and test2 mean.
I see that the train, dev, and test folders of the Dureader preprocessed data each contain two files: search and zhidao.

What do test1 and test2 mean?

Thanks

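A question about adding features

Thanks for sharing. Have you tried concatenating other text features after the BERT embedding output, such as question type or words in question? After I added them, the results got worse. Am I doing something wrong? Thanks for your help.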

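Missing files

Thanks for sharing. The predict_dureader.py file is empty. Are the other files, such as prepare_submit.py for producing the submission file, empty as well?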
