MLEC-QA

This repository contains the data and baseline code for the paper "MLEC-QA: A Chinese Multi-Choice Biomedical Question Answering Dataset", published at the 2021 Conference on Empirical Methods in Natural Language Processing (EMNLP 2021).

If you would like to use the data or code, please cite:

@inproceedings{li-etal-2021-mlec,
    title = "{MLEC-QA}: {A} {C}hinese {M}ulti-{C}hoice {B}iomedical {Q}uestion {A}nswering {D}ataset",
    author = "Li, Jing  and
      Zhong, Shangping  and
      Chen, Kaizhi",
    booktitle = "Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing",
    month = nov,
    year = "2021",
    address = "Online and Punta Cana, Dominican Republic",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2021.emnlp-main.698",
    pages = "8862--8874",
}

MLEC-QA is a Chinese multi-choice biomedical question answering dataset. Its questions are collected from the National Medical Licensing Examination in China (NMLEC) and are carefully designed by human experts to evaluate the professional knowledge and skills of candidates who want to become medical practitioners in China.

We hope the release of the MLEC-QA dataset serves as a valuable resource for research and evaluation in open-domain QA and helps advance biomedical question answering systems.

Dataset

Download MLEC-QA dataset: Google Drive

MLEC-QA is composed of 5 subsets with 136,236 Chinese multi-choice biomedical questions with extra materials (images or tables) annotated by human experts, and covers the following biomedical sub-fields:

  • Clinic (临床): Clinic_{train,dev,test}.json
  • Stomatology (口腔): Stomatology_{train,dev,test}.json
  • Public Health (公共卫生): PublicHealth_{train,dev,test}.json
  • Traditional Chinese Medicine (中医): TCM_{train,dev,test}.json
  • Traditional Chinese Medicine Combined with Western Medicine (中西医结合): CWM_{train,dev,test}.json

The JSON dataset file format is as follows:

{
	"qid":The question ID,
	"qtype":["A1型题", "B1型题", "A2型题", "A3/A4型题"],
	"qtext":Description of the question,
	"qimage":Image or table path (if any),
	"options":{
		"A":Description of the option A,
		"B":Description of the option B,
		"C":Description of the option C,
		"D":Description of the option D,
		"E":Description of the option E
	},
	"answer":["A", "B", "C", "D", "E"]
}   	
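
The record format above can be read with the standard library alone. The following is a minimal sketch, assuming each subset file holds a single JSON array of such records and that "qtype" and "answer" each contain one of the listed values as a string; neither assumption is confirmed here. The helper `load_questions` and the sample record are hypothetical and made up for illustration.

```python
import json
from typing import Dict, List


def load_questions(json_text: str) -> List[Dict]:
    """Parse a JSON array of question records and check the expected fields.

    Assumption: the file is one JSON array; the real layout may differ.
    """
    records = json.loads(json_text)
    required = {"qid", "qtype", "qtext", "options", "answer"}
    for record in records:
        missing = required - record.keys()
        if missing:
            raise ValueError(f"record {record.get('qid')} is missing {sorted(missing)}")
    return records


# Hypothetical record, not taken from the dataset.
sample = json.dumps([{
    "qid": 1,
    "qtype": "A1型题",
    "qtext": "example question",
    "qimage": "",
    "options": {"A": "a", "B": "b", "C": "c", "D": "d", "E": "e"},
    "answer": "C",
}], ensure_ascii=False)

questions = load_questions(sample)
first = questions[0]
# Look up the text of the correct option through the answer letter.
correct_text = first["options"][first["answer"]]
```

For a real file, the text could be read with `open(path, encoding="utf-8").read()` and passed to the same helper.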

Baselines

Install the requirements:

cd code
pip install -r requirements.txt

Control Methods

  • Random: For each question, an option is randomly chosen as the answer from five candidate options. We perform this experiment five times and average the results as the baseline of the Random method.
  • Constant_j: For each question, the $j^{th}$ option is always chosen as the answer to obtain the accuracy distribution of five candidate options.
  • Mixed: The Mixed method simulates how humans answer questions they are unsure about. It consists of the following 3 strategies, which are applied in turn; as soon as a strategy matches, the option it selects is chosen as the answer:
    • Options such as "All of the options above are correct/incorrect" are correct much more often than the other options, so such an option is chosen when present.
    • If the options are roughly equal in length except for one that is clearly longer (with more detailed and specific descriptions) or clearly shorter than the others, that option is chosen.
    • The correct option tends to appear in the middle of the candidate options.
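
The control methods above can be sketched in a few lines. This is an illustration under stated assumptions, not the repository's code: answers are taken to be single letters "A" to "E", the function names are hypothetical, and only the length-based strategy of the Mixed method is shown, with an arbitrary `ratio` threshold standing in for "obviously longer or shorter".

```python
import random
from typing import Dict, Optional, Sequence

LETTERS = "ABCDE"


def accuracy(predictions: Sequence[str], answers: Sequence[str]) -> float:
    """Fraction of predictions that equal the gold answers."""
    correct = sum(p == a for p, a in zip(predictions, answers))
    return correct / len(answers)


def random_baseline(answers: Sequence[str], runs: int = 5, seed: int = 0) -> float:
    """Random: pick one of five options per question, averaged over `runs` runs."""
    rng = random.Random(seed)
    scores = []
    for _ in range(runs):
        predictions = [rng.choice(LETTERS) for _ in answers]
        scores.append(accuracy(predictions, answers))
    return sum(scores) / runs


def constant_baseline(answers: Sequence[str], j: int) -> float:
    """Constant_j: always pick the j-th option (1-based)."""
    return accuracy([LETTERS[j - 1]] * len(answers), answers)


def length_outlier(options: Dict[str, str], ratio: float = 2.0) -> Optional[str]:
    """Length strategy of Mixed: return the single option that is clearly
    longer or shorter than all others, or None if there is no such option.

    Assumption: "clearly" means a factor of `ratio` in character count.
    """
    lengths = {key: len(text) for key, text in options.items()}
    outliers = []
    for key, n in lengths.items():
        others = [m for k, m in lengths.items() if k != key]
        if n >= ratio * max(others) or n * ratio <= min(others):
            outliers.append(key)
    return outliers[0] if len(outliers) == 1 else None


# Toy gold answers, made up for illustration.
toy_answers = ["A", "C", "C", "E"]
constant_scores = [constant_baseline(toy_answers, j) for j in range(1, 6)]
```

Because every question has exactly one correct option, the five Constant_j accuracies sum to 1, which makes them a quick view of how answers are distributed over positions.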

Open-Domain QA Methods

The Open-Domain QA methods consist of a two-stage retriever-reader framework:

  • A retriever that finds documents that (might) contain an answer in a large collection of documents. We adopt Chinese Wikipedia dumps as our information source and use Elasticsearch, a distributed search and analytics engine, as the document store and document retriever.

  • A reader that finds the answer in the documents returned by the retriever. We fine-tune five pre-trained language models for machine reading comprehension as the reader.

Document Retriever

  1. Download Elasticsearch 7.10.1 and Kibana 7.10.1, and run the servers locally with the out-of-the-box defaults.
  2. Create an inverted index of Chinese Wikipedia dumps in Elasticsearch using wiki_zh_json2es.py.
  3. Run retriever.py.
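
To give a feel for what a retrieval request against the inverted index might look like, here is a rough sketch that only builds Elasticsearch `match` query bodies as plain dictionaries. It is not taken from retriever.py. The field name `text`, the choice to query with the question text followed by each option, and the helper `build_queries` are all assumptions made for illustration.

```python
from typing import Dict


def build_queries(qtext: str, options: Dict[str, str], top_k: int = 5) -> Dict[str, dict]:
    """Build one Elasticsearch `match` query body per candidate option.

    Assumptions: indexed documents expose a `text` field, and the query
    string is the question text followed by the option text.
    """
    return {
        letter: {
            "size": top_k,  # number of hits to return
            "query": {"match": {"text": f"{qtext} {option}"}},
        }
        for letter, option in options.items()
    }


# Hypothetical question and options, made up for illustration.
bodies = build_queries("Q", {"A": "x", "B": "y"}, top_k=3)
```

Each body could then be sent to the `_search` endpoint of the index created in step 2.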

Document Reader

Models

The pre-trained language models used in the Open-Domain QA methods can be downloaded from Hugging Face. Use the scripts in the scripts directory to convert them into a format that the reader can load directly.

Usage

run_mlecqa.py [-h] [--pretrained_model_path PRETRAINED_MODEL_PATH]
                 [--output_model_path OUTPUT_MODEL_PATH]
                 [--vocab_path VOCAB_PATH] [--spm_model_path SPM_MODEL_PATH]
                 --train_path TRAIN_PATH --dev_path DEV_PATH
                 [--test_path TEST_PATH] [--config_path CONFIG_PATH]
                 [--embedding {word,word_pos,word_pos_seg,word_sinusoidalpos}]
                 [--max_seq_length MAX_SEQ_LENGTH]
                 [--relative_position_embedding]
                 [--relative_attention_buckets_num RELATIVE_ATTENTION_BUCKETS_NUM]
                 [--remove_embedding_layernorm] [--remove_attention_scale]
                 [--encoder {transformer,rnn,lstm,gru,birnn,bilstm,bigru,gatedcnn}]
                 [--mask {fully_visible,causal,causal_with_prefix}]
                 [--layernorm_positioning {pre,post}]
                 [--feed_forward {dense,gated}] [--remove_transformer_bias]
                 [--layernorm {normal,t5}] [--bidirectional]
                 [--factorized_embedding_parameterization]
                 [--parameter_sharing] [--learning_rate LEARNING_RATE]
                 [--warmup WARMUP] [--fp16] [--fp16_opt_level {O0,O1,O2,O3}]
                 [--optimizer {adamw,adafactor}]
                 [--scheduler {linear,cosine,cosine_with_restarts,polynomial,constant,constant_with_warmup}]
                 [--batch_size BATCH_SIZE] [--seq_length SEQ_LENGTH]
                 [--dropout DROPOUT] [--epochs_num EPOCHS_NUM]
                 [--report_steps REPORT_STEPS] [--seed SEED]
                 [--max_choices_num MAX_CHOICES_NUM]
                 [--tokenizer {bert,char,space}]

An example of using run_mlecqa.py:

python3 run_mlecqa.py --pretrained_model_path models/bert-base.bin \
--vocab_path models/google_zh_vocab.txt \
--train_path datasets/train.json \
--dev_path datasets/dev.json \
--test_path datasets/test.json \
--epochs_num 12 \
--batch_size 1 \
--seq_length 512 \
--max_choices_num 5 \
--learning_rate 2e-6 \
--report_steps 100

The actual batch size is --batch_size times --max_choices_num.
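
The batch-size arithmetic can be illustrated with a small sketch. It assumes that each question is expanded into one (question, option) sequence per choice and padded with empty options up to --max_choices_num; how run_mlecqa.py actually builds its inputs is not shown here, and `expand_question` is a hypothetical helper.

```python
from typing import Dict, List, Tuple


def expand_question(qtext: str, options: Dict[str, str],
                    max_choices_num: int = 5) -> List[Tuple[str, str]]:
    """Turn one question into `max_choices_num` (question, option) pairs.

    Assumption: questions with fewer options are padded with empty options.
    """
    pairs = [(qtext, options[key]) for key in sorted(options)]
    pairs += [(qtext, "")] * (max_choices_num - len(pairs))
    return pairs[:max_choices_num]


batch_size = 1
max_choices_num = 5
# Sequences the encoder processes per step: 1 question x 5 choices = 5.
sequences_per_step = batch_size * max_choices_num

pairs = expand_question("q", {"A": "a", "B": "b"}, max_choices_num)
```

With the example command above, each step therefore encodes 5 sequences of up to 512 tokens, which is worth keeping in mind when raising --batch_size on limited GPU memory.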
