PaniniQA

Repo for the TACL 2023 paper "PaniniQA: Enhancing Patient Education Through Interactive Question Answering"

1. Dataset

We open-source two datasets:

  1. Dataset 1 - 456 annotated discharge instructions from MIMIC-III Clinical Database
  2. Dataset 2 - 100 synthesized discharge instructions generated by pre-trained neural models

Detailed instructions on Dataset 1

The 456 discharge instructions in Dataset 1 are from the MIMIC-III Clinical Database, a large freely-available database comprising deidentified health-related data associated with patients who stayed in critical care units of the Beth Israel Deaconess Medical Center. We provide all the annotation files in the folder data/annotated_dataset/annotated_files/. There are two types of files in this folder:

  1. Files ending with evt.csv record the key medical events in each discharge instruction.
  2. Files ending with rel.txt record the key medical relations in each discharge instruction.

Each file is identified by a unique identifier of the form row_id-subject_id-hadm_id.
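The naming convention above can be sketched in Python as follows. This is only an illustration: the helper names are hypothetical, and it assumes each filename starts with the three dash-separated integers row_id-subject_id-hadm_id, followed by whatever separates the identifier from the evt.csv or rel.txt suffix.

```python
import re
from typing import NamedTuple


class NoteId(NamedTuple):
    """Identifier of one discharge instruction (hypothetical helper type)."""
    row_id: int
    subject_id: int
    hadm_id: int


# Assumption: the filename begins with three dash-separated integers.
_ID_PATTERN = re.compile(r"^(\d+)-(\d+)-(\d+)")


def parse_note_id(filename: str) -> NoteId:
    """Extract row_id, subject_id and hadm_id from an annotation filename."""
    match = _ID_PATTERN.match(filename)
    if match is None:
        raise ValueError(f"no row_id-subject_id-hadm_id prefix in {filename!r}")
    return NoteId(*(int(group) for group in match.groups()))


def annotation_kind(filename: str) -> str:
    """Tell event files (evt.csv) apart from relation files (rel.txt)."""
    if filename.endswith("evt.csv"):
        return "event"
    if filename.endswith("rel.txt"):
        return "relation"
    raise ValueError(f"unrecognized annotation file: {filename!r}")
```

Grouping the files in data/annotated_dataset/annotated_files/ by the parsed identifier would pair each event file with its relation file.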

Due to the data security agreement, we cannot release the discharge instructions in this repo; you will need to acquire them from MIMIC-III yourself. To do so, please first obtain the credential from here. After acquiring the credential, please visit this link to download the file NOTEEVENTS.csv.gz.
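The downloaded file is gzip-compressed, while the extraction command takes a path ending in NOTEEVENTS.csv. Whether the script also accepts the compressed file is not stated here, so a minimal sketch for decompressing it with the Python standard library may help. The function name is hypothetical.

```python
import gzip
import shutil
from pathlib import Path


def decompress_gzip(src: Path, dst: Path) -> Path:
    """Stream-decompress a .gz file to dst without loading it into memory."""
    with gzip.open(src, "rb") as f_in, open(dst, "wb") as f_out:
        shutil.copyfileobj(f_in, f_out)
    return dst


# Example call (paths are placeholders):
# decompress_gzip(Path("PATH/TO/NOTEEVENTS.csv.gz"), Path("PATH/TO/NOTEEVENTS.csv"))
```

Copying in chunks keeps memory use low, which matters because the file is large.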

Then run the following command to extract the discharge instructions from the downloaded file:

python scripts/data_process/extract_mimic.py \
  --input_file PATH/TO/NOTEEVENTS.csv \
  --anno_dir data/annotated_dataset/annotated_files/ \
  --output_dir data/annotated_dataset/raw_notes/

Once the command has finished, you should see 456 txt files (discharge instructions) in the folder data/annotated_dataset/raw_notes/.
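A quick way to confirm the extraction is to count the txt files in the output folder. The sketch below is a hypothetical check, not part of the repo's scripts; it assumes only that the notes are written as files with a .txt extension directly inside the folder.

```python
from pathlib import Path

EXPECTED_NOTES = 456  # number of discharge instructions in Dataset 1


def count_notes(note_dir: str) -> int:
    """Count the .txt files directly inside note_dir."""
    return sum(1 for path in Path(note_dir).glob("*.txt") if path.is_file())


def extraction_complete(note_dir: str, expected: int = EXPECTED_NOTES) -> bool:
    """Return True when the folder holds exactly the expected number of notes."""
    return count_notes(note_dir) == expected


# Example call:
# extraction_complete("data/annotated_dataset/raw_notes/")
```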

Creating train/validation/test sets for Key Medical Event Identification

Creating train/validation/test sets for Key Medical Relation Identification

You may use the following command to create the train/validation/test sets for medical relation classification:

CUDA_VISIBLE_DEVICES=0 python scripts/data_process/process_rel_cls.py \
  --anno_dir data/annotated_dataset/annotated_files/ \
  --note_dir data/annotated_dataset/raw_notes/ \
  --split_file data/annotated_dataset/split.json \
  --output_dir data/rel_cls/

This command will generate the datasets for relation classification in the directory data/rel_cls/.

Detailed instructions on Dataset 2

We provide 30 synthesized discharge instructions in data/synthesized_dataset/raw_notes/. These discharge instructions were generated using the models in this repo. We also provide the human-annotated cloze questions in data/synthesized_dataset/cloze/.

2. Identifying Key Medical Events

3. Identifying Key Medical Relations

To train and evaluate the medical relation classification model, run the following command:

# We use RoBERTa-large-PM-M3-Voc-hf from https://github.com/facebookresearch/bio-lm as the pre-trained model.
CUDA_VISIBLE_DEVICES=0 python run_classification.py \
  --model_name_or_path path/to/pre-trained/model \
  --train_file data/rel_cls/train.json \
  --validation_file data/rel_cls/test.json \
  --max_length 512 \
  --per_device_train_batch_size 8 \
  --learning_rate 2e-5 \
  --num_train_epochs 5 \
  --output_dir path/to/output/directory \
  --with_tracking \
  --pos_weight 1.5
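The --pos_weight flag is not explained here. Its name suggests a weight on the positive class in a binary cross-entropy loss, in the style of the pos_weight argument of PyTorch's BCEWithLogitsLoss. Under that assumption (the script may apply the value differently), the effect on a single example can be sketched in plain Python:

```python
import math


def softplus(x: float) -> float:
    """Numerically stable log(1 + exp(x))."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def weighted_bce_with_logits(logit: float, label: int, pos_weight: float = 1.0) -> float:
    """Binary cross-entropy on a raw logit, with the positive term scaled.

    Uses log(sigmoid(x)) = -softplus(-x) and log(1 - sigmoid(x)) = -softplus(x).
    Assumption: this mirrors how pos_weight is commonly defined; it is not
    taken from run_classification.py.
    """
    positive_term = pos_weight * label * softplus(-logit)
    negative_term = (1 - label) * softplus(logit)
    return positive_term + negative_term
```

With pos_weight set to 1.5, a misclassified positive example costs 1.5 times as much as it would unweighted, which pushes the model toward higher recall on the positive class.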
