Move paths and file names to main()

Clinical notes redundancy

The aim of this work is to investigate the impact of clinical notes redundancy, generated by copy-paste practice, on natural language processing (NLP) models. Given the widespread use of NLP methods in clinical research, it becomes fundamental to understand whether redundancy should be removed from notes agnostically as a preprocessing step or if it can be dealt with on a case-by-case basis depending on the task. Towards this goal, we first estimate the influence of redundancy on language models intrinsically measuring model's performance through perplexity (PPL) when trained on redundant and non-redundant notes and evaluated on real-world clinical text. Secondly, we investigate how redundancy can affect the results of specific NLP tasks (e.g., classification, concept extraction).

Data

For this project, we considered clinical notes from the n2c2 NLP research data sets for i2b2 challenges. Notes are publicly available and were released to solve different NLP tasks, see dataset description in Table 1.

Challenge	Year	Task	Data source	Ref
Smoking status	2006	Classification	Discharge summaries from PH	[1]
Obesity and comorbidities	2008	Information extraction and classification	Discharge summaries from PH/RPDR	[2]
Medication extraction	2009	Information extraction	Discharge summaries from PH	[3]
Concepts, assertions, and relations	2010	Concept extraction	Discharge summaries from PH/BIDMC/UPMC and progress reports from UPMC	[4]
Coreference resolution	2011	Coreference chain identification	Discharge summaries from PH/BIDMC/UPMC, progress notes from UPMC, clinical and pathology reports from Mayo Clinic, and discharge, radiology, surgical pathology reports, and other from UPMC	[5]
Temporal relations	2012	Information extraction	Discharge summaries from PH/BIDMC	[6]
CAD risk factors	2014	Classification, feature selection	PH EMRs (MGH and BWH)	[7]
Cohort selection	2018	Classification	Records from 2014 challenge	[8]
Medication extraction and ADEs	2018	Information extraction, relation classification	Discharge summaries from MIMIC-III	[9]

PH: Partners Healthcare; RPDR: Research Patient Data Repository; BIDMC: Beth Israel Deaconess Medical Center; UPMC: University of Pittsburgh Medical Center; MGH: Massachusetts General Hospital; BWH: Brigham and Women's Hospital

Table 1. n2c2 datasets description.

Pipeline

Implemented modules:

create_dataset: it takes as input files downloaded from n2c2 NLP research data sets and it combines them into a unique output. All raw-text clinical notes are output in a table format (columns: NOTE_ID, NOTE_TEXT).
note_tokenization: it takes as input all notes and it tokenizes them at the sentence level. It saves tokenized notes to a file with a sentence per line and each note separated by an empty line.
create_pretraining: it creates the DatasetDict object for ClinicalBERT pretraining. Code modified from ClinicalBERT and BERT.
fine_tune_bert: this module pretraines the already pretrained ClinicalBERT model on masked language model and next sequence prediction tasks and it evaluates it at each epoch on the test set (which here serves as validation). The name of the module reflects the fact that the ClinicalBERT has already been pretrained on clinical notes. It returns the best model with early stopping and performances on train and validation (i.e., loss, accuracy, and PPL), see metrics.py.

Create n2c2 dataset

We consider the clinical notes from the n2c2 datasets for i2b2 challenges. Some notes are shared among tasks hence the create_dataset.py module combines all unique notes from challenges into training and test sets. Edit utils.py to specify challenge and folder names. Modify input/output directories and output file name in create_dataset.sh if needed. Then run:

sh create_dataset.sh

Output: (1) train|test_n2c2_dataset.txt files with (note_id, note_text) columns; (2) train|test_newk_to_oldk.txt files with NOTE_ID, CH_ID, CH_NAME columns storing the new-to-old note id correspondence. Output folder: ./datasets/n2c2_datasets.

Note tokenization

The note_tokenization.py module tokenizes the notes generated by the create_dataset step. The tokenization process happens in two steps:

By sentence: sentences are defined as (a) delimited by full stop "."; (b) item in list (numeric or bullet point);
At word-level: we use spacy en_core_sci_md-0.4.0 model tokenizer and defined a custom tokenizer for special tokens. Specifically, we consider as a unique word (a) de-identifiers; (b) dates; (c) times; (d) phone numbers; (e) lab/test results; and (f) abbreviations.

Output: a train|test_sentences.txt file in the datasets/n2c2_datasets output folder that stores a sentence per line. Different documents are separated by a blank line. Tokenization at word level can be obtained splitting the sentences at " " (space character).

To run the code first modify the required fields in note_tokenization.sh then run:

sh note_tokenization.sh

Create DatasetDict for BERT pretraining

The create_pretraining.py module loads the datasets.Dataset object created following the huggingface guide, see Dataset loading section. It outputs a DatasetDict object for BERT model pretraining and saves it as n2c2datasets_forClinicalBERTfinetuning_maxseqlen<N>.pkl in the output folder .

Because we want to intrinsically evaluate the model's performance through PPL, we need to evaluate it on a language model task. Hence, for each sentence we replaced the last token (before final [SEP]) with [MASK] during preprocessing. This to measure PPL in terms of the ability of the model to predict that last token with only previous words as context. Word-level tokenization and sentence length were done according to the sub-word vocabulary used for ClinicalBERT and based on the maximum number of words per sentence allowed by the model's configuration (e.g., 128), respectively.

Run:

sh create_pretraining.sh

specifying the following hyperparameters

max_seq_length: sequence length;

max_predictions_per_seq: maximum number of [MASK] tokens per sequence;

short_seq_prob: probability of creating sequences shorter than max_seq_length;

masked_lm_prob: percentage of token positions randomly selected per sentence;

dupe_factor: number of times data should be duplicated with different masks.

Fine-tuning pretrained ClinicalBERT

The fine_tune_bert.py module takes as input the DatasetDict object with clinical notes preprocessed for masked language model (MLM) and next sequence prediction (NSP) tasks and it outputs the best model configuration after training. The module combines train.py, test.py, and metrics.py to train the pretrained ClinicalBERT on our datasets and evaluate its performance on the validation set in terms of PPL, computed as the exponential of the mean cross-entropy loss when predicting the last word of each sentence.

Run:

sh fine_tune_bert.sh

Hyperparameters:

epochs: number of complete training passes;

batch_size: number of samples to process (e.g., 256, 1024 from the BERT paper);

learning_rate: learning rate for gradient descent (e.g., 1e-4 from the BERT paper);

num_training_steps: total number of training steps (i.e., number of sentences * epochs);

num_warmup_step: number of warmup steps for AdamW optimizer with linear learning rate decay with warmup, it should be 1% of the total training steps;

patience: (number of epochs - 1) before early stopping.

Remarks: BERT experiments run 1024128 or 256512 tokens/batch on 3.3B words. They trained the model for 40 epochs, which correspond to ~1M training steps (epochs*batches). Their warmup steps are 10000, i.e., 1% of the training steps, and half of one epoch, which includes 25177 batches.

For our experiment we have 256*128 tokens/batch, which correspond to ~900 batches. For a warmup steps of 400, we have to fine-tune our MLM for 40,000 steps, i.e., 45 epochs (although we include the early stopping with patience 5 to avoid overfitting).

Dataset loading

The module datasets/n2c2_datasets/n2c2_datasets prepares the input Dataset objects. Several configurations are implemented, according to the needed task. It was implemented based on the huggingface guide.

The script datasets/n2c2_datasets/n2c2_datasets.py organizes notes in a DatasetDict object, with keys "train|test" and values Dataset. Available configurations are:

language_model where objects have features "sentence", "document", "challenge" and input datasets are found in datasets/n2c2_datasets;
smoking_challenge where objects have features "note", "id", "label" and input datasets are found in folder datasets/2006_smoking_status;

The script was tested running the following command in the project folder (root of the datasets folder):

Step 1

datasets-cli test datasets/n2c2_datasets --save_infos --all_configs

which return a dataset_infos.json file with dataset information.

Step 2

In order to generate a dummy version for each dataset configuration we can run:

python ./datasets/n2c2_datasets/datasets_cli.py dummy_data ./datasets/n2c2_datasets \
--auto_generate \
--n_lines=100 \
--match_text_files='train_sentences.txt,test_sentences.txt'

Dummy compressed folders can then be found in:

datasets/n2c2_datasets/dummy/0.0.1/dummy_data.zip for the language_model configuration;
datasets/{config.data_folder}/dummy/0.0.1/dummy_data.zip for the other configurations.

A cached version of the data will be stored at ~/.cache/huggingface/datasets/n2c2_dataset/default/0.0.1.

Remark: this loading dataset script can be edited to add dataset configurations other than the "language model", e.g., for specific tasks.

Task-based model

Smoking challenge (2006)

A corresponding dataset configuration, i.e., smoking_challenge in the n2c2_datasets.py module allows for the creation of the task Dataset. As a first step data are organized in labeled train/test sentences through the create_task_datasets.py script.

Run:

python -m create_task_datasets name_challenge raw challenge_folder (e.g., challenge_folder=2006_smoking_status) or to create the task dataset for the non-redundant notes (raw);

python -m create_task_datasets name_challenge synthetic challenge_folder (e.g., challenge_folder=2006_smoking_status) to create the task datasets from the synthetic notes.

To prepare the DatasetDict for the newly pretrained ClinicalBert fine-tuning run:

sh create_finetuning.sh

For the fine-tuning step:

sh note_classification.sh

Cohort selection (2018 task 1)

References

[1] Uzuner, Ö., Goldstein, I., Luo, Y., & Kohane, I. (2008). Identifying patient smoking status from medical discharge records. Journal of the American Medical Informatics Association, 15(1), 14-24.

[2] Uzuner, Ö. (2009). Recognizing obesity and comorbidities in sparse data. Journal of the American Medical Informatics Association, 16(4), 561-570.

[3] Uzuner, Ö., Solti, I., & Cadag, E. (2010). Extracting medication information from clinical text. Journal of the American Medical Informatics Association, 17(5), 514-518.

[4] Uzuner, Ö., South, B. R., Shen, S., & DuVall, S. L. (2011). 2010 i2b2/VA challenge on concepts, assertions, and relations in clinical text. Journal of the American Medical Informatics Association, 18(5), 552-556.

[5] Uzuner, O., Bodnari, A., Shen, S., Forbush, T., Pestian, J., & South, B. R. (2012). Evaluating the state of the art in coreference resolution for electronic medical records. Journal of the American Medical Informatics Association, 19(5), 786-791.

[6] Sun, W., Rumshisky, A., & Uzuner, O. (2013). Evaluating temporal relations in clinical text: 2012 i2b2 Challenge. Journal of the American Medical Informatics Association, 20(5), 806-813.

[7] Kumar, V., Stubbs, A., Shaw, S., & Uzuner, Ö. (2015). Creation of a new longitudinal corpus of clinical narratives. Journal of biomedical informatics, 58, S6-S10.

[8] Stubbs, A., Filannino, M., Soysal, E., Henry, S., & Uzuner, Ö. (2019). Cohort selection for clinical trials: n2c2 2018 shared task track 1. Journal of the American Medical Informatics Association, 26(11), 1163-1171.

[9] Henry, S., Buchan, K., Filannino, M., Stubbs, A., & Uzuner, O. (2020). 2018 n2c2 shared task on adverse drug events and medication extraction in electronic health records. Journal of the American Medical Informatics Association, 27(1), 3-12.

landiisotta / redundancy_transformers Goto Github PK

redundancy_transformers's Introduction

Clinical notes redundancy

Data

Pipeline

Create n2c2 dataset

Note tokenization

Create DatasetDict for BERT pretraining

Fine-tuning pretrained ClinicalBERT

Dataset loading

Step 1

Step 2

Task-based model

Smoking challenge (2006)

Cohort selection (2018 task 1)

References

redundancy_transformers's People

Contributors

Watchers

Forkers

redundancy_transformers's Issues

Recommend Projects

Recommend Topics

Recommend Org