
gram's Introduction

GRAM

GRAM is a prediction framework that can use domain knowledge in the form of a directed acyclic graph (DAG). Domain knowledge is incorporated into the training process using the attention mechanism. By introducing well-established knowledge into training, we can learn high-quality representations of medical concepts that lead to more accurate predictions. The prediction task can take any form, such as static prediction, sequence classification, or sequential prediction.

t-SNE scatterplot of medical concepts trained with the combination of RNN and Multi-level Clinical Classification Software for ICD9 (the color of each dot represents the most general description of the ICD9 diagnosis code).

Relevant Publications

GRAM implements the algorithm introduced in the following paper:

GRAM: Graph-based Attention Model for Healthcare Representation Learning
Edward Choi, Mohammad Taha Bahadori, Le Song, Walter F. Stewart, Jimeng Sun  
Knowledge Discovery and Data Mining (KDD) 2017

Code Description

The current code trains an RNN (Gated Recurrent Units) to predict, at each timestep (i.e. visit), the diagnosis codes occurring in the next visit. This is denoted as Sequential Diagnoses Prediction in the paper. In the future, we will release another version for making a single prediction for the entire visit sequence (e.g. predicting the onset of heart failure given the visit record).

Note that the current code uses the Multi-level Clinical Classification Software for ICD-9-CM as the domain knowledge. We will release a version that uses the ICD9 Diagnosis Hierarchy in the future.

Running GRAM

STEP 1: Installation

  1. Install Python and Theano. We use Python 2.7 and Theano 0.8.2. Theano can be easily installed on Ubuntu as suggested here

  2. If you plan to use GPU computation, install CUDA

  3. Download/clone the GRAM code

STEP 2: Fastest way to test GRAM with MIMIC-III

This step describes how to run GRAM, with a minimum number of steps, for predicting future diagnosis codes using MIMIC-III.

  1. You will first need to request access to MIMIC-III, a publicly available electronic health record dataset collected from ICU patients over 11 years.

  2. You can use "process_mimic.py" to process the MIMIC-III dataset and generate a training dataset suitable for GRAM. Place the script in the same location as the MIMIC-III CSV files and run it. Instructions are described inside the script.

  3. Use "build_trees.py" to build files that contain the ancestor information of each medical code. This requires "ccs_multi_dx_tool_2015.csv" (Multi-level CCS for ICD9), which can be downloaded from here. Running this script will re-map integer codes assigned to all medical codes. Therefore you also need the ".seqs" file and the ".types" file created by process_mimc.py. The execution command is python build_trees.py ccs_multi_dx_tool_2015.csv <seqs file> <types file> <output path>. This will build five files that have ".level#.pk" as the suffix. This will replace the old ".seqs" and ".types" files with the correct ones. (Tian Bai, a PhD student from Temple University found out there was a problem with the re-mapping issue, which is now fixed. Thanks Tian!)

  4. Run GRAM using the ".seqs" file generated by build_trees.py. The ".seqs" file contains the sequence of visits for each patient, where each visit consists of multiple diagnosis codes. Instead of using the same ".seqs" file as both the training feature and the training label, we recommend using the ".3digitICD9.seqs" file, also generated by process_mimic.py, as the training label for better performance and easier analysis. The command is python gram.py <seqs file> <3digitICD9.seqs file> <tree file prefix> <output path>.
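As a quick sanity check before running GRAM, you can load the generated ".seqs" file and confirm its structure (a list of patients, each a list of visits, each a list of integer codes). This is only a minimal sketch; "processed.seqs" is a placeholder for whatever output path you used.

    # Minimal sketch: inspect a ".seqs" file produced by process_mimic.py / build_trees.py.
    # "processed.seqs" is a placeholder path; substitute your own output prefix.
    try:
        import cPickle as pickle  # Python 2.7, as the README assumes
    except ImportError:
        import pickle  # Python 3 fallback

    seqs = pickle.load(open('processed.seqs', 'rb'))
    print('number of patients: %d' % len(seqs))
    print('visits of the first patient: %s' % seqs[0])   # e.g. [[12, 47, 103], [5, 12]]
    print('codes in the first visit: %s' % seqs[0][0])   # a list of integer codes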

STEP 3: How to pretrain the code embedding

For sequential diagnoses prediction, it is very effective to pretrain the code embeddings with a co-occurrence-based algorithm such as word2vec or GloVe. In the paper, we use GloVe for its speed, but either algorithm should be fine. Here we release code to pretrain the code embeddings with GloVe.

  1. Use "create_glove_comap.py" with ".seqs" file, which is generated by build_trees.py. (Note that you must run build_trees.py first before training the code embedding) The execution command is python create_glove_comap.py <seqs file> <tree file prefix> <output path>. This will create a file that contains the co-occurrence information of codes and ancestors.

  2. Use "glove.py" on the co-occurrence file generated by create_glove_comap.py. The execution command is python glove.py <co-occurrence file> <tree file prefix> <output path>. The embedding dimension is set to 128. If you change this, be careful to use the same value when training GRAM.

  3. Use the pretrained embeddings when you train GRAM. The command is python gram.py <seqs file> <3digitICD9.seqs file> <tree file prefix> <output path> --embed_file <embedding path> --embed_size <embedding dimension>. As mentioned above, be sure to set the correct embedding dimension.
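For intuition, the co-occurrence file built in step 1 is essentially a count of how often pairs of codes appear in the same visit. The following is a rough, hypothetical sketch of that idea on a plain visit list; it is not the actual create_glove_comap.py logic and it ignores the ancestor expansion that the real script performs.

    # Hypothetical sketch of visit-level co-occurrence counting (not the actual script).
    from collections import defaultdict
    from itertools import combinations

    def count_cooccurrences(seqs):
        """seqs: list of patients, each a list of visits, each a list of integer codes."""
        counts = defaultdict(int)
        for patient in seqs:
            for visit in patient:
                for i, j in combinations(sorted(set(visit)), 2):
                    counts[(i, j)] += 1
        return counts

    cooc = count_cooccurrences([[[1, 2, 3], [4, 5, 6, 7]], [[2, 4], [8, 3, 1], [3]]])
    print(sorted(cooc.items())[:5])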

STEP 4: How to prepare your own dataset

  1. GRAM's training dataset needs to be a pickled Python list of lists of lists. The outer list corresponds to patients, the middle list to visits, and the inner list to medical codes (e.g. diagnosis codes, medication codes, procedure codes, etc.). First, each medical code needs to be converted to an integer. Then a single visit can be seen as a list of integers, and a patient can be seen as a list of visits. For example, [5,8,15] means the patient was assigned codes 5, 8, and 15 at a certain visit. If a patient made two visits [1,2,3] and [4,5,6,7], they can be converted to a list of lists [[1,2,3], [4,5,6,7]]. Multiple patients can be represented as [[[1,2,3], [4,5,6,7]], [[2,4], [8,3,1], [3]]], which means there are two patients where the first patient made two visits and the second patient made three visits. This list of lists of lists needs to be pickled using cPickle. We will refer to this file as the "visit file". (See the sketch after this list.)

  2. The label dataset (let us call this the "label file") needs to have the same format as the "visit file". The important thing is that the time steps of the "label file" and the "visit file" need to match. DO NOT train GRAM with labels that are one time step ahead of the visits. It is tempting, since GRAM predicts the labels of the next visit, but this is taken care of internally. You can use the "visit file" as the "label file" if you want GRAM to predict the exact codes. Or you can use grouped codes as the "label file" if you are okay with reasonable predictions and want to save time. For example, ICD9 diagnosis codes can be grouped into 283 categories by using the CCS groupers. We STRONGLY recommend that you do this, because the number of medical codes can be as high as tens of thousands, which can cause not only low predictive performance but also memory issues. (High-end GPUs typically have only 12GB of VRAM.)

  3. Use the "build_trees.py" to create ancestor information, using the "visit file". You will also need a mapping file between the actual medical code names (e.g. "419.10") and the integer codes. Please refer to Step 2 to learn how to use "build_trees.py" script.

STEP 5: Hyper-parameter tuning used in the paper

This document provides the details regarding how we conducted the hyper-parameter tuning for all models used in the paper.

gram's People

Contributors

chadyuu, mp2893


gram's Issues

Label

Hi, I have discovered that your training label and training feature are the same. Your purpose is to predict the codes of the next visit, so I suppose the training label should be one time step later than the training feature.

Null Level two

Excuse me, just an easy question. The ".level2.pk" dictionary is always null after running build_trees.py following the instructions without any modification. Has anyone encountered a similar problem, or is there a mistake that could cause it?

gradient with Theano

Thanks for your paper about Medical Prediction: GRAM.
I'm using your code to learn how it works, but I have run into the following problem.

WARNING (theano.tensor.blas): Using NumPy C-API based implementation for BLAS functions.
initializing parameters
loading data
building models
OrderedDict([('w', w), ('w_tilde', w_tilde), ('b', b), ('b_tilde', b_tilde)])
Traceback (most recent call last):
  File "glove.py", line 163, in <module>
    train_glove(infile, inputSize=inputDimSize, batchSize=batchSize, dimensionSize=embDimSize, maxEpochs=maxEpochs, outfile=outfile)
  File "glove.py", line 119, in train_glove
    grads = T.grad(cost, wrt=tparams.values)
  File "/usr/local/lib/python3.8/dist-packages/theano/gradient.py", line 501, in grad
    raise TypeError("Expected Variable, got " + str(elem) +
TypeError: Expected Variable, got <built-in method values of collections.OrderedDict object at 0x7f778bf1c1c0> of type <class 'builtin_function_or_method'>

That is the result when I run glove.py. I don't know the reason why it happens.
Thank you for your attention.

Le Ngoc Duc.
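For reference, the traceback indicates that T.grad received the bound .values method itself rather than the list of shared variables, and glove.py targets Python 2.7, where dict.values() returns a list. The following is a hedged sketch of the kind of change that typically resolves this when running under Python 3, assuming tparams is an OrderedDict of Theano shared variables as the printed output suggests.

    # In glove.py's train_glove (around line 119 in the traceback):
    #   before: grads = T.grad(cost, wrt=tparams.values)
    # Calling the method and, under Python 3 (where dict.values() is a view rather than
    # a list), wrapping it in list() gives T.grad the actual shared variables:
    grads = T.grad(cost, wrt=list(tparams.values()))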

Domain knowledge graph issue?

Dear Choi,
I do love your concept of a graph-based attention model for healthcare, and I am trying to do some work around it.
First things first, I got stuck with the generated graph. I tried using MIMIC as a demo, but the generated graph shows what seems to be a questionable mapping: one patient with a cancer ICD9 code was mapped to infectious and parasitic diseases (PID 124, old types 186, new types 50, ICD9: V1011 and 101). Given my limited knowledge, I don't think this is correct; did I miss anything? Let me know if you can't reproduce it.
Secondly, I ported your implementation to TensorFlow and it runs successfully; maybe later I will open a pull request so you can check the code.
Looking forward to hearing from you.
Regards,
Shen

Empty level1.pk when working with the new version of MIMIC

Hi,
I have been trying to run your code by following the instructions (using the ICD9 hierarchy rather than CCS, which I'm sure works fine). However, it turns out that with the new version of MIMIC the generated level1.pk is empty, so I have been getting errors from gram.py, as it assumes all levelX.pk files are non-empty when constructing the tree and the attention model. Can you please help me with that (as it can be a common case in EHR data that no higher-level ICD9 codes are assigned to the patients)?
Thanks in advance

Dimensions not matching?

Hi Edward,

I'm trying to reproduce GRAM results using MIMIC-III data.
If I understand correctly, there are 4894 medical codes used to represent patient visits. So the G matrix (from the paper) has to be of size 4894 x 128 (embedding dimension). However, there are no matrices of that size stored as a result of running gram.py.

Am I missing something or am I supposed to be deriving the G matrix with the help of other stored files? I tried to do this too but the dimensions just don't seem to be matching. Any help will be highly appreciated.

Thanks!
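For context, in the paper the final representation g_i of each leaf code is not stored directly; it is computed as an attention-weighted sum of the code's own embedding and its ancestors' embeddings. So the saved parameters are typically the basic embedding matrix (leaves plus ancestors) and the attention MLP weights, and G has to be re-derived from them. Below is a rough, hypothetical NumPy sketch of that combination step; the parameter names (W_emb, ancestors, attn) are made up for illustration and do not correspond to the files gram.py saves.

    # Hypothetical sketch of deriving one row of G from GRAM-style parameters.
    import numpy as np

    def softmax(x):
        e = np.exp(x - x.max())
        return e / e.sum()

    def derive_g(i, W_emb, ancestors, attn):
        """W_emb: (num leaf codes + num ancestors, dim) basic embeddings;
        ancestors[i]: indices of code i plus all of its ancestors;
        attn(e_i, e_j): the trained attention MLP returning a scalar score."""
        idx = ancestors[i]
        scores = np.array([attn(W_emb[i], W_emb[j]) for j in idx])
        alpha = softmax(scores)  # attention weights over code i and its ancestors
        return (alpha[:, None] * W_emb[idx]).sum(axis=0)  # g_i, shape (dim,)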

How to calculate accuracy@20 in each frequency group?

Hi,Choi:
I have some questions about how to calculate the accuracy@k score for each frequency group. I don't know which of the following two is right: 1. For each frequency group, the top-20 scores are selected to compare with the true labels, and accuracy@20 is calculated for each group individually. 2. Select the top-20 indices over all labels, determine which group each of the 20 indices belongs to, and compare with the labels. I hope you can help me if you know.

Many thanks,
Oldpants
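Not speaking for the authors, but accuracy@k for a single visit is usually computed as the fraction of true codes that appear among the top-k predicted scores; how that is then split across frequency groups is exactly the question above. A generic sketch of the per-visit top-k accuracy, with hypothetical y_hat and y_true inputs, is shown below; it is not the paper's evaluation script.

    # Generic sketch of accuracy@k for one visit (not the authors' evaluation code).
    import numpy as np

    def accuracy_at_k(y_hat, y_true, k=20):
        """y_hat: score vector over all labels; y_true: iterable of true label indices."""
        topk = set(np.argsort(y_hat)[::-1][:k])  # indices of the k highest-scoring labels
        y_true = set(y_true)
        if not y_true:
            return None  # skip visits with no labels
        return len(topk & y_true) / float(len(y_true))

    print(accuracy_at_k(np.random.rand(283), [3, 57, 120], k=20))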

some question about level2.pk and ancestors

Hello, sorry to bother you.
Recently I have been studying your paper and source code, and I want to apply your model to other fields.
But since I am not a medical professional, I find it very difficult to interpret the CCS forms and the .pk files generated by build_trees.py.
What I want to ask you is: why does the inputSize of train_glove() in glove.py depend on the [0][1] element of the level2.pk file generated by build_trees.py, instead of level1.pk or level3.pk, etc.?
My question may be a bit stupid, but if you can enlighten me, thank you very much!!

function arguments

I'm wondering why the code snippet for index in random.sample(range(n_batches), n_batches) in the train_GRAM function in gram.py is passed two arguments, range(n_batches) and n_batches, instead of at most one as in the np.random.sample documentation. It's not working for me at least. Any comment would be highly appreciated.
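For what it's worth, the snippet uses Python's standard-library random.sample(population, k), which does take two arguments and here simply returns the batch indices in a random order; numpy.random.sample is an unrelated function that draws floats in [0, 1). A small sketch illustrating the difference:

    # random.sample(population, k) vs numpy.random.sample(size): two different functions.
    import random
    import numpy as np

    n_batches = 5
    print(random.sample(range(n_batches), n_batches))  # e.g. [3, 0, 4, 1, 2]: shuffled batch indices
    print(np.random.sample(3))                         # e.g. [0.42 0.91 0.07]: uniform floats in [0, 1)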

error running gram.py

Hello,

I followed all the previous steps, but when I run gram.py I get the following error:

Traceback (most recent call last):
  File "gram.py", line 406, in
    numAncestors = get_rootCode(args.tree_file+'.level2.pk') - inputDimSize + 1
  File "gram.py", line 397, in get_rootCode
    rootCode = tree.values()[0][1]
IndexError: list index out of range
Do you have any idea why it's showing that? Any help would be really appreciated.

Thanks.
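An IndexError at tree.values()[0][1] means the dictionary loaded from the ".level2.pk" file is empty (see also the "Null Level two" and "Empty level1.pk" issues above). A quick, hedged check is sketched below; 'output' is a placeholder for the tree file prefix you passed to build_trees.py.

    # Sketch: check whether the levelX.pk files produced by build_trees.py are empty.
    try:
        import cPickle as pickle  # Python 2.7
    except ImportError:
        import pickle  # Python 3 fallback

    for level in range(1, 6):  # build_trees.py produces five ".level#.pk" files
        path = 'output.level%d.pk' % level
        tree = pickle.load(open(path, 'rb'))
        print('%s has %d entries' % (path, len(tree)))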

query about the num of ancestors

Hello, I wonder how to deal with the problem that a child has different numbers of ancestors? Is there a mask?

Looking for your kind reply, thanks very much!
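Not an official answer, but a common way to handle children with different numbers of ancestors is to pad every ancestor list to the same length and carry a binary mask that zeroes out the padded positions before the attention softmax. A hypothetical sketch of that idea:

    # Hypothetical sketch: pad variable-length ancestor lists and build a matching mask.
    import numpy as np

    def pad_ancestors(ancestor_lists, pad_value=0):
        """Pad each list of (code + ancestor) indices to the same length and return a mask."""
        max_len = max(len(a) for a in ancestor_lists)
        padded = np.full((len(ancestor_lists), max_len), pad_value, dtype='int64')
        mask = np.zeros((len(ancestor_lists), max_len), dtype='float32')
        for i, anc in enumerate(ancestor_lists):
            padded[i, :len(anc)] = anc
            mask[i, :len(anc)] = 1.  # 1 for real ancestors, 0 for padding
        return padded, mask

    padded, mask = pad_ancestors([[10, 3, 1], [11, 1]])
    print(padded)
    print(mask)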

Description of arguments

Hello Edward,
Can you give a description of the format of the arguments:

if __name__ == '__main__':
	seqFile = sys.argv[1]
	treeFile = sys.argv[2]
	labelFile = sys.argv[3]
	outPath = sys.argv[4]

comparison between med2vec and gram

Hello Ed,

Nice work!
I didn't pay much attention to this paper at the beginning, since you mentioned in the paper that this method works well when the dataset is small. So I thought Med2Vec would give us better performance when we have a large dataset.

However, now that I look more closely at the paper, it seems that GRAM will have better performance than Med2Vec and the non-negative skip-gram, as the t-SNE scatterplot for GRAM looks much better (dots are well separated) compared to the other two methods.

On the other hand, the medical vectors trained by GRAM are aligned with the given knowledge DAG, which is made by humans and might not be good. As you mentioned in the Med2Vec paper: "the degree of conformity of the code representations to the groupers does not necessarily indicate how well the code representations capture the hidden relationships".

I wonder how you would compare these 2 (or 3, if you count non-negative skip-gram) vector learning methods given a large enough dataset?

Thanks!
xianlong

Needing the code for calculating metrics

Could you please upload your code for calculating the metric "accuracy@k" given y_hat and y?

I am not familiar with Theano and I would appreciate it if you could upload the corresponding code, so I can run some comparative experiments based on your code.

Thanks a lot!

query about "def padMatrix"

Hi! I wonder what the definition of "mask" is in the function padMatrix(...)?

Looking for your kind reply, thanks very much!
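Not an authoritative answer, but in GRAM-style RNN code a padMatrix-style mask is typically a (max_visits x num_patients) binary matrix with 1 where a patient actually has a visit at that time step and 0 where the sequence is just padding, so shorter sequences do not contribute to the loss. A rough sketch of that idea:

    # Rough sketch of a time-step mask like the one padMatrix-style functions usually return.
    import numpy as np

    def make_mask(seqs):
        """seqs: list of patients, each a list of visits; returns a (max_len, n_patients) mask."""
        lengths = [len(patient) for patient in seqs]
        mask = np.zeros((max(lengths), len(seqs)), dtype='float32')
        for i, l in enumerate(lengths):
            mask[:l, i] = 1.  # 1 for real visits, 0 for padded time steps
        return mask

    print(make_mask([[[1, 2], [3]], [[4, 5, 6]]]))  # patient 0 has 2 visits, patient 1 has 1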

Low-frequency labels hard to predict

Hello Dr. Choi,

Thanks for the nice work. I generated the CCS single-level labels as the target and used your code to predict them. All hyperparameters are set according to the appendix. I group labels into five groups according to their frequencies (first ranking all labels by frequency and then dividing them equally into five groups). But my results differ somewhat from those in the paper. I got [0, 0.01835, 0.0811, 0.3042, 0.8263] accuracies for the five groups, respectively. I noticed that I got higher accuracies for high-frequency labels, but cannot match the paper's accuracies for labels in the frequency percentile [0-60]. Is there anything I have done wrong? Furthermore, I found the frequency of labels in the first group (rarest) is only 0.16% of all labels' frequencies (163/96677). I am wondering, is this the correct way to divide into five groups?

Thanks,
Muhan
