mp2893 / med2vec
Repository for Med2Vec project
License: BSD 3-Clause "New" or "Revised" License
Using our own EHR data and the default parameters of med2vec, the cost went to NaN in epoch 1. Which parameter should I adjust to avoid this? Should I increase L2 or set a bigger log_eps? We have over 100 thousand batches in total; do we need to set a bigger batch_size?
Hello, thank you for making your code available.
You mentioned two experiments in your paper, but I don't understand how to run them.
Could you please explain clearly how to do these experiments?
Thank you!
Hello Ed,
In the Med2Vec code, the weights are initialized in the range (-0.01, 0.01), which generates negative code embeddings:
params['W_emb'] = np.random.uniform(-0.01, 0.01,
However, the paper says that the model maps "all medical codes C to non-negative real-valued vectors of dimension m".
Can you please help me understand this?
Thanks,
Ankit
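A minimal sketch of one possible reading, assuming the code-level representation is computed as u_t = ReLU(W_emb x_t + b_emb), the formula quoted from the author's reply in another issue on this page. Under that assumption the ReLU output is non-negative even when individual entries of W_emb are negative. The sizes and the random seed are hypothetical, and whether this is the explanation the author intends is not confirmed here.

```python
import numpy as np

rng = np.random.default_rng(0)
num_codes, emb_dim = 10, 4  # hypothetical sizes, for illustration only

# Same initialization range as in the question: entries can be negative.
W_emb = rng.uniform(-0.01, 0.01, size=(num_codes, emb_dim))
b_emb = np.zeros(emb_dim)

# A visit containing codes 2 and 5, encoded as a multi-hot vector.
x = np.zeros(num_codes)
x[[2, 5]] = 1.0

# ReLU clamps every negative pre-activation to zero.
u = np.maximum(x @ W_emb + b_emb, 0.0)

has_negative_weights = bool((W_emb < 0).any())
print(has_negative_weights, u)
```

Here W_emb contains negative entries while every entry of u is zero or positive.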
Hello Choi, thanks for sharing the code on GitHub; it is a great topic.
After reading several of your papers, I have a few questions:
Do you have visit representation evaluation results on MIMIC-III? Compared with your GRAM model, which one performs better? (I ask because on CHOA the recall@30 is around 76%, while in the GRAM paper the accuracy@20 on MIMIC-III is relatively low, around 30% on average.)
When you learn the vector representations of medical concepts, you want these vectors to end up in the same common space. But does it make sense to treat them as belonging to one common space in the first place? For example, you build one dictionary for procedure codes, diagnosis codes and medication codes, and then build a single one-hot vector over all these codes.
Thanks
Where to download the dataset described in your paper? Are they all publicly available?
Hi Dr. Choi,
Thank you for sharing your work on Github!
Could you please tell me where I find the mapping between the ICD codes and the embeddings? I was testing med2vec on demo MIMIC data and W_emb is an array of dimensions 4894*200, where 4894 are the unique ICD codes, could you please let me know where I can find the mapping between W_emb and the corresponding ICD code names?
Thanks a lot!
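A minimal sketch of how such a lookup could work, assuming preprocessing produced a dictionary from code string to integer index and that row i of W_emb belongs to the code with index i. The dictionary contents, the code strings and the tiny W_emb below are hypothetical; whether the repository's preprocessing script saves such a dictionary is not confirmed in this thread.

```python
import numpy as np

# Hypothetical mapping from code string to the integer used in the sequence file.
code_to_index = {"D_401.9": 0, "D_250.00": 1, "D_428.0": 2}

# Hypothetical stand-in for the learned W_emb (3 codes, 2 dimensions).
W_emb = np.arange(6, dtype=float).reshape(3, 2)

# Invert the mapping so that a row index leads back to the code string.
index_to_code = {idx: code for code, idx in code_to_index.items()}

# Pair every row of W_emb with its code string.
embedding_by_code = {index_to_code[i]: W_emb[i] for i in range(W_emb.shape[0])}

print(embedding_by_code["D_250.00"])
```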
Hello Ed,
In Med2Vec, after creating the model file, you created a 2D scatter plot of the learned code representations. Was any grouping of the medical codes performed after creating the model file, before drawing the scatter plot?
I ask because in Highcharts the coloring is based on some grouping.
example:
https://jsfiddle.net/gh/get/library/pure/highcharts/highcharts/tree/master/samples/highcharts/demo/scatter/
I tried to create a scatter plot after running t-SNE on the embeddings. The plot is created, but there is no grouping: the colors are placed randomly and no clusters form.
Can you please help me understand this?
Thanks,
SathickIbrahim
First of all, thank you for making your code available. This is a very interesting line of research.
In order to better understand your work, I rewrote med2vec in Python with TensorFlow rather than Theano. I then compared my results to yours on the MIMIC-III data set, with the same parameters used, expecting the results to be close. However, I discovered some issues:
In the calculation of visit forward cross-entropy, there are negative values. This leads to some cancellations and hence a visit cost of 300-400. Should negative values be considered here? In Tensorflow, the negative values are mapped to 0, giving a visit cost of ~4,000. What sort of cost values have you seen on MIMIC-III and other data sets?
My emb cost values are roughly equal to yours, but the value is < 10. Since visit cost >> emb cost, won't total cost just optimize for visits and not emb?
Below are some images related to visit cost calculation.
I'd be happy to make my code available to you, if you like. Thank you for your continued work on medical data analytics and I look forward to hearing back from you.
I am getting an error when trying to train on the GPU. There are 47108 unique codes (quite a lot more than in MIMIC), but I still get the error even when using code and visit representations of just 5 dimensions and a batch size of just 1, so I don't believe it is an out-of-memory problem. That is, of course, if my math is right:
2 dense vectors of 47108 doubles: just 1 MB
47108x5 matrix for dense to code representation: 2 MB
5x5 matrix for code representation to visit: 200 bytes
Any help will be appreciated!
Using gpu device 0: Tesla P100-SXM2-16GB (CNMeM is enabled with initial size: 80.0% of memory, cuDNN 5110)
initializing parameters
building models
loading data
training start
[[ 0. 0. 0. ..., 0. 0. 0.]]
Traceback (most recent call last):
File "med2vec.py", line 323, in
train_med2vec(seqFile=args.seq_file, demoFile=args.demo_file, labelFile=args.label_file, outFile=args.out_file, numXcodes=args.n_input_codes, numYcodes=args.n_output_codes, embDimSize=args.cr_size, hiddenDimSize=args.vr_size, batchSize=args.batch_size, maxEpochs=args.n_epoch, L2_reg=args.L2_reg, demoSize=args.demo_size, windowSize=args.window_size, logEps=args.log_eps, verbose=args.verbose)
File "med2vec.py", line 290, in train_med2vec
cost = f_grad_shared(x, mask, iVector, jVector)
File "/usr/local/lib/python2.7/dist-packages/theano/compile/function_module.py", line 898, in call
storage_map=getattr(self.fn, 'storage_map', None))
File "/usr/local/lib/python2.7/dist-packages/theano/gof/link.py", line 325, in raise_with_op
reraise(exc_type, exc_value, exc_trace)
File "/usr/local/lib/python2.7/dist-packages/theano/compile/function_module.py", line 884, in call
self.fn() if output_subset is None else
RuntimeError: Cuda error: GpuElemwise node_ea5fafbdcfd074e674342684e5c33a10_0 Exp: an illegal memory access was encountered.
n_blocks=30 threads_per_block=256
Call: kernel_Exp_node_ea5fafbdcfd074e674342684e5c33a10_0_Ccontiguous<<<n_blocks, threads_per_block>>>(numEls, i0_data, o0_data)
Apply node that caused the error: GpuElemwise{Exp}(0, 0)
Toposort index: 35
Inputs types: [CudaNdarrayType(float32, matrix)]
Inputs shapes: [(47108, 47108)]
Inputs strides: [(47108, 1)]
Inputs values: ['not shown']
Outputs clients: [[GpuCAReduce{add}{0,1}(GpuElemwise{Exp}[(0, 0)].0), GpuElemwise{Mul}[(0, 1)](GpuDimShuffle{0,x}.0, GpuElemwise{Exp}[(0, 0)].0)]]
Edit: I also checked that the 47108 codes themselves do not cause the problem, by using only 10 codes and adding x[idx][np.array(seq)%numXcodes] = 1. in padMatrix.
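As a rough check of the memory estimate, the traceback above reports that the failing Exp node receives a float32 matrix of shape (47108, 47108), which is much larger than the vectors and small matrices counted in the question. The arithmetic below only sizes that one matrix. Whether its size is what triggers the illegal memory access is an assumption, not something the log confirms.

```python
num_codes = 47108        # from "Inputs shapes: [(47108, 47108)]" in the traceback
bytes_per_float32 = 4    # the traceback reports CudaNdarrayType(float32, matrix)

num_elements = num_codes * num_codes
matrix_bytes = num_elements * bytes_per_float32
matrix_gib = matrix_bytes / 2**30

# Largest value a signed 32-bit integer can hold, for comparison.
int32_max = 2**31 - 1

print(num_elements, matrix_bytes, round(matrix_gib, 2), num_elements > int32_max)
```

The matrix holds 2,219,163,664 elements, which is about 8.27 GiB in float32 and more elements than a signed 32-bit index can address. The Exp node also has to write an output of the same shape.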
Hi Ed,
I appreciate the code you provided, but there is one thing that confuses me.
Regarding Step 3.5, could you please give some details about how to make 'demo.txt', and add the code that creates it to process_mimic.py?
Thanks!
It would be quite interesting if this approach could be used to match ICD codes across different language versions, or to do a sort of machine translation of medical concepts based on the term representations.
Hello,
During training, the cost goes to NaN, probably because one of the weights becomes too large and the values exceed the range of float32. This causes all the other weights to become NaN as well. I think the classic way to deal with this is to add Batch Normalization layers, which clip large updates to the weights; however, my limited understanding of Theano and your script prevents me from testing it out. Also, the cost seems quite high. Have you seen similar values in your training? Let me know your thoughts on this.
For example, in [5,8,15], what does code 5 at a certain visit stand for? And 8 and 15? Thanks!
Hi Edward,
I have run your code on my dataset and got the .npz files. I find it contains 6 numpy.array variables W_emb b_output b_hidden b_emb W_output W_hidden. But on the GitHub Repo I can’t find further description about these output variables. Can you give some detailed instruction about the output? How can I get the code and visit representation?
Hi, I read your paper and wanted to ask if the trained (on MIMIC-III) model/W2V can be downloaded directly - I'd like to see and to try it on our internal medical data before setting up resources to try to train a new model. (And training a new model from scratch would take weeks - It's a common practice to share the base model).
thanks!
Hi Ed,
I am training embeddings using your default hyperparameters, except for window_size. The minimum number of visits in my dataset is 2, but I set window_size=3, as I suppose your code can handle a mismatch between window_size and the actual sequence length. Am I right?
I also noticed that mean_cost reached its minimum at the 2nd epoch and then started increasing. Although I read in your paper that the number of epochs does not hurt the code representations very much, I am not sure which epoch to choose once training has finished. Should I use the one with the minimum cost, or the one from the last epoch?
Hi, thank you for the awesome paper. I've been very interested in getting med2vec up and running using just the 3 required parameters for starters.
With my pickled sequence list looking like: [[1,2,3], [4,5,6,7], [-1], [2,4], [8,3,1], [3]]
I get the following error:
File "test_med2vec.py", line 248, in train_med2vec
grads = T.grad(cost, wrt=tparams.values())
TypeError: Expected Variable, got odict_values([W_emb, b_emb, W_hidden, b_hidden, W_output, b_output]) of type <class 'odict_values'>
Here's my run syntax:
python3 test_med2vec.py 'seq.pkl' 8 'med2vec_fin'
Any idea why it's not liking the dictionary values? Thank you for your time, if you're able to help.
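A minimal sketch of what the error message points to: the command above runs the script under python3, where dict.values() returns a view object rather than a list, and the TypeError shows that this view (odict_values) is what reached T.grad. Wrapping the view in list() is the usual Python 3 fix. The placeholder strings below stand in for Theano shared variables, and the suggested change to the T.grad call is an untested assumption about this script.

```python
from collections import OrderedDict

# Placeholder values; in the script these would be Theano shared variables.
tparams = OrderedDict([("W_emb", "W_emb_var"), ("b_emb", "b_emb_var")])

values_view = tparams.values()        # Python 3: an odict_values view, not a list
values_list = list(tparams.values())  # an ordinary list, in insertion order

print(type(values_view).__name__, values_list)

# Assumed fix for the failing line (not run against Theano here):
# grads = T.grad(cost, wrt=list(tparams.values()))
```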
Hi Ed,
As mentioned in your paper, "Therefore the complexity of Med2Vec is dominated by the code representation learning process, for which we use the Skip-gram algorithm".
I know you use grouper/parent codes to decrease the complexity of visit-level learning process. But it seems that you didn't do much on the code-level part.
Is there a reason why you do not use methods like negative sampling to decrease the complexity of code level learning process?
Thanks
Xianlong
Hello Choi,
As you mentioned in the paper, you use the AHFS classification to group the NDC codes. I wonder if you still have that mapping table, or could point me to a way to find it?
I am doing a related work but can't find the table anywhere online.
Thank you!
Hi Edward,
While I was searching for a new research idea, I've found your model and it was interesting in that it can learn code- and visit-level representation from EHRs simultaneously.
Using your model, I'm trying to learn embeddings that can represent measurements other than medical codes. However, the training cost seems quite high (around 150 to 250), and it doesn't converge (or go below 1) the way other models do. I've seen that other users report costs in the same range, but I wonder what the final cost was at the end of your training.
Is it natural for this model to have such a high cost at the end of training, or is something wrong with my settings? I've adjusted the model's parameters, but a cost of 170 was the best I could get.
I would appreciate your help.
Hello Ed @mp2893,
This is super interesting work! I have two questions regarding the interpretation of learned representations.
Thank you very much!
Hi there, thanks a lot for making the code available; it helps me a lot in understanding your paper.
I have a question about the format of the training data. In README.md, step 3, which describes how to prepare the training data, each visit is said to be represented by a list of integers, such as [5,8,13]. In the closed issue "output file" (#7), you answered TheodoreZhao's question and said that "For visit representation, you can derive the code-level representation using u_t = ReLU(W_emb x_t + b_emb), possibly with a multi-hot vector, then use v_t = ReLU(W_hidden u_t + b_hidden) to derive the visit representation ".
So my question is: if I represent a visit with a list of integers in the training step, and compute a visit representation based on a multi-hot vector, what is the relationship between the list and the multi-hot vector? How do I get the multi-hot vector for each visit?
thanks.
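A minimal sketch of one way to read the quoted formulas, assuming each integer in a visit list is the position that is set to 1 in the multi-hot vector x_t. It also assumes W_emb has one row per code (consistent with the 4894*200 shape reported in another issue here), W_hidden has shape (code dimension, visit dimension), and no demographic vector is used. The random parameters and sizes are hypothetical stand-ins for the arrays stored in the .npz file.

```python
import numpy as np

def multi_hot(visit, num_codes):
    """Turn a visit such as [5, 8, 13] into a 0/1 vector of length num_codes."""
    x = np.zeros(num_codes, dtype=np.float32)
    x[visit] = 1.0
    return x

def relu(z):
    return np.maximum(z, 0.0)

def visit_representation(visit, params):
    """Apply u_t = ReLU(W_emb x_t + b_emb), then v_t = ReLU(W_hidden u_t + b_hidden)."""
    num_codes = params["W_emb"].shape[0]
    x_t = multi_hot(visit, num_codes)
    u_t = relu(x_t @ params["W_emb"] + params["b_emb"])
    v_t = relu(u_t @ params["W_hidden"] + params["b_hidden"])
    return u_t, v_t

# Hypothetical random parameters; a trained model would supply these arrays.
rng = np.random.default_rng(0)
params = {
    "W_emb": rng.uniform(-0.01, 0.01, size=(20, 4)),
    "b_emb": np.zeros(4),
    "W_hidden": rng.uniform(-0.01, 0.01, size=(4, 3)),
    "b_hidden": np.zeros(3),
}

x_t = multi_hot([5, 8, 13], 20)
u_t, v_t = visit_representation([5, 8, 13], params)
print(x_t, u_t, v_t)
```

With 20 codes, the visit [5,8,13] becomes a length-20 vector with ones at positions 5, 8 and 13 and zeros elsewhere.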
Hi Ed,
I saw in your code that the weights are initialized with a truncated normal distribution. When I ran it, it seemed that in the medical-code-loss part this fed large values to exp and resulted in inf in the loss and NaN gradients. Also, because of such initial weights, the loss in general is pretty high, around several hundred, and the L2 loss in particular is in the tens of thousands. I then changed the weight initialization to be uniform over a small interval, [-0.1, 0.1]. That seems to produce a loss of reasonable magnitude (under 10). I wonder if you still remember whether you tried other weight initializations and how they affected the results.
Another question I have is that, in the paper, the loss is averaged over T. Is this T the number of visits in the batch, or the number of visits per patient? In your code, it seems your ivec and jvec are generated for the whole batch. So is the medical-code-loss calculation averaging over all visits in a batch, instead of averaging per patient and then averaging over all patients in the batch?
Thanks!
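A minimal sketch of the overflow described above, using NumPy rather than the model code. It shows how a large float32 score makes a naive softmax produce inf and then NaN, and how the standard max-subtraction (log-sum-exp) trick avoids it. This is a general numerical technique swapped in for illustration; it is not claimed to be what med2vec does, and the score values are hypothetical.

```python
import numpy as np

def naive_log_softmax(scores):
    # exp() overflows to inf in float32 once a score exceeds roughly 88.
    exp_scores = np.exp(scores)
    return np.log(exp_scores / exp_scores.sum(axis=1, keepdims=True))

def stable_log_softmax(scores):
    # Subtracting the row maximum leaves the softmax unchanged mathematically
    # but keeps every exponent at or below zero.
    shifted = scores - scores.max(axis=1, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))

# Hypothetical scores with one very large entry.
scores = np.array([[100.0, 0.0, -100.0]], dtype=np.float32)

with np.errstate(over="ignore", invalid="ignore", divide="ignore"):
    naive = naive_log_softmax(scores)
stable = stable_log_softmax(scores)

print(naive, stable)
```

The naive version contains NaN for the large score, while the stable version stays finite for the same input.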