mp2893 / med2vec Goto Github PK


Repository for Med2Vec project

License: BSD 3-Clause "New" or "Revised" License

Python 100.00%


med2vec's Issues

How to tune parameters to avoid cost:nan?

Using our own EHR data with med2vec's default parameters, the cost became NaN in epoch 1. Which parameter should I adjust to avoid this? Should I increase L2 regularization or set a larger log_eps? We have over 100 thousand batches in total; do we need a larger batch_size?
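
To illustrate what an epsilon such as log_eps guards against, here is a minimal NumPy sketch. It is not the repository's Theano code, and the function name `binary_cross_entropy` is hypothetical. It assumes the cost takes logs of predicted probabilities, so a prediction of exactly 0 or 1 gives `log(0) = -inf`, and `0 * -inf` is NaN. An epsilon inside the log keeps the result finite; it would not help if the predictions themselves were already NaN or inf.

```python
import numpy as np

def binary_cross_entropy(y_true, y_pred, log_eps=0.0):
    """Element-wise binary cross-entropy with an epsilon added inside each log."""
    y_true = np.asarray(y_true, dtype=np.float32)
    y_pred = np.asarray(y_pred, dtype=np.float32)
    # Silence the warnings so the NaN result can be inspected directly.
    with np.errstate(divide="ignore", invalid="ignore"):
        pos = y_true * np.log(y_pred + log_eps)
        neg = (1.0 - y_true) * np.log(1.0 - y_pred + log_eps)
    return -(pos + neg)

y_true = [0.0, 1.0]
y_pred = [0.0, 1.0]  # saturated predictions, exactly 0 and 1

cost_without_eps = binary_cross_entropy(y_true, y_pred, log_eps=0.0)
cost_with_eps = binary_cross_entropy(y_true, y_pred, log_eps=1e-8)

print(cost_without_eps)  # contains NaN because of 0 * log(0)
print(cost_with_eps)     # finite
```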

Questions about experiments

Hello, thank you for making your code available.
You mentioned two experiments in your paper, but I don't understand how to run them.
Could you please explain clearly how to reproduce these experiments?
Thank you!

Negative Code Embeddings

Hello Ed,

In the Med2Vec code, the weights are initialized in the range (-0.01, 0.01), which generates negative code embeddings:
params['W_emb'] = np.random.uniform(-0.01, 0.01,

However, the paper says the model maps "all medical codes C to non-negative real-valued vectors of dimension m".

Can you please help me understand this?

Thanks,
Ankit

visit representation evaluate result on mimic3

Hello Choi, thanks for sharing the code on GitHub. It is a great topic.

After reading several of your papers, I have a few questions:

Do you have visit representation evaluation results on MIMIC-III? Compared with your GRAM model, which one performs better? (I ask because on CHOA the recall@30 is around 76%, while in the GRAM paper the accuracy@20 on MIMIC-III is relatively low, around 30% on average.)
When you learn vector representations of medical concepts, you want the vectors to end up in one common space. But does it make sense to treat them as sharing a common space in the first place? For example, you build one dictionary for procedure, diagnosis, and medication codes, and then one one-hot vector over all of these codes.

Thanks

Where to download the dataset described in your paper? Are they all publicly available?(I open the link provided in your paper but can not find the download link)

Where to download the dataset described in your paper? Are they all publicly available?

Mapping embeddings to ICD codes

Hi Dr. Choi,

Thank you for sharing your work on Github!

Could you please tell me where I can find the mapping between the ICD codes and the embeddings? I was testing med2vec on the demo MIMIC data, and W_emb is an array of dimensions 4894*200, where 4894 is the number of unique ICD codes. Where can I find the mapping between the rows of W_emb and the corresponding ICD code names?

Thanks a lot!
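
Since W_emb has one row per unique code, row i should correspond to the integer i used for that code in the training sequences, so the mapping has to come from whatever dictionary converted code names to integers during preprocessing. A minimal sketch, assuming such a dictionary is available: the contents of `code_to_index` below are made up, and the tiny `W_emb` stands in for the array loaded from the .npz file.

```python
import numpy as np

# Hypothetical code-to-integer dictionary saved during preprocessing.
code_to_index = {"D_401.9": 0, "D_250.00": 1, "D_428.0": 2}

# Stand-in for the learned embedding matrix (n_codes x embedding_dim).
W_emb = np.arange(12, dtype=np.float32).reshape(3, 4)

# Invert the dictionary so row numbers can be turned back into code names.
index_to_code = {idx: code for code, idx in code_to_index.items()}

def embedding_for(code):
    """Return the embedding row for a code name."""
    return W_emb[code_to_index[code]]

# Code name -> embedding vector, for every row of W_emb.
code_embeddings = {index_to_code[i]: W_emb[i] for i in range(W_emb.shape[0])}

print(index_to_code[1], embedding_for("D_250.00"))
```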

Scatter plot from learned code representations

Hello Ed,

In Med2Vec, after creating the model file, you created a 2D scatter plot from the learned code representations. Was any grouping of the medical codes performed after creating the model file, for the scatter plot?

I ask because in Highcharts the coloring is based on some grouping.
Example:
https://jsfiddle.net/gh/get/library/pure/highcharts/highcharts/tree/master/samples/highcharts/demo/scatter/

I tried creating a scatter plot after running t-SNE on the embeddings. The plot is created, but there is no grouping: the colors are placed randomly and no clusters form.

Can you please help me understand this?

Thanks,
SathickIbrahim
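
A dimensionality reduction such as t-SNE only returns 2D coordinates, so any coloring needs a group label per code that is supplied separately. The sketch below is one possible way to arrange points into one series per group, which is the shape the linked Highcharts scatter demo uses. The coordinates, the group labels, and the function name `build_series` are all hypothetical; how the original plot was grouped is not stated here.

```python
from collections import defaultdict

# Hypothetical 2D coordinates, e.g. the output of t-SNE on W_emb.
coords_2d = {
    "D_401.9": (1.0, 2.0),
    "D_428.0": (1.2, 2.1),
    "D_250.00": (-3.0, 0.5),
}

# Hypothetical group label per code (for example, a coarse disease category).
code_group = {
    "D_401.9": "circulatory",
    "D_428.0": "circulatory",
    "D_250.00": "endocrine",
}

def build_series(coords, groups):
    """Collect points into one named series per group, so each group gets one color."""
    by_group = defaultdict(list)
    for code, (x, y) in coords.items():
        by_group[groups.get(code, "ungrouped")].append([x, y])
    return [{"name": name, "data": points} for name, points in sorted(by_group.items())]

series = build_series(coords_2d, code_group)
print(series)
```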

Negative Visit Forward Cross-Entropy on MIMIC-III

First of all, thank you for making your code available. This is a very interesting line of research.

In order to better understand your work, I rewrote med2vec in Python with TensorFlow rather than Theano. I then compared my results to yours on the MIMIC-III data set, with the same parameters used, expecting the results to be close. However, I discovered some issues:

In the calculation of visit forward cross-entropy, there are negative values. This leads to some cancellations and hence a visit cost of 300-400. Should negative values be considered here? In TensorFlow, the negative values are mapped to 0, giving a visit cost of ~4,000. What sort of cost values have you seen on MIMIC-III and other data sets?
My emb cost values are roughly equal to yours, but the value is < 10. Since visit cost >> emb cost, won't total cost just optimize for visits and not emb?

Below are some images related to visit cost calculation.

I'd be happy to make my code available to you, if you like. Thank you for your continued work on medical data analytics and I look forward to hearing back from you.

GPU training fails

I am getting an error when trying to train on the GPU. There are 47108 unique codes (quite a lot more than in MIMIC), but I still get the error even when using code and visit representations of just 5 dimensions and a batch size of just 1, so I don't believe it is an out-of-memory problem. That is, of course, if my math is right:
2 dense vectors of 47108 doubles: just 1 MB
47108x5 matrix for dense-to-code representation: 2 MB
5x5 matrix for code representation to visit: 200 bytes

Any help will be appreciated!

Using gpu device 0: Tesla P100-SXM2-16GB (CNMeM is enabled with initial size: 80.0% of memory, cuDNN 5110)
initializing parameters
building models
loading data
training start
[[ 0. 0. 0. ..., 0. 0. 0.]]
Traceback (most recent call last):
File "med2vec.py", line 323, in
train_med2vec(seqFile=args.seq_file, demoFile=args.demo_file, labelFile=args.label_file, outFile=args.out_file, numXcodes=args.n_input_codes, numYcodes=args.n_output_codes, embDimSize=args.cr_size, hiddenDimSize=args.vr_size, batchSize=args.batch_size, maxEpochs=args.n_epoch, L2_reg=args.L2_reg, demoSize=args.demo_size, windowSize=args.window_size, logEps=args.log_eps, verbose=args.verbose)
File "med2vec.py", line 290, in train_med2vec
cost = f_grad_shared(x, mask, iVector, jVector)
File "/usr/local/lib/python2.7/dist-packages/theano/compile/function_module.py", line 898, in call
storage_map=getattr(self.fn, 'storage_map', None))
File "/usr/local/lib/python2.7/dist-packages/theano/gof/link.py", line 325, in raise_with_op
reraise(exc_type, exc_value, exc_trace)
File "/usr/local/lib/python2.7/dist-packages/theano/compile/function_module.py", line 884, in call
self.fn() if output_subset is None else
RuntimeError: Cuda error: GpuElemwise node_ea5fafbdcfd074e674342684e5c33a10_0 Exp: an illegal memory access was encountered.
n_blocks=30 threads_per_block=256
Call: kernel_Exp_node_ea5fafbdcfd074e674342684e5c33a10_0_Ccontiguous<<<n_blocks, threads_per_block>>>(numEls, i0_data, o0_data)
Apply node that caused the error: GpuElemwise{Exp}(0, 0)
Toposort index: 35
Inputs types: [CudaNdarrayType(float32, matrix)]
Inputs shapes: [(47108, 47108)]
Inputs strides: [(47108, 1)]
Inputs values: ['not shown']
Outputs clients: [[GpuCAReduce{add}{0,1}(GpuElemwise{Exp}[(0, 0)].0), GpuElemwise{Mul}[(0, 1)](GpuDimShuffle{0,x}.0, GpuElemwise{Exp}[(0, 0)].0)]]

Edit: I also checked that 47108 codes do not cause problems by using only 10 codes and adding x[idx][np.array(seq)%numXcodes] = 1. in padMatrix
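
The traceback above reports that the failing Exp node receives a float32 matrix of shape (47108, 47108), which is far larger than the vectors and matrices in the estimate. The arithmetic below is a sketch of how big that one matrix is. Whether the failure comes from memory pressure or from the element count exceeding a signed 32-bit index is an assumption, not something the log confirms.

```python
n_codes = 47108
bytes_per_float32 = 4

# Number of elements in the (47108, 47108) matrix from "Inputs shapes".
n_elements = n_codes * n_codes

# Size of that single matrix in GiB.
size_gib = n_elements * bytes_per_float32 / 2**30

# Largest value a signed 32-bit integer can hold.
int32_max = 2**31 - 1
exceeds_int32 = n_elements > int32_max

print(n_elements)           # 2219163664
print(round(size_gib, 2))   # about 8.27
print(exceeds_int32)        # True
```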

How to make demo.txt

Hi Ed,

I appreciate the code you provided, but there is one thing that confuses me.

Regarding Step 3.5, could you please give some details on how to make 'demo.txt', and add the code that creates it to process_mimic.py?

Thanks!

interesting topic

It would be quite interesting if this approach could be used to match ICD codes across different language versions, or to do a sort of machine translation of medical concepts based on the term representations.

Cost and Weights are NAN

Hello,

During training the cost goes to NaN, probably because one of the weights becomes too large and the data goes out of the bounds of float32. This causes all other weights to become NaN as well. I think the classic way to deal with this is to add Batch Normalization layers, which clip large updates to the weights; however, my limited understanding of Theano and your script prevents me from testing it out. Also, the cost seems quite high. Have you seen similar values in your training? Let me know your thoughts on this:

Cannot able to Interpret Output of npz model File

Hello Ed,
While testing Med2Vec on the MIMIC database, I am unable to interpret the output of the model file. Are these model weights or predicted neighboring visits?

Please clarify my doubt!
Thank You

Can you give some data for test and detail them?

For example, in [5,8,15], what does code 5 represent at a given visit? And 8 and 15? Thanks!

output file

Hi Edward,
I have run your code on my dataset and got the .npz files. Each contains 6 numpy.array variables: W_emb, b_output, b_hidden, b_emb, W_output, W_hidden. But I can't find any further description of these output variables in the GitHub repo. Can you give some detailed instructions about the output? How can I get the code and visit representations?

Output model/weights?

Hi, I read your paper and wanted to ask if the trained (on MIMIC-III) model/W2V can be downloaded directly - I'd like to see and to try it on our internal medical data before setting up resources to try to train a new model. (And training a new model from scratch would take weeks - It's a common practice to share the base model).

thanks!

Epochs and loss during training

Hi Ed,

I am training embeddings using your default hyperparameters, except for window_size. The minimum number of visits in my dataset is 2, but I set window_size=3, as I assume your code can handle a mismatch between window_size and the actual sequence length. Am I right?

I also noticed that mean_cost reached its minimum at the 2nd epoch and then started increasing. Although I read in your paper that the number of epochs does not hurt the code representations very much, I am not sure which epoch I should choose once training has finished. Should I use the one with the minimum cost, or the one from the last epoch?

TypeError: Expected Variable, got odict values

Hi, thank you for the awesome paper. I've been very interested in getting med2vec up and running using just the 3 required parameters for starters.

With my pickled sequence list looking like: [[1,2,3], [4,5,6,7], [-1], [2,4], [8,3,1], [3]]
I get the following error:

File "test_med2vec.py", line 248, in train_med2vec
grads = T.grad(cost, wrt=tparams.values())

TypeError: Expected Variable, got odict_values([W_emb, b_emb, W_hidden, b_hidden, W_output, b_output]) of type <class 'odict_values'>

Here's my run syntax:

python3 test_med2vec.py 'seq.pkl' 8 'med2vec_fin'

Any idea why it's not liking the dictionary values? Thank you for your time, if you're able to help.
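
One likely cause, assuming the script was written for Python 2: in Python 3, `dict.values()` returns a view object rather than a list, which matches the `odict_values` type named in the error. The sketch below shows the difference using plain strings as stand-ins for the Theano shared variables, so it runs without Theano. Under that assumption, the corresponding change in the script would be to pass `list(tparams.values())` as `wrt`.

```python
from collections import OrderedDict

# Stand-ins for the Theano shared variables held in tparams.
tparams = OrderedDict([("W_emb", "w"), ("b_emb", "b")])

# Python 3: values() is a view of type odict_values, not a list.
values_view = tparams.values()

# Wrapping the view in list() gives an ordinary list in the same order.
values_list = list(tparams.values())

is_list_before = isinstance(values_view, list)
is_list_after = isinstance(values_list, list)

print(type(values_view).__name__, is_list_before)
print(type(values_list).__name__, is_list_after)
```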

Questions about complexity analysis

Hi Ed,

As mentioned in your paper, "Therefore the complexity of Med2Vec is dominated by the code representation learning process, for which we use the Skip-gram algorithm".

I know you use grouper/parent codes to decrease the complexity of visit-level learning process. But it seems that you didn't do much on the code-level part.

Is there a reason why you do not use methods like negative sampling to decrease the complexity of code level learning process?

Thanks
Xianlong

Where I can find the AHFS classification table?

Hello Choi,

As you mentioned in the paper, you use the AHFS classification to group the NDC codes. I wonder if you still have that mapping table? (Or could you point me to some way of finding it?)
I am doing related work but can't find the table anywhere online.

Thank you!

high training cost

Hi Edward,
While I was searching for a new research idea, I've found your model and it was interesting in that it can learn code- and visit-level representation from EHRs simultaneously.

Using your model, I'm trying to learn embeddings that can represent measurements other than medical codes. However, the training cost seems quite high (around 150~250), and it doesn't converge (or go below 1) the way other models do. I've found that others report the same range of cost, but I wonder what the final cost was at the end of your training.

Is it natural for this model to have such a high cost at the end of training, or is something wrong with my settings? I've adjusted the parameters in the model, but a cost of 170 was the best I could get.

I would appreciate your help.

Interpretation of learned representations

Hello Ed @mp2893,

This is super interesting work! I have two questions regarding the interpretation of learned representations.

In Section 3.5 - Interpreting code representations, the top k medical codes from each embedding dimension are selected to check if they are clinically related. However, since the model uses skip-gram, and according to this post, I think we should group medical codes by cosine similarity rather than by the magnitudes of the values on a specific embedding dimension. I think it is the angle between different medical codes that matters, not the magnitudes of the values on specific dimensions.
I have a similar question regarding interpreting visit representations. Specifically, why is it meaningful to compare the magnitudes of a specific dimension in the visit embedding space?

Thank you very much!
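
The two ways of grouping codes contrasted in this question can be sketched side by side. The embedding matrix `E` below is made up, and the function names are hypothetical; this is only meant to show that "largest values on one dimension" and "nearest by cosine similarity" are different selections, not to reproduce the paper's analysis.

```python
import numpy as np

def top_k_by_dimension(E, dim, k):
    """Indices of the k codes with the largest value on one embedding dimension."""
    return np.argsort(-E[:, dim])[:k]

def nearest_by_cosine(E, code, k):
    """Indices of the k codes whose vectors have the smallest angle to the given code."""
    norms = np.linalg.norm(E, axis=1, keepdims=True)
    unit = E / np.clip(norms, 1e-12, None)  # guard against zero-length vectors
    sims = unit @ unit[code]
    sims[code] = -np.inf                    # exclude the code itself
    return np.argsort(-sims)[:k]

# Made-up embeddings: 4 codes, 2 dimensions.
E = np.array([
    [1.0, 0.0],
    [2.0, 0.1],
    [0.0, 1.0],
    [0.1, 3.0],
])

by_dimension = top_k_by_dimension(E, dim=0, k=2)
by_cosine = nearest_by_cosine(E, code=2, k=1)

print(by_dimension)  # codes ranked by their value on dimension 0
print(by_cosine)     # code closest in angle to code 2
```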

questions about the training data format

Hi there, thanks a lot for making the code available. It helps me a lot in understanding your paper.
I have a question about the format of the training data. In README.md, step 3, when describing how to prepare the training data, each visit is said to be represented by a list of integers, such as [5,8,13]. In the closed issue "output file" (#7), you answered TheodoreZhao's question and said: "For visit representation, you can derive the code-level representation using u_t = ReLU(W_emb x_t + b_emb), possibly with a multi-hot vector, then use v_t = ReLU(W_hidden u_t + b_hidden) to derive the visit representation".

So my question is: if I represent a visit with a list of integers in the training step, and compute a visit representation based on a multi-hot vector, what is the relationship between the list and the multi-hot vector? How do I get the multi-hot vector for each visit?

thanks.
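
One way to read the relationship asked about here: the integers in a visit are the positions that are set to 1 in a vector whose length is the number of unique codes. The sketch below builds that multi-hot vector and then applies the two formulas quoted above. It makes several assumptions: the weights are random stand-ins rather than values loaded from a trained .npz file, W_emb is stored as (number of codes x embedding size) so the product is written `x_t @ W_emb`, and no demographic features are concatenated. The helper names are hypothetical.

```python
import numpy as np

def to_multi_hot(visit, n_codes):
    """Turn a list of integer codes into a 0/1 vector of length n_codes."""
    x = np.zeros(n_codes, dtype=np.float32)
    x[visit] = 1.0
    return x

def relu(z):
    return np.maximum(z, 0.0)

# Stand-in sizes and random weights (a real run would load these from the .npz file).
n_codes, emb_dim, hidden_dim = 16, 4, 3
rng = np.random.default_rng(0)
W_emb = rng.uniform(-0.01, 0.01, (n_codes, emb_dim)).astype(np.float32)
b_emb = np.zeros(emb_dim, dtype=np.float32)
W_hidden = rng.uniform(-0.01, 0.01, (emb_dim, hidden_dim)).astype(np.float32)
b_hidden = np.zeros(hidden_dim, dtype=np.float32)

visit = [5, 8, 13]                     # one visit, as in the README example
x_t = to_multi_hot(visit, n_codes)     # ones at positions 5, 8 and 13

u_t = relu(x_t @ W_emb + b_emb)        # u_t = ReLU(W_emb x_t + b_emb)
v_t = relu(u_t @ W_hidden + b_hidden)  # v_t = ReLU(W_hidden u_t + b_hidden)

print(x_t)
print(v_t.shape)
```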

NaN gradient may be due to weight initialization

Hi Ed,

I saw that in your code the weights are initialized with a truncated normal distribution. When I ran it, it seemed that in the medical-code-loss part this produced large values feeding into exp, resulting in inf in the loss and NaN gradients. Also, because of these initial weights, the loss in general is pretty high, around several hundred, and the L2 loss in particular is in the tens of thousands. I then changed the weight initialization to be uniform over a small interval, [-0.1, 0.1]. That seems to produce a loss of reasonable magnitude (under 10). I wonder if you still remember whether you tried other weight initializations and how they affected the results.

Another question I have is that in the paper, the loss is averaged over T. Is this T visits in the batch or visits per patient? In your code, it seems, your ivec and jvec are generated for the batch. So in the medical-code-loss calculation, it is averaging over all visits in a batch, instead of averaging per patient and then averaging over all patients in a batch?

Thanks!
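
The overflow described in this issue can be reproduced in isolation. The sketch below is a general numerical technique, not the repository's implementation: it assumes the code loss involves a softmax, shows that `exp` of a large float32 score overflows to inf and yields NaN after normalization, and shows that subtracting the row maximum before `exp` avoids this without changing the result mathematically.

```python
import numpy as np

def softmax_naive(z):
    """Softmax computed directly; overflows when scores are large."""
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def softmax_stable(z):
    """Softmax with the row maximum subtracted first, so exp never sees a large input."""
    shifted = z - z.max(axis=-1, keepdims=True)
    e = np.exp(shifted)
    return e / e.sum(axis=-1, keepdims=True)

# float32 scores of the kind a large initialization can produce.
scores = np.array([[100.0, 0.0, -100.0]], dtype=np.float32)

with np.errstate(over="ignore", invalid="ignore"):
    naive = softmax_naive(scores)   # exp(100) overflows float32, inf / inf gives NaN
stable = softmax_stable(scores)

print(naive)
print(stable)
```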
