
Comments (4)

mp2893 commented on June 26, 2024

Hi Victor,

Thanks for taking interest in my work.

As for your first question: no, I haven't tried other initialization strategies, but I think your approach makes sense. Would you care to contribute it to the repo?

For the second question: IIRC (it has been a long time since I wrote this code), ivec and jvec are constructed from the preprocessed patient records, so there is no concept of a "patient" in the minibatch. It is just a bunch of random visits from the EHR.
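
If it helps, the pair construction is roughly the sketch below. This is from memory, so the helper name and record format are a paraphrase rather than the actual preprocessing code: each visit is a list of code indices, and every ordered pair of distinct codes in a visit becomes one (i, j) example.

def build_pairs(visits):
    # visits: a list of visits, each a list of medical-code indices
    ivec, jvec = [], []
    for visit in visits:
        for i in visit:
            for j in visit:
                if i != j:  # pair every code with every other code in the same visit
                    ivec.append(i)
                    jvec.append(j)
    return ivec, jvec

# build_pairs([[0, 5, 7]]) -> ivec = [0, 0, 5, 5, 7, 7], jvec = [5, 7, 0, 7, 0, 5]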

Best
Ed


victorconan commented on June 26, 2024

Hi Ed! Thanks for the reply! I really appreciate it!

I am porting your code to TF2 and testing it. I will see if I can contribute to the repo. I am also comparing the results when I implement the code exactly as described in your paper. My data is larger (~2M patients, ~77k medical codes), and it seems to take 2.5 days to train one epoch on a single CPU...


mp2893 commented on June 26, 2024

Sounds interesting. Feel free to share any results from your experiments, so that others might gain new knowledge!


victorconan commented on June 26, 2024

I got my 10 epochs of training done, and I found that 80% of the codes end up with all-zero embeddings (I am taking ReLU(W_emb))... In general, the visit loss (~1e-3) is much smaller than the code loss (~10). It seems the co-occurrence loss dominates the training, and it has difficulty learning embeddings for most of the codes.
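
For reference, this is roughly how I counted the dead codes (emb_w here is the raw embedding matrix, as in the snippets below):

relu_w = tf.maximum(emb_w, 0)                        # ReLU(W_emb), [num_codes, emb_dim]
dead = tf.reduce_all(tf.equal(relu_w, 0.0), axis=1)  # True where a code's row is all zeros
dead_frac = tf.reduce_mean(tf.cast(dead, tf.float32))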

I also found that porting the code loss to TF 2 runs into an issue when calculating the exponential terms: taking the exponential of the inner products effectively requires the embedding vectors to be sparse, otherwise the values overflow float32:

# emb_w: [num_codes, emb_dim] raw code-embedding matrix; eps: small constant (e.g. 1e-8)
emb_w = tf.maximum(emb_w, 0)           # ReLU(W_emb)
emb_w_transpose = tf.transpose(emb_w)  # [emb_dim, num_codes]

# softmax denominator for every code i: sum_k exp(w_i . w_k)
norms = tf.reduce_sum(tf.math.exp(tf.matmul(emb_w, emb_w_transpose)), axis=1)

i = tf.gather(emb_w_transpose, ivec, axis=1)  # [emb_dim, batch]
j = tf.gather(emb_w_transpose, jvec, axis=1)  # [emb_dim, batch]

numerator = tf.math.exp(tf.reduce_sum(j * i, axis=0))  # exp(w_j . w_i); overflows for large dot products
denominator = tf.gather(norms, ivec)
cost = -tf.math.log(numerator / denominator + eps)
cost = tf.reduce_mean(cost)
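
To make the failure concrete: float32 tops out around 3.4e38, so the exponential overflows to inf as soon as an inner product passes roughly 88:

tf.math.exp(tf.constant(88.0))  # ~1.65e38, close to the float32 max
tf.math.exp(tf.constant(89.0))  # inf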

So I switched to tf.math.reduce_logsumexp below, which prevents an inf loss:

# same pairwise dot products as before, but kept in log space (no exp)
norms = tf.matmul(emb_w, emb_w_transpose)  # [num_codes, num_codes] logits

numerator = tf.reduce_sum(j * i, axis=0)   # w_j . w_i, i.e. the log of the old numerator
denominator = tf.math.reduce_logsumexp(tf.gather(norms, ivec), axis=1)
cost = -(numerator - denominator)          # -log softmax, computed stably
cost = tf.reduce_mean(cost)
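
The reason this avoids inf is, presumably, the standard max-shift identity behind reduce_logsumexp, which never exponentiates anything larger than zero:

# logsumexp(x) = max(x) + log(sum(exp(x - max(x))))
x = tf.constant([100.0, 101.0, 102.0])
tf.math.log(tf.reduce_sum(tf.math.exp(x)))  # inf in float32
tf.math.reduce_logsumexp(x)                 # ~102.41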

And it's 3 times slower...

