Comments (11)
Thanks for your interest in this work.
However, I don't quite understand what you meant. Are you talking about projecting words and medical concepts (e.g. ICD-9 codes, medication codes, procedure codes) into the same latent space?
We are considering similar approaches for our future work, involving not just words but also other modalities of medical data.
from med2vec.
Just a few thoughts, and they may not be applicable in your case. What I meant was matching, for example, ICD-9 procedure codes to another version, a different language, or a different medical classification (e.g. ICHI) by their descriptions. I wonder whether that would require semantic matching technologies.
I see. Let me restate your goal to make sure I understand it correctly.
Given two different coding schemes for the same medical concept, such as ICD-9 procedure codes and CPT procedure codes, you want to see which ICD-9 procedure code corresponds to which CPT procedure code (or vice versa).
In that case, it would be easy if you have two datasets where one dataset uses, for example, ICD-9 diagnosis codes and ICD-9 procedure codes, and another dataset uses ICD-9 diagnosis codes and CPT procedure codes. Using the first dataset, you can project the ICD-9 diagnosis codes and the ICD-9 procedure codes into the same latent space. Using the second dataset, you can project the ICD-9 diagnosis codes and the CPT procedure codes into a second latent space. Then, you can select one diagnosis code, retrieve the k-nearest procedure codes from each latent space, and compare the retrieved procedure codes.
Of course, to use this approach, the two datasets need to consist of similar patients (if one dataset is of children and another of seniors, the distributions of medical codes won't match), similar patient sizes, similar ICD-9 diagnosis codes, etc. But this approach does not require you to compare the text descriptions of ICD-9 procedure codes and CPT procedure codes to see which corresponds to which, thus eliminating the need for NLP techniques.
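The retrieve-and-compare step above can be sketched as follows. The embeddings here are random placeholders standing in for vectors from two separately trained med2vec models; `k_nearest` is a hypothetical helper, not part of the med2vec code:

```python
import numpy as np

# Hypothetical embeddings, shape (num_codes, embedding_dim); in practice
# these would come from two med2vec models trained on the two datasets.
diag_vecs_a = np.random.rand(10, 8)   # ICD-9 diagnosis codes, dataset A
proc_vecs_a = np.random.rand(20, 8)   # ICD-9 procedure codes, dataset A
proc_vecs_b = np.random.rand(30, 8)   # CPT procedure codes, dataset B

def k_nearest(query, candidates, k=5):
    """Return indices of the k candidates closest to query (cosine similarity)."""
    q = query / np.linalg.norm(query)
    c = candidates / np.linalg.norm(candidates, axis=1, keepdims=True)
    sims = c @ q
    return np.argsort(-sims)[:k]

# Pick one diagnosis code, retrieve its nearest procedure codes from each
# latent space, then compare the two retrieved sets.
diag = diag_vecs_a[0]
nearest_icd9_procs = k_nearest(diag, proc_vecs_a, k=5)
nearest_cpt_procs = k_nearest(diag, proc_vecs_b, k=5)
```

If the two latent spaces are comparable, the two retrieved sets should describe similar procedures, which gives a candidate ICD-9-to-CPT mapping.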
Otherwise, you can compare the text descriptions of ICD-9 procedure codes with the text descriptions of CPT procedure codes and decide which corresponds to which, which is more straightforward. I think, with good medical NLP tools, this will yield better results, because I assume it won't be easy to find two similar datasets as described above.
Thank you for the inspiration. However, the first approach may require the same or similar granularity between the code schemas to match precisely. I'm wondering whether your second proposal would work by using representation learning to encode the medical concepts based on their semantic information. If I'm right, your med2vec could be an alternative to existing coding systems like ICD-9/10, capturing not only the classification meaning but also the linguistic information. In previous work, I used NLP to create a mapping table between ICD-9 and ICHI.
My second proposal was actually more similar to your previous work (creating a mapping table between ICD-9 and ICHI).
But if you have a good way to embed the descriptions of medical codes (such as doc2vec or any sentence embedding algorithm), then you can project, for example, ICD-9 procedure codes and CPT procedure codes into the same latent space using their descriptions.
This would enable you to find out which ICD-9 procedure codes are similar to which CPT procedure codes.
But this approach requires a pre-trained sentence embedding (or text embedding) model. Typically, embedding algorithms used in NLP (word2vec, doc2vec, and others) require a huge corpus to train on, but the descriptions of the medical codes are very limited. So it is unlikely to work if you train your embedding algorithm only on the code descriptions. You will need to pre-train the embedding model on some large medical corpus, then apply the model to the code descriptions.
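One minimal version of this idea is to average pre-trained word vectors over each description and compare descriptions by cosine similarity. The words and 4-dimensional vectors below are made up for illustration; in practice `word_vecs` would come from a model trained on a large medical corpus:

```python
import numpy as np

# Hypothetical pre-trained word vectors (e.g. from word2vec trained on a
# large medical corpus); the entries here are toy values.
word_vecs = {
    "removal":      np.array([0.1, 0.9, 0.2, 0.0]),
    "of":           np.array([0.0, 0.1, 0.0, 0.1]),
    "appendix":     np.array([0.8, 0.2, 0.7, 0.1]),
    "appendectomy": np.array([0.7, 0.3, 0.8, 0.1]),
}

def embed_description(text):
    """Embed a code description by averaging the vectors of its known words."""
    vecs = [word_vecs[w] for w in text.lower().split() if w in word_vecs]
    return np.mean(vecs, axis=0) if vecs else np.zeros(4)

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Compare an ICD-9-style description with a CPT-style description.
sim = cosine(embed_description("removal of appendix"),
             embed_description("appendectomy"))
```

Averaging ignores word order, so a proper sentence embedding model would likely do better, but even this baseline makes the cross-vocabulary comparison concrete.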
Hello Ed,
I have been playing around with the med2vec model on MIMIC-III for a few days, and I tried @paulcx 's idea: merging codes that are encoded under different coding schemes. (This is one application I can think of for evaluating the quality of the medical concept vectors.)
I split the drug codes into two datasets to mimic two different hospitals' data, and kept the ICD codes the same. After running the med2vec model, the performance was very good: I got 80% recall@8.
However, one thing I noticed is that the good result comes from the visit-level cost. If I use only the code-level cost, recall@8 drops to 5%; if I use only the visit-level cost, it stays around 80%.
Another thing is that the visit-level cost is much higher than the code-level cost, which is reasonable considering the sigmoid function at the visit level. But could that cause the model to focus much more on the visit level?
Given these two observations, my question is: in your experiments, did you find that the visit-level cost is the key to the success of the medical vectors and the code-level cost is not that important? Or am I missing something?
Thanks!
Hi Xianlong,
Generally, I wouldn't recommend running med2vec on MIMIC-III.
It is a very small dataset (about 45K patients in total), and there are probably only a couple thousand patients (or even fewer) who made at least three visits, because it's an ICU dataset. Therefore the visit-level softmax loss probably won't train well.
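You can check this directly. Assuming the med2vec input format, where the sequence file is a list of visits (each a list of code indices) with `[-1]` separating patients, a quick count of patients with at least three visits looks like this (a sketch, not part of the repo):

```python
def patients_with_min_visits(seqs, min_visits=3):
    """Count patients with at least min_visits visits in a med2vec sequence
    list, where the single-element visit [-1] marks a patient boundary."""
    count, visits = 0, 0
    for visit in seqs:
        if visit == [-1]:          # patient boundary
            if visits >= min_visits:
                count += 1
            visits = 0
        else:
            visits += 1
    if visits >= min_visits:       # last patient has no trailing delimiter
        count += 1
    return count

# Toy sequence: three patients with 2, 3, and 1 visits respectively.
seqs = [[1, 2], [3], [-1], [4], [5], [6], [-1], [7]]
n = patients_with_min_visits(seqs)  # → 1
```

Running this on the output of process_mimic.py would show how few MIMIC-III patients actually contribute to the visit-level loss.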
I uploaded process_mimic.py so that people can "try out" med2vec, not to obtain state-of-the-art performance with MIMIC-III.
And you even divided MIMIC-III into two parts, so the data size is even more problematic.
Now, to your findings:
When you say you got certain recall@8, I have so many questions to ask regarding your experiment setup. And when you say that the visit-level cost is much higher than code-level cost, I'd like to know how much. So if you have time, we can talk on Skype. I'm interested to learn your findings.
Please send me an email to [email protected].
BTW, you might be right about the balance between visit-level cost and code-level cost. Empirically, summing the two worked just fine. But if you can think of some clever way to balance the two and run experiments, it would be great to learn new findings.
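One simple way to experiment with that balance is a convex combination of the two losses. The weight `alpha` is hypothetical and not part of the original med2vec code, which just sums the two costs:

```python
def total_cost(visit_cost, code_cost, alpha=0.5):
    """Convex combination of the med2vec losses: alpha=0.5 recovers a plain
    (scaled) sum, while alpha -> 1 emphasizes the visit-level objective."""
    return alpha * visit_cost + (1.0 - alpha) * code_cost

# Example: visit-level cost 2.0, code-level cost 4.0.
balanced = total_cost(2.0, 4.0, alpha=0.5)  # → 3.0
```

Sweeping `alpha` and measuring recall@k at each setting would be one way to test whether the visit-level cost really dominates training.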
Thanks,
Ed
@1230pitchanqw Hi, I have almost the same questions as @mp2893 asked. It would be nice if you could share your findings in detail so we could discuss them.
hello @paulcx,
Sorry for the late response. I had a conversation with Ed yesterday and he gave me some valuable advice.
What I did was actually very simple: instead of training the two datasets separately, I trained them together. The general idea follows Ed's earlier proposal:
'''In that case, it would be easy if you have two datasets where one dataset uses, for example, ICD-9 diagnosis codes and ICD-9 procedure codes, and another dataset uses ICD-9 diagnosis codes and CPT procedure codes. Using the first dataset, you can project the ICD-9 diagnosis codes and the ICD-9 procedure codes into the same latent space. Using the second dataset, you can project the ICD-9 diagnosis codes and the CPT procedure codes into a second latent space. Then, you can select one diagnosis code, retrieve the k-nearest procedure codes from each latent space, and compare the retrieved procedure codes.'''
Now for my findings so far: 1. this method only works when I use inpatient data (more codes per visit); 2. the code-level training has very little effect on the result; 3. even though using inpatient data leads to a good result on this task, the quality of the medical vectors is poor: synonyms have small cosine similarity values (around 0.2, while in Ed's trained vectors they are around 0.9).
Thanks
@1230pitchanqw Thanks for your insights. I'm wondering if the paper 'Earth Mover's Distance Minimization for Unsupervised Bilingual Lexicon Induction' would help here somehow.
@paulcx
Thanks! That is a very interesting approach to measuring distances between different vocabularies.
Have you run any experiments with it?