jamesmullenbach / caml-mimic
multilabel classification of EHR notes
License: MIT License
Hi,
If I got it right, models.py > ConvAttnPool is the model corresponding to the CAML architecture suggested in the article.
Looking at the forward function, I see that the last operation before calculating the loss is linear (multiplying by final.weight and adding final.bias):
y = self.final.weight.mul(m).sum(dim=2).add(self.final.bias)
but there's no sigmoid after that, as suggested in the paper:
What did I miss?
Thanks :-)
Mor
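One likely explanation (an assumption on my part, not confirmed in this thread) is that the sigmoid is folded into the loss rather than the forward pass: PyTorch's binary_cross_entropy_with_logits applies the sigmoid internally, so the model can emit raw logits during training. A minimal check of that equivalence:

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
logits = torch.randn(2, 5)                     # raw model outputs, no sigmoid
targets = torch.randint(0, 2, (2, 5)).float()  # multilabel 0/1 targets

# binary_cross_entropy_with_logits applies the sigmoid internally...
loss_a = F.binary_cross_entropy_with_logits(logits, targets)
# ...so it matches an explicit sigmoid followed by plain BCE.
loss_b = F.binary_cross_entropy(torch.sigmoid(logits), targets)
assert torch.allclose(loss_a, loss_b)
```

At prediction time one would still apply a sigmoid (or sort the raw logits, which gives the same ranking) before thresholding.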
@jamesmullenbach @sarahwie
Can you please tell me why you didn't remove words like "admission date", "discharge date", "sex", etc.?
If needed, can we remove those?
Hi James,
I am trying to understand the prediction psv file. In preds_dev.psv, why does each row have a different length? Shouldn't each row have the same number of predicted ICD-9 codes (at least 15, because of the Precision@8 and Precision@15 metrics)?
Thanks.
Just a note: while training a new model on the top 50 labels, we need to set the criterion to precision_at_5 instead of precision_at_8; otherwise, training.py will not save the model.pth file.
This is just a minor issue but might be helpful for other users.
The parameter Y of the training script is not used for building the model. Instead, the size of the label space is set in the method 'pick_model' from the dictionary computed during data processing.
This makes sense, but then the parameter Y should be removed.
Hi,
Running predictions/DRCAML_mimic3_50/train_new_model.sh, I got such result:
evaluating on test
file for evaluation: ../../mimicdata/mimic3/test_50.csv
[MACRO] accuracy: 0.363, precision: 0.557, recall: 0.465, f-measure: 0.501, AUC: 0.855
[MICRO] accuracy: 0.389, precision: 0.619, recall: 0.511, f-measure: 0.560, AUC: 0.881
prec_at_5: 0.553
rec_at_5: 0.523
The performance of DR-CAML I got above is much worse than the one reported in Table 5.
I cannot reproduce the result of CAML either, while CNN works about as well as in Table 5.
Could you release the parameters, or new scripts, that reproduce the performance in the paper?
Thank you very much !
Hi James,
I'm trying to understand the code in your model.py.
I see that at line 106, you have self.final = nn.Linear(num_filter_maps, Y). The Y is of dimension about 8930.
Please correct me if I am wrong. As I understand it, this nn.Linear is a trick: it holds the weight matrix so that you can do pointwise multiplication at line 135 with y = self.final.weight.mul(m).sum(dim=2).add(self.final.bias).
So nn.Linear is not used in the standard sense, as in the PyTorch tutorials, where we would usually call self.final(some-input).
Is this correct?
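To see that the line above really is a per-label dot product, here is a small self-contained sketch (the toy sizes are assumptions; in the repo Y is the label-space size and d the number of filter maps):

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
Y, d = 4, 6                       # toy sizes: 4 labels, 6 filter maps
final = nn.Linear(d, Y)           # holds one weight vector w_l per label
m = torch.randn(2, Y, d)          # per-label attention-pooled vectors, batch of 2

# The line from models.py: broadcast-multiply each m_l by its w_l,
# then sum over the feature dimension -- a per-label dot product.
y = final.weight.mul(m).sum(dim=2).add(final.bias)   # shape [2, Y]

# Equivalent explicit form:
y_ref = torch.einsum('byd,yd->by', m, final.weight) + final.bias
assert torch.allclose(y, y_ref)
```

Calling self.final(m) directly would instead compute a full matrix product, mixing every label's pooled vector with every other label's weights, which is not what the architecture wants.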
Can log_reg.py be fully reproduced? It seems that something is missing here. Thanks!
Best,
A
Hi,
I have two questions regarding the CAML implementation:
Many thanks!
Hello, is there a plan to release the informativeness annotations (span + label + expert annotation)?
How can I get code_emb?
Hi @jamesmullenbach,
I'm getting an error while running the dataproc_mimic_III notebook:
dataproc/concat_and_split.pyc in split_data(labeledfile, base_name)
61 for splt in ['train', 'dev', 'test']:
62 hadm_ids[splt] = set()
---> 63 with open('%s/%s_full_hadm_ids.csv' % (MIMIC_3_DIR, splt), 'r') as f:
64 for line in f:
65 hadm_ids[splt].add(line.rstrip())
IOError: [Errno 2] No such file or directory: '../mimicdata/mimic3/train_full_hadm_ids.csv'
The README.md states that these files are already in the repository:
| | *_hadm_ids.csv (already in repo)
However, it looks like they are not. Where can these files be found? Am I missing something?
The logistic regression part in training.py seems incomplete.
Hi,
Could you report the training time as well as your hardware specifications?
Some statistics, such as time per batch (for different batch sizes) and time per epoch, would be good to have, since you have sequences of 2,500 words!
Hi,
I am having a problem with line 40 in training.py:
csv.field_size_limit(sys.maxsize)
The error says OverflowError: Python int too large to convert to C long.
What do you think is causing this problem? Is it the total number of words (not unique words, but the total count)?
Can I ask what was your memory usage?
Thanks.
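This error is usually not about the data at all: on Windows, csv.field_size_limit stores the limit in a C long, which is 32-bit there, so sys.maxsize (2**63 - 1) overflows it. A common workaround (a sketch, not the repo's code) is to back off until a value is accepted:

```python
import csv
import sys

# On Windows the C long backing csv.field_size_limit is 32-bit, so
# sys.maxsize overflows it; shrink the requested limit until it fits.
limit = sys.maxsize
while True:
    try:
        csv.field_size_limit(limit)
        break
    except OverflowError:
        limit = limit // 10
```

On Linux/macOS the first call succeeds and the loop exits immediately, so the snippet is safe on any platform.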
Hi, from reading through model.py, I see that the vector embeddings of the label descriptions are trained jointly with the rest of the module. So, practically speaking, the 2nd module in section 2.5 of the paper is in fact trained jointly with the 1st module. Is this correct?
To make the question clearer: when you say "2nd module" in the paper, you do not imply that the 2nd module is trained entirely independently of the "standard" model. Is this correct?
Thanks for your help.
Both of them change the top-50 label statistics. And with MIMIC-III 1.4, the full label set no longer seems to be 8,922 labels (if ICD9_CODE is read as a str).
I prepared the data following the dataproc_mimic_III.ipynb file and got six files, i.e. train_50, test_50, dev_50, train_full, test_full, and dev_full. I am facing a problem with train_full, test_full, and dev_full: train_full contains 8686 unique labels, test_full contains 4075, and dev_full contains 3009. I don't know why the label sets are not the same size in each file, or how to make them the same size so that I can train my model.
Kindly help me.
Hi,
I trained the model based on the code here. When I load the model back into Python, I get an error.
This is the command I run:
training.py train_full.csv vocab.csv full conv_attn 100 --filter-size 10 --num-filter-maps 50 --dropout 0.2 --patience 10 --lr 0.0001 --test-model model_best_prec_at_8.pth --gpu --quiet
This is the error I get:
Traceback (most recent call last):
File "C:\ProgramData\Anaconda3\lib\site-packages\torch\nn\modules\module.py", line 482, in load_state_dict
own_state[name].copy_(param)
RuntimeError: inconsistent tensor size, expected tensor [51919 x 100] and src [51920 x 100] to have the same number of elements, but got 5191900 and 5192000 elements respectively at c:\pytorch\torch\lib\th\generic/THTensorCopy.c:86
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "training.py", line 355, in <module>
main(args)
File "training.py", line 31, in main
args, model, optimizer, params, dicts = init(args)
File "training.py", line 48, in init
model = tools.pick_model(args, dicts)
File "C:/Users/dat/Dropbox/caml-mimic\learn\tools.py", line 36, in pick_model
model.load_state_dict(sd)
File "C:\ProgramData\Anaconda3\lib\site-packages\torch\nn\modules\module.py", line 487, in load_state_dict
.format(name, own_state[name].size(), param.size()))
RuntimeError: While copying the parameter named embed.weight, whose dimensions in the model are torch.Size([51919, 100]) and whose dimensions in the checkpoint are torch.Size([51920, 100]).
It seems that vocab.csv does not match the vocab in the trained model. Is this because the "unknown" token was added after vocab.csv was made?
I feel there is some strange mismatch here. I checked vocab.csv and it has 51917 lines:
wc -l vocab.csv
51917 vocab.csv
Thanks.
Could you share the top 50 ICD codes used in your work? Did you use DIAGNOSES_ICD to extract the ICD codes? I looked into CAML_mimic3_50/preds_test.psv and found that some codes, such as 37.22 and 96.72, are not in the MIMIC-III data.
Everything is fine in the notebook for mimic3 until:
tr, dv, te = concat_and_split.split_data(fname, base_name=base_name)
notes_labeled.csv and disch_full.csv are OK and generated successfully, but hadm_id = row[1] fails. It looks like there is an empty row somewhere in the file, no?
IndexError Traceback (most recent call last)
in
----> 1 tr, dv, te = concat_and_split.split_data(fname, base_name=base_name)
~\Documents\GitHub\caml-mimic\dataproc\concat_and_split.py in split_data(labeledfile, base_name)
75 print(str(i) + " read")
76
---> 77 hadm_id = row[1]
78
79 if hadm_id in hadm_ids['train']:
IndexError: list index out of range
Hi there,
I read your paper, and you mentioned that you used padding to ensure the input and output of the convolutional layer have the same length. However, in the code you set padding=int(kernel_size/2), which cannot ensure that. For example, if the input is 111 and the kernel_size is 4, the input after padding would be 0011100, and the length of the output would be 4.
I googled and found that PyTorch seems to have no equivalent of TensorFlow's 'same' padding for convolutional layers. Is there any workaround to achieve this?
Thanks a lot.
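One workaround (a sketch, not part of this repo) is to pad asymmetrically with F.pad before calling conv1d, which reproduces TensorFlow's 'same' behavior for stride 1 even with even kernel sizes:

```python
import torch
import torch.nn.functional as F

def conv1d_same(x, weight, bias=None):
    """1-D convolution with TensorFlow-style 'same' padding (stride 1).

    For even kernel sizes a symmetric padding of int(k/2) changes the
    output length, so pad asymmetrically: one extra zero on the right.
    x: [batch, in_channels, length]; weight: [out_channels, in_channels, k].
    """
    k = weight.shape[-1]
    left = (k - 1) // 2
    right = k - 1 - left
    return F.conv1d(F.pad(x, (left, right)), weight, bias)
```

Recent PyTorch versions also accept padding='same' directly in nn.Conv1d (for stride 1), but this codebase predates that option.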
Hello,
I have a question about the function precision_at_k in evaluation.py. I think the denominator should be the number of positive predictions made among the top k predictions; however, in the code, the length of the top k is used. For example, if only one positive prediction is made in the top 5, the denominator should be 1, but in this case it would still be 5.
Here is my modification:
def precision_at_k(yhat, yhat_raw, y, k):
    #num true labels in top k predictions / num 1 predictions in top k
    sortd = np.argsort(yhat_raw)[:,::-1]
    topk = sortd[:,:k]

    #get precision at k for each example
    vals = []
    for i, tk in enumerate(topk):
        if len(tk) > 0:
            num_true_in_top_k = y[i,tk].sum()
            denom = yhat[i,tk].sum()
            if denom == 0: # in case no true predictions made in top k
                vals.append(1)
            else:
                vals.append(num_true_in_top_k / float(denom))

    return np.mean(vals)
Could you take a look at it? Correct me if I am wrong.