
clinicalbert's People

Contributors

emilyalsentzer, tnaumann


clinicalbert's Issues

Clarification on Tokenizer for MIMIC

I don't see the BERT tokenizer used in the code for the MIMIC fine-tuning (it appears to use custom tokenizer code instead), and that code does not seem to perform the WordPiece tokenization used in the rest of BERT. You do appear to use the BERT tokenizer for the MedNLI task. Please clarify.

inquiry about cosine similarity between tokenized sentences

Hi there; you're honestly doing God's work here by sharing this on Hugging Face.

I am, however, quite confused about how to use this tool appropriately. I originally tried to tokenize sentences with the clinicalBERT model trained on discharge summaries and checked whether it could recognize similar medical terminology and group it together, or return highly similar words. So far, base BERT seems to perform better. Is there any chance your work will be extended into an STS-B-style sentence-similarity model?
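
For context, here is a minimal mean-pooling sketch (my own, not from the repo) for comparing two sentences with the Hugging Face release. Note that plain masked-LM BERT variants are not trained with a sentence-similarity objective, so an STS-style fine-tuned model would usually be needed for strong similarity scores.

from transformers import AutoTokenizer, AutoModel
import torch

tokenizer = AutoTokenizer.from_pretrained("emilyalsentzer/Bio_ClinicalBERT")
model = AutoModel.from_pretrained("emilyalsentzer/Bio_ClinicalBERT")
model.eval()

def embed(text):
    inputs = tokenizer(text, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**inputs).last_hidden_state          # (1, seq_len, 768)
    mask = inputs["attention_mask"].unsqueeze(-1).float()
    return (hidden * mask).sum(1) / mask.sum(1)              # mean pooling over real tokens

a = embed("patient presents with myocardial infarction")
b = embed("the patient had a heart attack")
print(torch.nn.functional.cosine_similarity(a, b).item())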

Can't parse serialized Example

It looks like loading the model from the checkpoint failed; any hints? Thanks.

I0731 15:33:06.451731 140067755902720 basic_session_run_hooks.py:606] Saving checkpoints for 0 into /home/ec2-user/robin/clinicalBERT/output/model/model.ckpt.
2022-07-31 15:33:31.584083: I tensorflow/stream_executor/platform/default/dso_loader.cc:42] Successfully opened dynamic library libcublas.so.10.0
2022-07-31 15:33:31.795626: W tensorflow/core/framework/op_kernel.cc:1502] OP_REQUIRES failed at example_parsing_ops.cc:240 : Invalid argument: Key: input_ids. Can't parse serialized Example.
2022-07-31 15:33:31.795633: W tensorflow/core/framework/op_kernel.cc:1502] OP_REQUIRES failed at example_parsing_ops.cc:240 : Invalid argument: Key: input_mask. Can't parse serialized Example.
2022-07-31 15:33:31.795777: W tensorflow/core/framework/op_kernel.cc:1502] OP_REQUIRES failed at example_parsing_ops.cc:240 : Invalid argument: Key: input_ids. Can't parse serialized Example.
2022-07-31 15:33:31.795937: W tensorflow/core/framework/op_kernel.cc:1502] OP_REQUIRES failed at example_parsing_ops.cc:240 : Invalid argument: Key: input_ids. Can't parse serialized Example.
2022-07-31 15:33:31.795937: W tensorflow/core/framework/op_kernel.cc:1502] OP_REQUIRES failed at example_parsing_ops.cc:240 : Invalid argument: Key: input_mask. Can't parse serialized Example.
2022-07-31 15:33:31.796025: W tensorflow/core/framework/op_kernel.cc:1502] OP_REQUIRES failed at example_parsing_ops.cc:240 : Invalid argument: Key: masked_lm_ids. Can't parse serialized Example.
2022-07-31 15:33:31.796329: W tensorflow/core/framework/op_kernel.cc:1502] OP_REQUIRES failed at example_parsing_ops.cc:240 : Invalid argument: Key: masked_lm_positions. Can't parse serialized Example.
2022-07-31 15:33:31.796348: W tensorflow/core/framework/op_kernel.cc:1502] OP_REQUIRES failed at example_parsing_ops.cc:240 : Invalid argument: Key: input_ids. Can't parse serialized Example.
2022-07-31 15:33:31.796660: W tensorflow/core/framework/op_kernel.cc:1502] OP_REQUIRES failed at example_parsing_ops.cc:240 : Invalid argument: Key: input_mask. Can't parse serialized Example.
2022-07-31 15:33:31.796796: W tensorflow/core/framework/op_kernel.cc:1502] OP_REQUIRES failed at example_parsing_ops.cc:240 : Invalid argument: Key: segment_ids. Can't parse serialized Example.
2022-07-31 15:33:31.796889: W tensorflow/core/framework/op_kernel.cc:1502] OP_REQUIRES failed at example_parsing_ops.cc:240 : Invalid argument: Key: input_mask. Can't parse serialized Example.
2022-07-31 15:33:31.796975: W tensorflow/core/framework/op_kernel.cc:1502] OP_REQUIRES failed at example_parsing_ops.cc:240 : Invalid argument: Key: masked_lm_weights. Can't parse serialized Example.
2022-07-31 15:33:31.797047: W tensorflow/core/framework/op_kernel.cc:1502] OP_REQUIRES failed at example_parsing_ops.cc:240 : Invalid argument: Key: masked_lm_positions. Can't parse serialized Example.
2022-07-31 15:33:31.797134: W tensorflow/core/framework/op_kernel.cc:1502] OP_REQUIRES failed at example_parsing_ops.cc:240 : Invalid argument: Key: input_ids. Can't parse serialized Example.
2022-07-31 15:33:31.797213: W tensorflow/core/framework/op_kernel.cc:1502] OP_REQUIRES failed at example_parsing_ops.cc:240 : Invalid argument: Key: masked_lm_positions. Can't parse serialized Example.
2022-07-31 15:33:31.797291: W tensorflow/core/framework/op_kernel.cc:1502] OP_REQUIRES failed at example_parsing_ops.cc:240 : Invalid argument: Key: input_ids. Can't parse serialized Example.
ERROR:tensorflow:Error recorded from training_loop: 2 root error(s) found.
(0) Invalid argument: Key: input_ids. Can't parse serialized Example.
[[{{node ParseSingleExample/ParseSingleExample}}]]
[[IteratorGetNext]]
(1) Invalid argument: Key: input_ids. Can't parse serialized Example.
[[{{node ParseSingleExample/ParseSingleExample}}]]
[[IteratorGetNext]]
[[IteratorGetNext/_4973]]
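
A "Can't parse serialized Example" error on fixed-length features such as input_ids usually means the lengths stored in the TFRecords do not match the max_seq_length / max_predictions_per_seq values the training script expects. A quick way to check is to inspect one record (TF2 eager-style sketch; the TFRecord path is a placeholder):

import tensorflow as tf

dataset = tf.data.TFRecordDataset("output/tf_examples.tfrecord")  # placeholder path
for raw in dataset.take(1):
    example = tf.train.Example()
    example.ParseFromString(raw.numpy())
    for name, feature in example.features.feature.items():
        length = len(feature.int64_list.value) or len(feature.float_list.value)
        print(name, length)

If the printed lengths differ from the flags passed to the pretraining script, regenerating the pretraining data with matching values should resolve the parse errors.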

Question about mimic data preparation process

Hi Emily,

After acquiring access to the MIMIC-III database, I preprocessed the data following your procedure (i.e. format_mimic_for_BERT.py).

However, I am not confident that the results below are correct. Can you confirm?

(output after running format_mimic_for_BERT.py; not shown here)

Thanks
Young-Jun

Clinical BERT initialized from BERT-Base is not available

Thanks for making Clinical BERT publicly available. The paper states: "We train and publicly release BERT-Base and BioBERT-finetuned models trained on both all clinical notes and only discharge summaries".

However, only the BioBERT-finetuned models are available. When will you release the BERT-Base-finetuned models?

Weight initialization Error on using pretrained model in pytorch

I am getting this error

Weights of BertForMultiLable not initialized from pretrained model: ['classifier.weight', 'classifier.bias']
Weights from pretrained model not used in BertForMultiLable: ['cls.predictions.bias', 'cls.predictions.transform.dense.weight', 'cls.predictions.transform.dense.bias', 'cls.predictions.transform.LayerNorm.weight', 'cls.predictions.transform.LayerNorm.bias', 'cls.predictions.decoder.weight', 'cls.seq_relationship.weight', 'cls.seq_relationship.bias']

while using the pretrained BioBERT-initialized model provided here in this repo; here is the issue for reference.

Is there some issue with the model?
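
For what it's worth, warnings of this shape are generally expected when a pretrained encoder is loaded under a new task head: the classifier.* weights are freshly initialized (they are learned during fine-tuning), and the cls.* masked-LM/NSP head weights from pretraining are simply unused. A minimal sketch with the standard transformers class (BertForMultiLable comes from a third-party repo, so this only illustrates the general pattern; num_labels is an example value):

from transformers import BertForSequenceClassification

# The classification head is new, so "not initialized from pretrained model"
# warnings for classifier.* and "not used" warnings for cls.* are expected.
model = BertForSequenceClassification.from_pretrained(
    "emilyalsentzer/Bio_ClinicalBERT", num_labels=5)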

Known issue with section splitting in heuristic_tokenize.py

There are two bugs in the sent_tokenize_rules function in heuristic_tokenize.py.

We have not fixed them in this repo because we want to preserve the reproducibility of the code as it was when the work was published. However, anyone extending this work should make the following changes in heuristic_tokenize.py (a small demonstration of the first fix follows after the list):

  1. Fix the bug on line #168, where . should be replaced with \., i.e. it should read while re.search('\n\s*%d\.'%n,segment):
  2. Add an else statement (else: new_segments.append(segments[i])) to the if statement at line 287, if (i == N-1) or is_title(segments[i+1]):. This fixes a bug where lists that have a title header lose their first entry.
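
To make the first fix concrete, here is a small standalone demonstration (not taken from heuristic_tokenize.py) of why the unescaped dot is a bug: it matches any character after the list number, not just a literal period.

import re

n = 1
segment = "\n 1x daily dosing"  # contains no numbered list item "1."

print(bool(re.search(r'\n\s*%d.' % n, segment)))   # True: '.' also matches the 'x'
print(bool(re.search(r'\n\s*%d\.' % n, segment)))  # False: '\.' requires a literal period

The second fix simply ensures the current segment is still appended to new_segments when the if condition is not met, which is how lists with a title header were losing their first entry.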

Datasets

Hello @EmilyAlsentzer,

This is a great contribution to the open source community! I have read your paper thoroughly: https://www.aclweb.org/anthology/W19-1909.pdf

I have a few questions:

  1. I would love to try out both clinicalBERT and BioBERT on a few downstream tasks (disease identification); however, I do not have much training data (in fact, none). Could you please point me to some openly available data repositories that already provide a notes --> disease mapping?

  2. I see you used the typical BERT pretraining approach (MLM), but I would like to explore other pretraining strategies such as replaced token detection (from ELECTRA, etc.). I also see that you used the MIMIC-III dataset for pretraining, which I don't have access to. What would you suggest as pretraining datasets?

  3. I would also love to try new transformer variants (larger ones, low-parameter ones) and do multitask learning, so datasets (de-identified, without PHI) seem to be the bottleneck. How can I overcome this?

I would open-source all of my work in PyTorch if I could find a tangible data source. Please let me know. Thanks!

missing mli_train_v1.jsonl

Hi,
run_classifier.py is looking for the JSON file mli_train_v1.jsonl.
How can I obtain or construct this file?

Unable to reproduce results

I can't seem to reproduce your results on MedNLI with the two released models using the same hyperparameters presented in your paper's Appendix B. You reported 84-85%, but I can only get to 81-82% on the test set. Do you know why? Are the reported results on the dev set or the test set? If relevant, I'm using the pytorch-pretrained-bert repo.

Which labels to use with the transformers library?

Hello,

This looks like a great piece of work; thank you for making it available. I tried to explore clinicalBERT for some NER tasks using the transformers library. I can obtain a list of token-index results from torch.argmax, but I cannot find a suitable set of labels (predictions contains values as large as 687). What am I doing wrong?

from transformers import AutoTokenizer, AutoModel
import torch

tokenizer = AutoTokenizer.from_pretrained("emilyalsentzer/Bio_ClinicalBERT")

model = AutoModel.from_pretrained("emilyalsentzer/Bio_ClinicalBERT")

label_list = ["B-IDNUM", "I-IDNUM", "B-HOSPITAL", "I-HOSPITAL", 'B-PATIENT', 'I-PATIENT', 'B-PHONE', 'I-PHONE',
        'B-DATE', 'I-DATE', 'B-DOCTOR', 'I-DOCTOR', 'B-LOCATION-OTHER', 'I-LOCATION-OTHER', 'B-AGE', 'I-AGE', 'B-BIOID', 'I-BIOID',
        'B-STATE', 'I-STATE','B-ZIP', 'I-ZIP', 'B-HEALTHPLAN', 'I-HEALTHPLAN', 'B-ORGANIZATION', 'I-ORGANIZATION',
        'B-MEDICALRECORD', 'I-MEDICALRECORD', 'B-CITY', 'I-CITY', 'B-STREET', 'I-STREET', 'B-COUNTRY', 'I-COUNTRY',
        'B-URL', 'I-URL',
        'B-USERNAME', 'I-USERNAME', 'B-PROFESSION', 'I-PROFESSION', 'B-FAX', 'I-FAX', 'B-EMAIL', 'I-EMAIL', 'B-DEVICE', 'I-DEVICE',
        'O', "X", "[CLS]", "[SEP]"]

sequence = "Patient had severe headache and took two Aspirine."

# Bit of a hack to get the tokens with the special tokens
tokens = tokenizer.tokenize(tokenizer.decode(tokenizer.encode(sequence)))
inputs = tokenizer.encode(sequence, return_tensors="pt")

outputs = model(inputs)[0]
predictions = torch.argmax(outputs, dim=2)

print([(token, label_list[prediction]) for token, prediction in zip(tokens, predictions[0].tolist())])

Lars
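
For reference (not an official answer): the snippet above calls AutoModel, which returns 768-dimensional hidden states rather than label logits, so torch.argmax over the last dimension yields indices up to 767 (hence values like 687). A token-classification head with your own label set would be needed, along the lines of this sketch (label_list as defined in the snippet above); the head is randomly initialized and must be fine-tuned on labeled NER data before its predictions are meaningful:

from transformers import AutoTokenizer, AutoModelForTokenClassification

id2label = {i: l for i, l in enumerate(label_list)}
label2id = {l: i for i, l in enumerate(label_list)}

tokenizer = AutoTokenizer.from_pretrained("emilyalsentzer/Bio_ClinicalBERT")
model = AutoModelForTokenClassification.from_pretrained(
    "emilyalsentzer/Bio_ClinicalBERT",
    num_labels=len(label_list),
    id2label=id2label,
    label2id=label2id,
)
# model now outputs one logit per label per token; fine-tune it before
# interpreting argmax over the logits as NER tags.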

Using NER for running HuggingFace Pipeline

I tried using this model with HuggingFace's transformers.pipeline to establish a baseline for doing NER on some data I have, but I ran into index errors because the id2label dictionary in the model's config currently has only 2 labels, {0: 'LABEL_0', 1: 'LABEL_1'}. Do you have a full set of labels, or should I go about getting these predictions another way?

Time of Preprocessing

Thanks for the great repo. I tested the preprocessing script; it processes about 100 notes per minute, which gives a total ETA of roughly 15 days. Do you have any ideas for speeding this up, or did you spend a similar amount of time?
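
One practical workaround (my own suggestion, not from the repo) is to shard the notes across processes, since per-note sentence splitting is CPU-bound. process_note below is a hypothetical stand-in for whatever the preprocessing script does per note, and any spaCy/scispacy model should be loaded inside each worker rather than passed between processes:

import multiprocessing as mp
import pandas as pd

def process_note(text):
    # placeholder for the real per-note cleaning / sentence-splitting logic
    return text.strip().lower()

if __name__ == "__main__":
    notes = pd.read_csv("NOTEEVENTS.csv", usecols=["TEXT"])  # MIMIC-III notes table
    with mp.Pool(processes=8) as pool:
        processed = pool.map(process_note, notes["TEXT"].tolist(), chunksize=100)
    print(len(processed), "notes processed")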

Pretrained model support for TensorFlow v2

I have converted the scripts from TF1 to TF2 and am trying to use one of your pretrained models to initialize my pretraining. It throws a "Key bert/embeddings/layer_normalization/beta not found in checkpoint" error. I understand this error is caused by one of the functions changed in TensorFlow v2, where tf.contrib.layers.layer_norm(inputs=input_tensor, begin_norm_axis=-1, begin_params_axis=-1, scope=name) is replaced by tf.keras.layers.LayerNormalization(axis=-1)(input_tensor). By the way, this is in model.py, line 364.

Without using init_checkpoint, everything works fine.

Therefore, I would like to check: did you build any model with TensorFlow v2 that uses the above LayerNormalization change?
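
For what it's worth, the original BERT code stores its layer-norm variables under names like bert/embeddings/LayerNorm/..., while tf.keras.layers.LayerNormalization creates variables under its own layer name (e.g. layer_normalization/...), so the checkpoint keys no longer line up. One way to confirm the mismatch is to list the checkpoint's variable names and compare them to the names your TF2 model builds; a short sketch (the checkpoint prefix below is an assumption, adjust it to your download):

import tensorflow as tf

ckpt = "bert_pretrain_output_all_notes_150000/model.ckpt-150000"  # example prefix
for name, shape in tf.train.list_variables(ckpt):
    if "LayerNorm" in name or "layer_norm" in name:
        print(name, shape)

If the names differ only in this way, naming the Keras layer to match the checkpoint, or restoring with a custom assignment map, are possible workarounds.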

Getting Chinese or Japanese words from the pretrained model instead of English

I am trying masked-word prediction with the pretrained Bio_ClinicalBERT, but instead of English words I am getting Chinese or Japanese words in the output.
Below is my code:

from transformers import BertTokenizer, BertForMaskedLM
import torch

bio_bert_tokenizer = BertTokenizer.from_pretrained('Bio_ClinicalBERT')
bio_bert_model = BertForMaskedLM.from_pretrained('Bio_ClinicalBERT').eval()
# encode, decode, text_sentence, top_k, top_clean are helpers defined elsewhere in my script
input_ids, mask_idx = encode(bio_bert_tokenizer, text_sentence)
with torch.no_grad():
    predict = bert_model(input_ids)[0]  # note: this calls 'bert_model', not the 'bio_bert_model' loaded above
bio_bert = decode(bio_bert_tokenizer, predict[0, mask_idx, :].topk(top_k).indices.tolist(), top_clean)
print(bio_bert)

The output is shown in the attached screenshot (Screenshot from 2021-04-02 14-43-26; not reproduced here).
I downloaded all of the model files from https://huggingface.co/emilyalsentzer/Bio_ClinicalBERT/tree/main
I don't know whether I am making a naive mistake; please excuse me, as I am new to the transformers library.

unable to install requirements.txt

Would you have an updated requirements.txt? Most of the modules are not found:

Solving environment: failed with initial frozen solve. Retrying with flexible solve.

PackagesNotFoundError: The following packages are not available from current channels:

  • scispacy==0.1.0=pypi_0
  • psutil==5.5.0=pypi_0
  • awscli==1.16.111=pypi_0
  • jupyter-highlight-selected-word==0.2.0=pypi_0
  • astor==0.7.1=pypi_0
  • grpcio==1.18.0=pypi_0
  • s3transfer==0.1.13=pypi_0
  • boto3==1.9.86=pypi_0
  • statistics==1.0.3.5=pypi_0
  • jupyter-contrib-core==0.3.3=pypi_0
  • jupyter-nbextensions-configurator==0.4.1=pypi_0
  • spacy==2.0.18=pypi_0
  • tensorflow-gpu==1.12.0=pypi_0
  • docutils==0.14=pypi_0
  • markdown==3.0.1=pypi_0
  • jmespath==0.9.3=pypi_0
  • tensorboard==1.12.2=pypi_0
  • lxml==4.3.0=pypi_0
  • keras-applications==1.0.7=pypi_0
  • colorama==0.3.9=pypi_0
  • jupyter-latex-envs==1.4.6=pypi_0
  • gast==0.2.2=pypi_0
  • werkzeug==0.14.1=pypi_0
  • pyyaml==3.13=pypi_0
  • stanfordcorenlp==3.9.1.1=pypi_0
  • rsa==3.4.2=pypi_0
  • keras-preprocessing==1.0.9=pypi_0
  • termcolor==1.1.0=pypi_0
  • pytorch-pretrained-bert==0.4.0=pypi_0
  • en-core-sci-md==0.1.0=pypi_0
  • en-core-web-sm==2.0.0=pypi_0
  • stanfordnlp==0.1.0=pypi_0
  • protobuf==3.6.1=pypi_0
  • en-core-sci-sm==0.1.0=pypi_0
  • pyasn1==0.4.5=pypi_0
  • conllu==1.2.2=pypi_0
  • scipy==1.2.1=pypi_0
  • torch==1.0.0=pypi_0
  • absl-py==0.7.0=pypi_0
  • regex==2018.1.10=pypi_0
  • botocore==1.12.101=pypi_0
  • h5py==2.9.0=pypi_0

Getting Started using transformers

When I try to get started with the model "emilyalsentzer/Bio_ClinicalBERT" using the model card at Hugging Face and the following code:

from transformers import AutoTokenizer, AutoModel
tokenizer = AutoTokenizer.from_pretrained("emilyalsentzer/Bio_ClinicalBERT")
model = AutoModel.from_pretrained("emilyalsentzer/Bio_ClinicalBERT")

I get the following error:

Traceback (most recent call last):
  File "test.py", line 4, in <module>
    tokenizer = AutoTokenizer.from_pretrained("emilyalsentzer/Bio_ClinicalBERT")
  File "/Users/Lukas/miniconda3/envs/nlp/lib/python3.7/site-packages/transformers/tokenization_auto.py", line 124, in from_pretrained
    "'xlm', 'roberta', 'ctrl'".format(pretrained_model_name_or_path))
ValueError: Unrecognized model identifier in emilyalsentzer/Bio_ClinicalBERT. Should contains one of 'bert', 'openai-gpt', 'gpt2', 'transfo-xl', 'xlnet', 'xlm', 'roberta', 'ctrl'

I would appreciate any help regarding this.

Clarification on Tokenizer for MedNLI

For MedNLI, it seems as though you used tokenizer = BertTokenizer.from_pretrained(args.bert_model, do_lower_case=args.do_lower_case). Is it correct to say that the BERT tokenizer you used for MedNLI is bert-base-cased, as opposed to scispacy? If so, what is the thinking behind this?

Using pretrained clinicalBert model for extracting word/sentence or whole clinical note representation

Hi @EmilyAlsentzer,

I tried to extract features as you suggested but ran into a problem. When I run the original BERT example below, everything works fine.

echo 'Who was Jim Henson ? ||| Jim Henson was a puppeteer' > /tmp/input.txt

python extract_features.py \
  --input_file=/tmp/input.txt \
  --output_file=/tmp/output.jsonl \
  --vocab_file=$BERT_BASE_DIR/vocab.txt \
  --bert_config_file=$BERT_BASE_DIR/bert_config.json \
  --init_checkpoint=$BERT_BASE_DIR/bert_model.ckpt \
  --layers=-1,-2,-3,-4 \
  --max_seq_length=128 \
  --batch_size=8

I then changed the bert_config_file and init_checkpoint arguments and ran the command below.

python extract_features.py \
  --input_file=/tmp/input.txt \
  --output_file=/tmp/output.jsonl \
  --vocab_file=bert_pretrain_output_all_notes_150000/vocab.txt \
  --bert_config_file=bert_pretrain_output_all_notes_150000/bert_config.json \
  --init_checkpoint=bert_pretrain_output_all_notes_150000/model.ckpt \
  --layers=-1,-2,-3,-4 \
  --max_seq_length=128

I got the error message below. I think the problem is with the init_checkpoint argument; I tried different names such as "model.ckpt" and "model.ckpt-150000", but none of them work.

tensorflow.python.framework.errors_impl.NotFoundError: Unsuccessful TensorSliceReader constructor: Failed to find any matching files for bert_pretrain_output_all_notes_150000/model.ckpt

Could you please help me run ClinicalBERT to extract features from clinical notes?
Also, is it possible to use ClinicalBERT to extract embeddings of each word in a clinical note?
Thanks in advance.
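
As an aside, if the TensorFlow checkpoint path keeps failing, the Hugging Face release offers a simpler route to per-token and per-note embeddings in modern setups. A rough sketch (mean pooling is just one naive choice of note representation):

from transformers import AutoTokenizer, AutoModel
import torch

tokenizer = AutoTokenizer.from_pretrained("emilyalsentzer/Bio_ClinicalBERT")
model = AutoModel.from_pretrained("emilyalsentzer/Bio_ClinicalBERT")
model.eval()

note = "Patient was admitted with shortness of breath."
inputs = tokenizer(note, return_tensors="pt", truncation=True, max_length=128)
with torch.no_grad():
    outputs = model(**inputs)

token_embeddings = outputs.last_hidden_state[0]   # one 768-d vector per WordPiece token
note_embedding = token_embeddings.mean(dim=0)     # naive mean-pooled note representation
print(token_embeddings.shape, note_embedding.shape)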

Consequences when using sequence input larger than 128

Hi,

First of all, thank you for your great work!

I am trying to fine-tune emilyalsentzer/Bio_Discharge_Summary_BERT on a downstream multi-label classification (MLC) task. As far as I understand, you initialized from BioBERT (which supports a maximum sequence length of 512) and trained on data with a maximum sequence length of 128.

Less than 5% of my tokenized data has a length between 128 and 512. Truncation is not an option in my application, and because my dataset is imbalanced I don't want to filter out sequences longer than 128.

My question is: could fine-tuning with this data cause any issues down the line in terms of model performance?

Sorry, it is a bit of a basic question, but I can't seem to find a concrete answer about the impact of choosing a larger max_seq_len than what you trained with.

Thank you again! :)
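
For what it's worth, the 128 appears to refer to the maximum sequence length used during the MIMIC pretraining runs, while the position-embedding table inherited from BERT/BioBERT still has 512 entries, so fine-tuning with sequences up to 512 is mechanically possible (positions beyond 128 simply saw less clinical pretraining signal). One way to confirm the model's hard limit:

from transformers import AutoConfig

config = AutoConfig.from_pretrained("emilyalsentzer/Bio_Discharge_Summary_BERT")
print(config.max_position_embeddings)  # 512 for BERT-base-style models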

Vocabulary for the pretrained model is not updated? Any reason why?

Thanks for making such a comprehensive BERT model available.

I am concerned about the actual words I find in the model's vocabulary, though.
The model card states: "The Bio_ClinicalBERT model was trained on all notes from MIMIC III, a database containing electronic health records from ICU patients at the Beth Israel Hospital in Boston, MA. For more details on MIMIC". I assumed this would mean that the vocabulary was also updated.

But when I inspect the vocabulary, I don't see medical concepts:

from transformers import TFBertModel,  BertConfig, BertTokenizerFast
# Load pre-trained model tokenizer (vocabulary)
tokenizer = BertTokenizerFast.from_pretrained('emilyalsentzer/Bio_ClinicalBERT')
tokenizer.vocab.keys()

['Cafe', 'locomotive', 'sob', 'Emilio', 'Amazing', '##ired', 'Lai', 'NSA', 'counts', '##nius', 'assumes', 'talked', 'ク', 'rumor', 'Lund', 'Right', 'Pleasant', 'Aquino', 'Synod', 'scroll', '##cope', 'guitarist', 'AB', '##phere', 'resulted', 'relocation', 'ṣ', 'electors', '##tinuum', 'shuddered', 'Josephine', '"', 'nineteenth', 'hydroelectric', '##genic', '68', '1000', 'offensive', 'Activities', '##ito', 'excluded', '************', 'protruding', '1832', 'perpetual', 'cu', '##36', 'outlet', 'elaborate', '##aft', 'yesterday', '##ope', 'rockets', 'Eduard', 'straining', '510', 'passion', 'Too', 'conferred', 'geography', '38', 'Got', 'snail', 'cellular', '##cation', 'blinked', 'transmitted', 'Pasadena', 'escort', 'bombings', 'Philips', '##cky', 'sacks', '##Ñ', 'jumps', 'Advertising', 'Officer', '##ulp', 'potatoes', 'concentration', 'existed', '##rrigan', '##ier', 'Far', 'models', 'strengthen', 'mechanics'...]

Am I missing something here?

Also, is there an uncased version of this model?

NER on i2b2 datasets

Hi,
thanks for your release; it's great work. But I still have a question.

In the paper, you mention that 'Clinical BERT and Clinical BioBERT were applied to four i2b2 NER tasks, all in IOB format'. I want to reproduce this work, but this repo does not include the NER code in the 'downstream_tasks' directory. Can you share the code?
Also, can you tell me how to convert the four i2b2 datasets into BIO format?
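
For reference, converting span annotations to IOB/BIO tags is usually a small script once the i2b2 annotation files have been parsed into character offsets. The (start, end, label) span format below is a hypothetical intermediate representation, not the raw i2b2 format, which must be parsed first:

def spans_to_iob(text, spans):
    # spans: list of (start_char, end_char, label) tuples (hypothetical format)
    tokens, tags, pos = [], [], 0
    for tok in text.split():
        start = text.index(tok, pos)
        end = start + len(tok)
        pos = end
        tag = "O"
        for (s, e, label) in spans:
            if start >= s and end <= e:
                tag = ("B-" if start == s else "I-") + label
                break
        tokens.append(tok)
        tags.append(tag)
    return list(zip(tokens, tags))

print(spans_to_iob("Patient denies chest pain today", [(15, 25, "problem")]))
# [('Patient', 'O'), ('denies', 'O'), ('chest', 'B-problem'), ('pain', 'I-problem'), ('today', 'O')]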

Question: Better to use regular BioBERT on a dataset without marked PHI?

Thanks for this cool resource. I'm just trying to figure out if it's the best model for my project. In the results section of your paper, it says:

De-ID challenge data presents a different data distribution than MIMIC text. In MIMIC, PHI is identified and replaced with sentinel PHI markers, whereas in the de-ID task, PHI is masked with synthetic, but realistic PHI. This data drift would be problematic for any embedding model, but will be especially damaging to contextual embedding models like BERT because the underlying sentence structure will have changed: in raw MIMIC, sentences with PHI will universally have a sentinel PHI token. In contrast, in the de-ID corpus, all such sentences will have different synthetic masks, meaning that a canonical, nearly constant sentence structure present during BERT’s training will be non-existent at task-time. For these reasons, we think it is sensible that clinical BERT is not successful on the de-ID corpora.

I'm working with EHRs for patients with multiple myeloma. The records are not de-identified in any way--they're just regular doctors' notes, lab reports, etc., with real place names, person names, and dates. So it sounds to me like my data is more like the de-ID dataset than the MIMIC dataset, since PHI isn't tagged in any way. Would I be better off just using the regular BioBERT model, since that model performed better on the de-ID dataset?

Installing dependencies

I am trying to install the dependencies by running
conda create --name <env> --file requirements.txt
However, since I don't have many of the required channels, my conda cannot install all of them. Moreover, my conda also cannot find the pip packages (pypi_0) from my current channels, even though I have pip. Could you please provide the .yml file of your environment, which contains the full description of the environment, including the conda channels?
Thanks!

How to run model and finetune it

Hi guys,
I need to know how to load and run the clinicalBERT model. clinicalBERT exactly matches my requirements; I searched online but couldn't find any useful resources. Can you please help me with this model? Thanks in advance.
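
As a starting point (not official code from this repo), here is a minimal fine-tuning sketch using the Hugging Face transformers and PyTorch APIs; the texts, labels, and hyperparameters are toy placeholders:

from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch

tokenizer = AutoTokenizer.from_pretrained("emilyalsentzer/Bio_ClinicalBERT")
model = AutoModelForSequenceClassification.from_pretrained(
    "emilyalsentzer/Bio_ClinicalBERT", num_labels=2)

# Toy labeled examples; replace with your own clinical task data.
texts = ["no acute distress", "severe chest pain radiating to left arm"]
labels = torch.tensor([0, 1])

batch = tokenizer(texts, padding=True, truncation=True, max_length=128, return_tensors="pt")
optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)

model.train()
for _ in range(3):                       # a few toy epochs
    out = model(**batch, labels=labels)  # loss is computed internally
    out.loss.backward()
    optimizer.step()
    optimizer.zero_grad()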

Using the fine tuned models

Hi! I was able to use your clinical BERT models and slightly modified versions of your finetuning code to create a new NER model--worked like a charm :-) However, I'm having trouble loading the saved model and using it to make new predictions. To be more specific, if I set the train and eval flags to false and then run only predict, the model does appear to be able to make predictions on new input, but it is still attempting to split the input data into cross validation folds. I was wondering if you have some code and/or suggestions to avoid this? If not, I can figure it out, but I thought I'd check with you first to make sure I'm not missing anything. Thanks for your time, and thank you for sharing your fantastic work!

Model size & casing

Thanks for the release! Is this based on BERT base or BERT large? Also, is it the cased model or the uncased one?

Need python packages in run_ner.py

Dear @EmilyAlsentzer
Your clinicalBERT is great work, and I want to reimplement it.
run_ner.py requires Python modules such as 'modeling', 'optimization', and 'tokenization'.
Can you share these modules?
Thank you

About the pretrained model

Hi:
First of all, thanks for sharing your pretrained model : )
I'm a WPI Data Science Master's student doing an NLP internship at UMass Medical School. Your pretrained model should be very helpful for me 👍

I have a question.
After downloading the pretrained model, I found a set of TensorFlow model files (1.2 GB) and a PyTorch model file (400 MB).
Are they the same model?

Multi-label classification of clinical text

I'm trying to do multi-label classification (MLC) using the pretrained weights (trained on all notes, as in this paper). The data is imbalanced, i.e. some classes occur much more frequently than others. After applying the ML-ROS oversampling technique, the mean IRLbl decreased, but the data is still imbalanced, so the model predicts the most frequently occurring labels every time (for any random input). Do you have any suggestions?
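
One common complement to oversampling (my suggestion, not from the paper) is to re-weight the positive class per label in the loss so that rare labels contribute more. A minimal PyTorch sketch with a hypothetical multi-hot label matrix:

import torch

# Hypothetical multi-hot label matrix: rows = training examples, columns = labels.
train_labels = torch.randint(0, 2, (1000, 10)).float()

pos_counts = train_labels.sum(dim=0)                # positives per label
neg_counts = train_labels.shape[0] - pos_counts     # negatives per label
pos_weight = neg_counts / pos_counts.clamp(min=1)   # up-weight rare labels

criterion = torch.nn.BCEWithLogitsLoss(pos_weight=pos_weight)
# In the training loop: loss = criterion(logits, multi_hot_targets)

Tuning a per-label decision threshold (rather than a single 0.5 cutoff) is another knob worth trying.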

model_max_len parameter in tokenizer

Hello,

the tokenizer has model_max_len=1000000000000000019884624838656:
tokenizer = AutoTokenizer.from_pretrained("emilyalsentzer/Bio_ClinicalBERT")

PreTrainedTokenizerFast(name_or_path='emilyalsentzer/Bio_ClinicalBERT', vocab_size=28996, model_max_len=1000000000000000019884624838656, is_fast=True, padding_side='right', special_tokens={'unk_token': '[UNK]', 'sep_token': '[SEP]', 'pad_token': '[PAD]', 'cls_token': '[CLS]', 'mask_token': '[MASK]'})

However, the model card at https://huggingface.co/emilyalsentzer/Bio_ClinicalBERT mentions that the maximum sequence length is 128. Could you please clarify this?

Thanks!

Model not able to initialize weights

Dear all,

I am getting an initialization error when loading the model with BertForTokenClassification (see the attached screenshot, not reproduced here). Is there any way to resolve this error?

Thank You.
