
clinicalbert's People

Contributors

emilyalsentzer, tnaumann


clinicalbert's Issues

Clarification on Tokenizer for MIMIC

I don't see the BERT tokenizer used in the code for the MIMIC fine-tuning (it appears to use custom tokenizer code instead), and that code does not seem to perform the WordPiece tokenization used in the rest of BERT. You do appear to use the BERT tokenizer for the MedNLI task. Please clarify.

inquiry about cosine similarity between tokenized sentences

Hi there; you're honestly doing God's work here by sharing this on Hugging Face.

I am, however, quite confused about how to use this tool appropriately. I originally tried to tokenize sentences with the clinicalBERT model trained on discharge summaries and checked whether it could recognize similar medical terminology and group it together, or return highly similar words. So far, base BERT seems to perform better. Is there any chance your work will be extended into an STS-B-style sentence-similarity model?
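
For context, here is a minimal mean-pooling sketch (my own, not from the repo) for comparing two sentences with the Hugging Face release. Note that plain masked-LM BERT variants are not trained with a sentence-similarity objective, so an STS-style fine-tuned model would usually be needed for strong similarity scores.

from transformers import AutoTokenizer, AutoModel
import torch

tokenizer = AutoTokenizer.from_pretrained("emilyalsentzer/Bio_ClinicalBERT")
model = AutoModel.from_pretrained("emilyalsentzer/Bio_ClinicalBERT")
model.eval()

def embed(text):
    inputs = tokenizer(text, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**inputs).last_hidden_state          # (1, seq_len, 768)
    mask = inputs["attention_mask"].unsqueeze(-1).float()
    return (hidden * mask).sum(1) / mask.sum(1)              # mean pooling over real tokens

a = embed("patient presents with myocardial infarction")
b = embed("the patient had a heart attack")
print(torch.nn.functional.cosine_similarity(a, b).item())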

Can't parse serialized Example

It looks like loading the model from the checkpoint failed; any hints? Thanks.

I0731 15:33:06.451731 140067755902720 basic_session_run_hooks.py:606] Saving checkpoints for 0 into /home/ec2-user/robin/clinicalBERT/output/model/model.ckpt.
2022-07-31 15:33:31.584083: I tensorflow/stream_executor/platform/default/dso_loader.cc:42] Successfully opened dynamic library libcublas.so.10.0
2022-07-31 15:33:31.795626: W tensorflow/core/framework/op_kernel.cc:1502] OP_REQUIRES failed at example_parsing_ops.cc:240 : Invalid argument: Key: input_ids. Can't parse serialized Example.
2022-07-31 15:33:31.795633: W tensorflow/core/framework/op_kernel.cc:1502] OP_REQUIRES failed at example_parsing_ops.cc:240 : Invalid argument: Key: input_mask. Can't parse serialized Example.
2022-07-31 15:33:31.795777: W tensorflow/core/framework/op_kernel.cc:1502] OP_REQUIRES failed at example_parsing_ops.cc:240 : Invalid argument: Key: input_ids. Can't parse serialized Example.
2022-07-31 15:33:31.795937: W tensorflow/core/framework/op_kernel.cc:1502] OP_REQUIRES failed at example_parsing_ops.cc:240 : Invalid argument: Key: input_ids. Can't parse serialized Example.
2022-07-31 15:33:31.795937: W tensorflow/core/framework/op_kernel.cc:1502] OP_REQUIRES failed at example_parsing_ops.cc:240 : Invalid argument: Key: input_mask. Can't parse serialized Example.
2022-07-31 15:33:31.796025: W tensorflow/core/framework/op_kernel.cc:1502] OP_REQUIRES failed at example_parsing_ops.cc:240 : Invalid argument: Key: masked_lm_ids. Can't parse serialized Example.
2022-07-31 15:33:31.796329: W tensorflow/core/framework/op_kernel.cc:1502] OP_REQUIRES failed at example_parsing_ops.cc:240 : Invalid argument: Key: masked_lm_positions. Can't parse serialized Example.
2022-07-31 15:33:31.796348: W tensorflow/core/framework/op_kernel.cc:1502] OP_REQUIRES failed at example_parsing_ops.cc:240 : Invalid argument: Key: input_ids. Can't parse serialized Example.
2022-07-31 15:33:31.796660: W tensorflow/core/framework/op_kernel.cc:1502] OP_REQUIRES failed at example_parsing_ops.cc:240 : Invalid argument: Key: input_mask. Can't parse serialized Example.
2022-07-31 15:33:31.796796: W tensorflow/core/framework/op_kernel.cc:1502] OP_REQUIRES failed at example_parsing_ops.cc:240 : Invalid argument: Key: segment_ids. Can't parse serialized Example.
2022-07-31 15:33:31.796889: W tensorflow/core/framework/op_kernel.cc:1502] OP_REQUIRES failed at example_parsing_ops.cc:240 : Invalid argument: Key: input_mask. Can't parse serialized Example.
2022-07-31 15:33:31.796975: W tensorflow/core/framework/op_kernel.cc:1502] OP_REQUIRES failed at example_parsing_ops.cc:240 : Invalid argument: Key: masked_lm_weights. Can't parse serialized Example.
2022-07-31 15:33:31.797047: W tensorflow/core/framework/op_kernel.cc:1502] OP_REQUIRES failed at example_parsing_ops.cc:240 : Invalid argument: Key: masked_lm_positions. Can't parse serialized Example.
2022-07-31 15:33:31.797134: W tensorflow/core/framework/op_kernel.cc:1502] OP_REQUIRES failed at example_parsing_ops.cc:240 : Invalid argument: Key: input_ids. Can't parse serialized Example.
2022-07-31 15:33:31.797213: W tensorflow/core/framework/op_kernel.cc:1502] OP_REQUIRES failed at example_parsing_ops.cc:240 : Invalid argument: Key: masked_lm_positions. Can't parse serialized Example.
2022-07-31 15:33:31.797291: W tensorflow/core/framework/op_kernel.cc:1502] OP_REQUIRES failed at example_parsing_ops.cc:240 : Invalid argument: Key: input_ids. Can't parse serialized Example.
ERROR:tensorflow:Error recorded from training_loop: 2 root error(s) found.
(0) Invalid argument: Key: input_ids. Can't parse serialized Example.
[[{{node ParseSingleExample/ParseSingleExample}}]]
[[IteratorGetNext]]
(1) Invalid argument: Key: input_ids. Can't parse serialized Example.
[[{{node ParseSingleExample/ParseSingleExample}}]]
[[IteratorGetNext]]
[[IteratorGetNext/_4973]]
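
A "Can't parse serialized Example" error on fixed-length features such as input_ids usually means the lengths stored in the TFRecords do not match the max_seq_length / max_predictions_per_seq values the training script expects. A quick way to check is to inspect one record (TF2 eager-style sketch; the TFRecord path is a placeholder):

import tensorflow as tf

dataset = tf.data.TFRecordDataset("output/tf_examples.tfrecord")  # placeholder path
for raw in dataset.take(1):
    example = tf.train.Example()
    example.ParseFromString(raw.numpy())
    for name, feature in example.features.feature.items():
        length = len(feature.int64_list.value) or len(feature.float_list.value)
        print(name, length)

If the printed lengths differ from the flags passed to the pretraining script, regenerating the pretraining data with matching values should resolve the parse errors.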

Question about mimic data preparation process

Hi Emily,

After acquiring access to the MIMIC-III database, I preprocessed the data following your procedure (i.e. format_mimic_for_BERT.py).

However, I am not confident that the results below are correct. Can you confirm?

(output after running format_mimic_for_BERT.py; not shown here)

Thanks
Young-Jun

Clinical BERT initialized from BERT-Base is not available

Thanks for making Clinical BERT publicly available. The paper states: "We train and publicly release BERT-Base and BioBERT-finetuned models trained on both all clinical notes and only discharge summaries".

However, only the BioBERT-finetuned models are available. When will you release the BERT-Base-finetuned models?

Weight initialization Error on using pretrained model in pytorch

I am getting this error

Weights of BertForMultiLable not initialized from pretrained model: ['classifier.weight', 'classifier.bias']
Weights from pretrained model not used in BertForMultiLable: ['cls.predictions.bias', 'cls.predictions.transform.dense.weight', 'cls.predictions.transform.dense.bias', 'cls.predictions.transform.LayerNorm.weight', 'cls.predictions.transform.LayerNorm.bias', 'cls.predictions.decoder.weight', 'cls.seq_relationship.weight', 'cls.seq_relationship.bias']

while using the pretrained BioBERT-initialized model provided here in this repo; here is the issue for reference.

Is there some issue with the model?
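
For what it's worth, warnings of this shape are generally expected when a pretrained encoder is loaded under a new task head: the classifier.* weights are freshly initialized (they are learned during fine-tuning), and the cls.* masked-LM/NSP head weights from pretraining are simply unused. A minimal sketch with the standard transformers class (BertForMultiLable comes from a third-party repo, so this only illustrates the general pattern; num_labels is an example value):

from transformers import BertForSequenceClassification

# The classification head is new, so "not initialized from pretrained model"
# warnings for classifier.* and "not used" warnings for cls.* are expected.
model = BertForSequenceClassification.from_pretrained(
    "emilyalsentzer/Bio_ClinicalBERT", num_labels=5)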

Known issue with section splitting in heuristic_tokenize.py

There are two bugs in the sent_tokenize_rules function in heuristic_tokenize.py.

We have not fixed them in this repo because we want to preserve the reproducibility of the code as it was when the work was published. However, anyone extending this work should make the following changes in heuristic_tokenize.py (a small demonstration of the first fix follows after the list):

  1. Fix the bug on line #168, where . should be replaced with \., i.e. it should read while re.search('\n\s*%d\.'%n,segment):
  2. Add an else statement (else: new_segments.append(segments[i])) to the if statement at line 287, if (i == N-1) or is_title(segments[i+1]):. This fixes a bug where lists that have a title header lose their first entry.
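
To make the first fix concrete, here is a small standalone demonstration (not taken from heuristic_tokenize.py) of why the unescaped dot is a bug: it matches any character after the list number, not just a literal period.

import re

n = 1
segment = "\n 1x daily dosing"  # contains no numbered list item "1."

print(bool(re.search(r'\n\s*%d.' % n, segment)))   # True: '.' also matches the 'x'
print(bool(re.search(r'\n\s*%d\.' % n, segment)))  # False: '\.' requires a literal period

The second fix simply ensures the current segment is still appended to new_segments when the if condition is not met, which is how lists with a title header were losing their first entry.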

Datasets

Hello @EmilyAlsentzer,

This is a great contribution to the open source community! I have read your paper thoroughly: https://www.aclweb.org/anthology/W19-1909.pdf

I have a few questions:

  1. I would love to try out both clinicalBERT and BioBERT on a few downstream tasks (disease identification); however, I do not have much training data (in fact, none). Could you please point me to some openly available data repositories that already provide a notes --> disease mapping?

  2. I see you used the typical BERT pretraining approach (MLM), but I would like to explore other pretraining strategies such as replaced token detection (from ELECTRA, etc.). I also see that you used the MIMIC-III dataset for pretraining, which I don't have access to. What would you suggest as pretraining datasets?

  3. I would also love to try new transformer variants (larger ones, low-parameter ones) and do multitask learning, so datasets (de-identified, without PHI) seem to be the bottleneck. How can I overcome this?

I would open-source all of my work in PyTorch if I could find a tangible data source. Please let me know. Thanks!

missing mli_train_v1.jsonl

Hi,
run_classifier.py is looking for the JSON file mli_train_v1.jsonl.
How can I obtain or construct this file?

Unable to reproduce results

I can't seem to reproduce your results on MedNLI with the two released models using the same hyperparameters presented in your paper's Appendix B. You reported 84-85%, but I can only get to 81-82% on the test set. Do you know why? Are the reported results on the dev set or the test set? If relevant, I'm using the pytorch-pretrained-bert repo.

Which labels to use with the transformers library?

Hello,

This looks like a great piece of work; thank you for making it available. I tried to explore clinicalBERT for some NER tasks using the transformers library. I can obtain a list of token-index results from torch.argmax, but I cannot find a suitable set of labels (predictions contains values as large as 687). What am I doing wrong?

from transformers import AutoTokenizer, AutoModel
import torch

tokenizer = AutoTokenizer.from_pretrained("emilyalsentzer/Bio_ClinicalBERT")

model = AutoModel.from_pretrained("emilyalsentzer/Bio_ClinicalBERT")

label_list = ["B-IDNUM", "I-IDNUM", "B-HOSPITAL", "I-HOSPITAL", 'B-PATIENT', 'I-PATIENT', 'B-PHONE', 'I-PHONE',
        'B-DATE', 'I-DATE', 'B-DOCTOR', 'I-DOCTOR', 'B-LOCATION-OTHER', 'I-LOCATION-OTHER', 'B-AGE', 'I-AGE', 'B-BIOID', 'I-BIOID',
        'B-STATE', 'I-STATE','B-ZIP', 'I-ZIP', 'B-HEALTHPLAN', 'I-HEALTHPLAN', 'B-ORGANIZATION', 'I-ORGANIZATION',
        'B-MEDICALRECORD', 'I-MEDICALRECORD', 'B-CITY', 'I-CITY', 'B-STREET', 'I-STREET', 'B-COUNTRY', 'I-COUNTRY',
        'B-URL', 'I-URL',
        'B-USERNAME', 'I-USERNAME', 'B-PROFESSION', 'I-PROFESSION', 'B-FAX', 'I-FAX', 'B-EMAIL', 'I-EMAIL', 'B-DEVICE', 'I-DEVICE',
        'O', "X", "[CLS]", "[SEP]"]

sequence = "Patient had severe headache and took two Aspirine."

# Bit of a hack to get the tokens with the special tokens
tokens = tokenizer.tokenize(tokenizer.decode(tokenizer.encode(sequence)))
inputs = tokenizer.encode(sequence, return_tensors="pt")

outputs = model(inputs)[0]
predictions = torch.argmax(outputs, dim=2)

print([(token, label_list[prediction]) for token, prediction in zip(tokens, predictions[0].tolist())])

Lars
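
For reference (not an official answer): the snippet above calls AutoModel, which returns 768-dimensional hidden states rather than label logits, so torch.argmax over the last dimension yields indices up to 767 (hence values like 687). A token-classification head with your own label set would be needed, along the lines of this sketch (label_list as defined in the snippet above); the head is randomly initialized and must be fine-tuned on labeled NER data before its predictions are meaningful:

from transformers import AutoTokenizer, AutoModelForTokenClassification

id2label = {i: l for i, l in enumerate(label_list)}
label2id = {l: i for i, l in enumerate(label_list)}

tokenizer = AutoTokenizer.from_pretrained("emilyalsentzer/Bio_ClinicalBERT")
model = AutoModelForTokenClassification.from_pretrained(
    "emilyalsentzer/Bio_ClinicalBERT",
    num_labels=len(label_list),
    id2label=id2label,
    label2id=label2id,
)
# model now outputs one logit per label per token; fine-tune it before
# interpreting argmax over the logits as NER tags.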

Using NER for running HuggingFace Pipeline

I tried using this model with HuggingFace's transformers.pipeline to establish a baseline for doing NER on some data I have, but I ran into index errors because the id2label dictionary in the model's config currently has only 2 labels, {0: 'LABEL_0', 1: 'LABEL_1'}. Do you have a full set of labels, or should I go about getting these predictions another way?

Time of Preprocessing

Thanks for the great repo. I tested the preprocessing script; it processes about 100 notes per minute, which gives a total ETA of roughly 15 days. Do you have any ideas for speeding this up, or did you spend a similar amount of time?
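
One practical workaround (my own suggestion, not from the repo) is to shard the notes across processes, since per-note sentence splitting is CPU-bound. process_note below is a hypothetical stand-in for whatever the preprocessing script does per note, and any spaCy/scispacy model should be loaded inside each worker rather than passed between processes:

import multiprocessing as mp
import pandas as pd

def process_note(text):
    # placeholder for the real per-note cleaning / sentence-splitting logic
    return text.strip().lower()

if __name__ == "__main__":
    notes = pd.read_csv("NOTEEVENTS.csv", usecols=["TEXT"])  # MIMIC-III notes table
    with mp.Pool(processes=8) as pool:
        processed = pool.map(process_note, notes["TEXT"].tolist(), chunksize=100)
    print(len(processed), "notes processed")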

Pretrained model support for TensorFlow v2

I have converted the scripts from TF1 to TF2 and am trying to use one of your pretrained models to initialize my pretraining. It throws a "Key bert/embeddings/layer_normalization/beta not found in checkpoint" error. I understand this error is caused by one of the functions changed in TensorFlow v2, where tf.contrib.layers.layer_norm(inputs=input_tensor, begin_norm_axis=-1, begin_params_axis=-1, scope=name) is replaced by tf.keras.layers.LayerNormalization(axis=-1)(input_tensor). By the way, this is in model.py, line 364.

Without using init_checkpoint, everything works fine.

Therefore, I would like to check: did you build any model with TensorFlow v2 that uses the above LayerNormalization change?
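
For what it's worth, the original BERT code stores its layer-norm variables under names like bert/embeddings/LayerNorm/..., while tf.keras.layers.LayerNormalization creates variables under its own layer name (e.g. layer_normalization/...), so the checkpoint keys no longer line up. One way to confirm the mismatch is to list the checkpoint's variable names and compare them to the names your TF2 model builds; a short sketch (the checkpoint prefix below is an assumption, adjust it to your download):

import tensorflow as tf

ckpt = "bert_pretrain_output_all_notes_150000/model.ckpt-150000"  # example prefix
for name, shape in tf.train.list_variables(ckpt):
    if "LayerNorm" in name or "layer_norm" in name:
        print(name, shape)

If the names differ only in this way, naming the Keras layer to match the checkpoint, or restoring with a custom assignment map, are possible workarounds.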

Getting Chinese or Japanese words from the pretrained model instead of English

I am trying masked-word prediction with the pretrained Bio_ClinicalBERT, but instead of English words I am getting Chinese or Japanese words in the output.
Below is my code:

from transformers import BertTokenizer, BertForMaskedLM
import torch

bio_bert_tokenizer = BertTokenizer.from_pretrained('Bio_ClinicalBERT')
bio_bert_model = BertForMaskedLM.from_pretrained('Bio_ClinicalBERT').eval()
# encode, decode, text_sentence, top_k, top_clean are helpers defined elsewhere in my script
input_ids, mask_idx = encode(bio_bert_tokenizer, text_sentence)
with torch.no_grad():
    predict = bert_model(input_ids)[0]  # note: this calls 'bert_model', not the 'bio_bert_model' loaded above
bio_bert = decode(bio_bert_tokenizer, predict[0, mask_idx, :].topk(top_k).indices.tolist(), top_clean)
print(bio_bert)

The output is shown in the attached screenshot (Screenshot from 2021-04-02 14-43-26; not reproduced here).
I downloaded all of the model files from https://huggingface.co/emilyalsentzer/Bio_ClinicalBERT/tree/main
I don't know whether I am making a naive mistake; please excuse me, as I am new to the transformers library.

unable to install requirements.txt

Would you have an updated requirements.txt? Most of the modules are not found:

Solving environment: failed with initial frozen solve. Retrying with flexible solve.

PackagesNotFoundError: The following packages are not available from current channels:

  • scispacy==0.1.0=pypi_0
  • psutil==5.5.0=pypi_0
  • awscli==1.16.111=pypi_0
  • jupyter-highlight-selected-word==0.2.0=pypi_0
  • astor==0.7.1=pypi_0
  • grpcio==1.18.0=pypi_0
  • s3transfer==0.1.13=pypi_0
  • boto3==1.9.86=pypi_0
  • statistics==1.0.3.5=pypi_0
  • jupyter-contrib-core==0.3.3=pypi_0
  • jupyter-nbextensions-configurator==0.4.1=pypi_0
  • spacy==2.0.18=pypi_0
  • tensorflow-gpu==1.12.0=pypi_0
  • docutils==0.14=pypi_0
  • markdown==3.0.1=pypi_0
  • jmespath==0.9.3=pypi_0
  • tensorboard==1.12.2=pypi_0
  • lxml==4.3.0=pypi_0
  • keras-applications==1.0.7=pypi_0
  • colorama==0.3.9=pypi_0
  • jupyter-latex-envs==1.4.6=pypi_0
  • gast==0.2.2=pypi_0
  • werkzeug==0.14.1=pypi_0
  • pyyaml==3.13=pypi_0
  • stanfordcorenlp==3.9.1.1=pypi_0
  • rsa==3.4.2=pypi_0
  • keras-preprocessing==1.0.9=pypi_0
  • termcolor==1.1.0=pypi_0
  • pytorch-pretrained-bert==0.4.0=pypi_0
  • en-core-sci-md==0.1.0=pypi_0
  • en-core-web-sm==2.0.0=pypi_0
  • stanfordnlp==0.1.0=pypi_0
  • protobuf==3.6.1=pypi_0
  • en-core-sci-sm==0.1.0=pypi_0
  • pyasn1==0.4.5=pypi_0
  • conllu==1.2.2=pypi_0
  • scipy==1.2.1=pypi_0
  • torch==1.0.0=pypi_0
  • absl-py==0.7.0=pypi_0
  • regex==2018.1.10=pypi_0
  • botocore==1.12.101=pypi_0
  • h5py==2.9.0=pypi_0

Getting Started using transformers

When I try to get started with the model "emilyalsentzer/Bio_ClinicalBERT" using the model card at Hugging Face and the following code:

from transformers import AutoTokenizer, AutoModel
tokenizer = AutoTokenizer.from_pretrained("emilyalsentzer/Bio_ClinicalBERT")
model = AutoModel.from_pretrained("emilyalsentzer/Bio_ClinicalBERT")

I get the following error:

Traceback (most recent call last):
  File "test.py", line 4, in <module>
    tokenizer = AutoTokenizer.from_pretrained("emilyalsentzer/Bio_ClinicalBERT")
  File "/Users/Lukas/miniconda3/envs/nlp/lib/python3.7/site-packages/transformers/tokenization_auto.py", line 124, in from_pretrained
    "'xlm', 'roberta', 'ctrl'".format(pretrained_model_name_or_path))
ValueError: Unrecognized model identifier in emilyalsentzer/Bio_ClinicalBERT. Should contains one of 'bert', 'openai-gpt', 'gpt2', 'transfo-xl', 'xlnet', 'xlm', 'roberta', 'ctrl'

I would appreciate any help regarding this.

Clarification on Tokenizer for MedNLI

For MedNLI, it seems as though you used tokenizer = BertTokenizer.from_pretrained(args.bert_model, do_lower_case=args.do_lower_case). Is it correct to say that the BERT tokenizer you used for MedNLI is bert-base-cased, as opposed to scispacy? If so, what is the thinking behind this?

Using pretrained clinicalBert model for extracting word/sentence or whole clinical note representation

Hi @EmilyAlsentzer,

I tried to extract features as you suggested but ran into a problem. When I run the original BERT example below, everything works fine.

echo 'Who was Jim Henson ? ||| Jim Henson was a puppeteer' > /tmp/input.txt

python extract_features.py \
  --input_file=/tmp/input.txt \
  --output_file=/tmp/output.jsonl \
  --vocab_file=$BERT_BASE_DIR/vocab.txt \
  --bert_config_file=$BERT_BASE_DIR/bert_config.json \
  --init_checkpoint=$BERT_BASE_DIR/bert_model.ckpt \
  --layers=-1,-2,-3,-4 \
  --max_seq_length=128 \
  --batch_size=8

I then changed the bert_config_file and init_checkpoint arguments and ran the command below.

python extract_features.py \
  --input_file=/tmp/input.txt \
  --output_file=/tmp/output.jsonl \
  --vocab_file=bert_pretrain_output_all_notes_150000/vocab.txt \
  --bert_config_file=bert_pretrain_output_all_notes_150000/bert_config.json \
  --init_checkpoint=bert_pretrain_output_all_notes_150000/model.ckpt \
  --layers=-1,-2,-3,-4 \
  --max_seq_length=128

I got the error message below. I think the problem is with the init_checkpoint argument; I tried different names such as "model.ckpt" and "model.ckpt-150000", but none of them work.

tensorflow.python.framework.errors_impl.NotFoundError: Unsuccessful TensorSliceReader constructor: Failed to find any matching files for bert_pretrain_output_all_notes_150000/model.ckpt

Could you please help me run ClinicalBERT to extract features from clinical notes?
Also, is it possible to use ClinicalBERT to extract embeddings of each word in a clinical note?
Thanks in advance.
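
As an aside, if the TensorFlow checkpoint path keeps failing, the Hugging Face release offers a simpler route to per-token and per-note embeddings in modern setups. A rough sketch (mean pooling is just one naive choice of note representation):

from transformers import AutoTokenizer, AutoModel
import torch

tokenizer = AutoTokenizer.from_pretrained("emilyalsentzer/Bio_ClinicalBERT")
model = AutoModel.from_pretrained("emilyalsentzer/Bio_ClinicalBERT")
model.eval()

note = "Patient was admitted with shortness of breath."
inputs = tokenizer(note, return_tensors="pt", truncation=True, max_length=128)
with torch.no_grad():
    outputs = model(**inputs)

token_embeddings = outputs.last_hidden_state[0]   # one 768-d vector per WordPiece token
note_embedding = token_embeddings.mean(dim=0)     # naive mean-pooled note representation
print(token_embeddings.shape, note_embedding.shape)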

Consequences when using sequence input larger than 128

Hi,

First of all, thank you for your great work!

I am trying to fine-tune emilyalsentzer/Bio_Discharge_Summary_BERT on a downstream multi-label classification (MLC) task. As far as I understand, you initialized from BioBERT (which supports a maximum sequence length of 512) and trained on data with a maximum sequence length of 128.

Less than 5% of my tokenized data has a length between 128 and 512. Truncation is not an option in my application, and because my dataset is imbalanced I don't want to filter out sequences longer than 128.

My question is: could fine-tuning with this data cause any issues down the line in terms of model performance?

Sorry, it is a bit of a basic question, but I can't seem to find a concrete answer about the impact of choosing a larger max_seq_len than what you trained with.

Thank you again! :)
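
For what it's worth, the 128 appears to refer to the maximum sequence length used during the MIMIC pretraining runs, while the position-embedding table inherited from BERT/BioBERT still has 512 entries, so fine-tuning with sequences up to 512 is mechanically possible (positions beyond 128 simply saw less clinical pretraining signal). One way to confirm the model's hard limit:

from transformers import AutoConfig

config = AutoConfig.from_pretrained("emilyalsentzer/Bio_Discharge_Summary_BERT")
print(config.max_position_embeddings)  # 512 for BERT-base-style models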

Vocabulary for the pretrained model is not updated? Any reason why?

Thanks for making such a comprehensive BERT model available.

I am concerned about the actual words I find in the model's vocabulary, though.
The model card states: "The Bio_ClinicalBERT model was trained on all notes from MIMIC III, a database containing electronic health records from ICU patients at the Beth Israel Hospital in Boston, MA. For more details on MIMIC". I assumed this would mean that the vocabulary was also updated.

But when I inspect the vocabulary, I don't see medical concepts:

from transformers import TFBertModel,  BertConfig, BertTokenizerFast
# Load pre-trained model tokenizer (vocabulary)
tokenizer = BertTokenizerFast.from_pretrained('emilyalsentzer/Bio_ClinicalBERT')
tokenizer.vocab.keys()

['Cafe', 'locomotive', 'sob', 'Emilio', 'Amazing', '##ired', 'Lai', 'NSA', 'counts', '##nius', 'assumes', 'talked', 'ク', 'rumor', 'Lund', 'Right', 'Pleasant', 'Aquino', 'Synod', 'scroll', '##cope', 'guitarist', 'AB', '##phere', 'resulted', 'relocation', 'ṣ', 'electors', '##tinuum', 'shuddered', 'Josephine', '"', 'nineteenth', 'hydroelectric', '##genic', '68', '1000', 'offensive', 'Activities', '##ito', 'excluded', '************', 'protruding', '1832', 'perpetual', 'cu', '##36', 'outlet', 'elaborate', '##aft', 'yesterday', '##ope', 'rockets', 'Eduard', 'straining', '510', 'passion', 'Too', 'conferred', 'geography', '38', 'Got', 'snail', 'cellular', '##cation', 'blinked', 'transmitted', 'Pasadena', 'escort', 'bombings', 'Philips', '##cky', 'sacks', '##Ñ', 'jumps', 'Advertising', 'Officer', '##ulp', 'potatoes', 'concentration', 'existed', '##rrigan', '##ier', 'Far', 'models', 'strengthen', 'mechanics'...]

Am I missing something here?

Also, is there an uncased version of this model?

NER on i2b2 datasets

Hi,
thanks for your release; it's great work. But I still have a question.

In the paper, you mention that 'Clinical BERT and Clinical BioBERT were applied to four i2b2 NER tasks, all in IOB format'. I want to reproduce this work, but this repo does not include the NER code in the 'downstream_tasks' directory. Can you share the code?
Also, can you tell me how to convert the four i2b2 datasets into BIO format?
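
For reference, converting span annotations to IOB/BIO tags is usually a small script once the i2b2 annotation files have been parsed into character offsets. The (start, end, label) span format below is a hypothetical intermediate representation, not the raw i2b2 format, which must be parsed first:

def spans_to_iob(text, spans):
    # spans: list of (start_char, end_char, label) tuples (hypothetical format)
    tokens, tags, pos = [], [], 0
    for tok in text.split():
        start = text.index(tok, pos)
        end = start + len(tok)
        pos = end
        tag = "O"
        for (s, e, label) in spans:
            if start >= s and end <= e:
                tag = ("B-" if start == s else "I-") + label
                break
        tokens.append(tok)
        tags.append(tag)
    return list(zip(tokens, tags))

print(spans_to_iob("Patient denies chest pain today", [(15, 25, "problem")]))
# [('Patient', 'O'), ('denies', 'O'), ('chest', 'B-problem'), ('pain', 'I-problem'), ('today', 'O')]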

Question: Better to use regular BioBERT on a dataset without marked PHI?

Thanks for this cool resource. I'm just trying to figure out if it's the best model for my project. In the results section of your paper, it says:

De-ID challenge data presents a different data distribution than MIMIC text. In MIMIC, PHI is identified and replaced with sentinel PHI markers, whereas in the de-ID task, PHI is masked with synthetic, but realistic PHI. This data drift would be problematic for any embedding model, but will be especially damaging to contextual embedding models like BERT because the underlying sentence structure will have changed: in raw MIMIC, sentences with PHI will universally have a sentinel PHI token. In contrast, in the de-ID corpus, all such sentences will have different synthetic masks, meaning that a canonical, nearly constant sentence structure present during BERT’s training will be non-existent at task-time. For these reasons, we think it is sensible that clinical BERT is not successful on the de-ID corpora.

I'm working with EHRs for patients with multiple myeloma. The records are not de-identified in any way--they're just regular doctors' notes, lab reports, etc., with real place names, person names, and dates. So it sounds to me like my data is more like the de-ID dataset than the MIMIC dataset, since PHI isn't tagged in any way. Would I be better off just using the regular BioBERT model, since that model performed better on the de-ID dataset?

Installing dependencies

I am trying to install the dependencies by running
conda create --name <env> --file requirements.txt
However, since I don't have many of the required channels, my conda cannot install all of them. Moreover, my conda also cannot find the pip packages (pypi_0) from my current channels, even though I have pip. Could you please provide the .yml file of your environment, which contains the full description of the environment, including the conda channels?
Thanks!

How to run model and finetune it

Hi guys,
I need to know how to load and run the clinicalBERT model. clinicalBERT exactly matches my requirements; I searched online but couldn't find any useful resources. Can you please help me with this model? Thanks in advance.
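
As a starting point (not official code from this repo), here is a minimal fine-tuning sketch using the Hugging Face transformers and PyTorch APIs; the texts, labels, and hyperparameters are toy placeholders:

from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch

tokenizer = AutoTokenizer.from_pretrained("emilyalsentzer/Bio_ClinicalBERT")
model = AutoModelForSequenceClassification.from_pretrained(
    "emilyalsentzer/Bio_ClinicalBERT", num_labels=2)

# Toy labeled examples; replace with your own clinical task data.
texts = ["no acute distress", "severe chest pain radiating to left arm"]
labels = torch.tensor([0, 1])

batch = tokenizer(texts, padding=True, truncation=True, max_length=128, return_tensors="pt")
optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)

model.train()
for _ in range(3):                       # a few toy epochs
    out = model(**batch, labels=labels)  # loss is computed internally
    out.loss.backward()
    optimizer.step()
    optimizer.zero_grad()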

Using the fine tuned models

Hi! I was able to use your clinical BERT models and slightly modified versions of your finetuning code to create a new NER model--worked like a charm :-) However, I'm having trouble loading the saved model and using it to make new predictions. To be more specific, if I set the train and eval flags to false and then run only predict, the model does appear to be able to make predictions on new input, but it is still attempting to split the input data into cross validation folds. I was wondering if you have some code and/or suggestions to avoid this? If not, I can figure it out, but I thought I'd check with you first to make sure I'm not missing anything. Thanks for your time, and thank you for sharing your fantastic work!

Model size & casing

Thanks for the release! Is this based on BERT base or BERT large? Also, is it the cased model or the uncased one?

Need python packages in run_ner.py

Dear @EmilyAlsentzer
Your clinicalBERT is great work, and I want to reimplement it.
run_ner.py requires Python modules such as 'modeling', 'optimization', and 'tokenization'.
Can you share these modules?
Thank you

About the pretrained model

Hi:
First of all, thanks for sharing your pretrained model : )
I'm a WPI Data Science Master's student doing an NLP internship at UMass Medical School. Your pretrained model should be very helpful for me 👍

I have a question.
After downloading the pretrained model, I found a set of TensorFlow model files (1.2 GB) and a PyTorch model file (400 MB).
Are they the same model?

Multi-label classification of clinical text

I'm trying to do multi-label classification (MLC) using the pretrained weights (trained on all notes, as in this paper). The data is imbalanced, i.e. some classes occur much more frequently than others. After applying the ML-ROS oversampling technique, the mean IRLbl decreased, but the data is still imbalanced, so the model predicts the most frequently occurring labels every time (for any random input). Do you have any suggestions?
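
One common complement to oversampling (my suggestion, not from the paper) is to re-weight the positive class per label in the loss so that rare labels contribute more. A minimal PyTorch sketch with a hypothetical multi-hot label matrix:

import torch

# Hypothetical multi-hot label matrix: rows = training examples, columns = labels.
train_labels = torch.randint(0, 2, (1000, 10)).float()

pos_counts = train_labels.sum(dim=0)                # positives per label
neg_counts = train_labels.shape[0] - pos_counts     # negatives per label
pos_weight = neg_counts / pos_counts.clamp(min=1)   # up-weight rare labels

criterion = torch.nn.BCEWithLogitsLoss(pos_weight=pos_weight)
# In the training loop: loss = criterion(logits, multi_hot_targets)

Tuning a per-label decision threshold (rather than a single 0.5 cutoff) is another knob worth trying.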

model_max_len parameter in tokenizer

Hello,

the tokenizer has model_max_len=1000000000000000019884624838656:
tokenizer = AutoTokenizer.from_pretrained("emilyalsentzer/Bio_ClinicalBERT")

PreTrainedTokenizerFast(name_or_path='emilyalsentzer/Bio_ClinicalBERT', vocab_size=28996, model_max_len=1000000000000000019884624838656, is_fast=True, padding_side='right', special_tokens={'unk_token': '[UNK]', 'sep_token': '[SEP]', 'pad_token': '[PAD]', 'cls_token': '[CLS]', 'mask_token': '[MASK]'})

However, the model card at https://huggingface.co/emilyalsentzer/Bio_ClinicalBERT mentions that the maximum sequence length is 128. Could you please clarify this?

Thanks!

Model not able to initialize weights

Dear all,

I am getting an initialization error when loading the model with BertForTokenClassification (see the attached screenshot, not reproduced here). Is there any way to resolve this error?

Thank You.
