uf-hobi-informatics-lab / ClinicalTransformerNER

A library for named entity recognition developed by the UF HOBI NLP lab, featuring SOTA algorithms.
License: MIT License
See the TODO in data_utils.py: consider generating labels/masks on the fly via the DataLoader's collate_fn (see the sketch below).
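A minimal sketch of the idea with a PyTorch DataLoader; the (token_ids, label_ids) example format and the -100 padding label are assumptions for illustration, not the package's actual data format:

import torch
from torch.utils.data import DataLoader

def collate_fn(batch, pad_token_id=0, pad_label_id=-100):
    # batch: list of (token_ids, label_ids) pairs of varying length (assumed format)
    max_len = max(len(ids) for ids, _ in batch)
    input_ids, labels, masks = [], [], []
    for ids, lbs in batch:
        pad = max_len - len(ids)
        input_ids.append(ids + [pad_token_id] * pad)
        labels.append(lbs + [pad_label_id] * pad)  # labels padded on the fly
        masks.append([1] * len(ids) + [0] * pad)   # attention mask built on the fly
    return torch.tensor(input_ids), torch.tensor(masks), torch.tensor(labels)

# usage: loader = DataLoader(dataset, batch_size=8, collate_fn=collate_fn)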
Per huggingface/transformers#6882, upgrading to transformers v3.1.0 may break model loading.
A temporary patch is:
model.load_state_dict(state_dict, strict=False)
We will keep following this issue to see whether we need to update our code accordingly.
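In context, the patch sits right after deserializing the checkpoint; this sketch assumes the model object has already been constructed:

import torch

# strict=False skips missing/unexpected keys introduced by the transformers v3.1.0 upgrade
state_dict = torch.load("pytorch_model.bin", map_location="cpu")
model.load_state_dict(state_dict, strict=False)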
We only have MIMIC models based on transformer-base architectures; we need to pretrain the large versions as well.
Hello, thanks for creating this library. I am trying to reproduce the results for BERT on i2b2 2010, i2b2 2012, and n2c2 2018. However, I am having trouble converting these datasets into the CoNLL-2003 txt format shown in test_data. I assume the preprocessing scripts differ for each dataset, because i2b2 2010 (txt, con) and 2012 (txt, extent, tlink) use different file extensions.
Is it possible to release the preprocessing scripts for easier reproducibility?
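For reference, a minimal sketch of the kind of conversion involved for i2b2 2010, assuming the standard .con annotation format (c="concept text" line:token line:token||t="type", with 1-based lines, 0-based tokens, whitespace tokenization, and concepts that do not span lines); this is not the lab's actual preprocessing script:

import re
from collections import defaultdict

CON_RE = re.compile(r'c="(.*?)" (\d+):(\d+) (\d+):(\d+)\|\|t="(.*?)"')

def txt_con_to_bio(txt_path, con_path):
    lines = [l.split() for l in open(txt_path)]   # whitespace tokenization (assumed)
    tags = defaultdict(dict)                      # line index -> {token index: BIO tag}
    for raw in open(con_path):
        m = CON_RE.match(raw.strip())
        if not m:
            continue
        _, ls, ts, _, te, ctype = m.groups()
        line, start, end = int(ls) - 1, int(ts), int(te)
        for tok in range(start, end + 1):
            tags[line][tok] = ("B-" if tok == start else "I-") + ctype
    for i, toks in enumerate(lines):
        for j, tok in enumerate(toks):
            yield tok, tags[i].get(j, "O")
        yield "", ""                              # blank line between sentences (CoNLL style)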
Follow up on huggingface/transformers#10911 to include MegatronBERT for NER (the PR has not been merged yet).
Add a Python UI to wrap the app.
With the tokenizers release by HuggingFace (https://huggingface.co/docs/tokenizers/python/latest/), the Rust-based tokenizer is faster than the current one. We need to update the old tokenizer to the newest one.
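The switch could be as small as requesting the fast (Rust-backed) variant through the standard transformers API; a small sketch, with the offset mapping shown as one feature only the fast tokenizers provide:

from transformers import AutoTokenizer

# use_fast=True selects the Rust-based tokenizer from the tokenizers library
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased", use_fast=True)

enc = tokenizer("Patient denies chest pain.", return_offsets_mapping=True)
print(enc["input_ids"])
print(enc["offset_mapping"])  # character offsets, only available with fast tokenizers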
Hi, thank you for releasing this excellent resource. I'm wondering whether you have released BERT-Large (MIMIC)? The model here only has 12 layers, so it must be BERT-base? Am I missing something?
https://arxiv.org/pdf/2112.10070.pdf
https://github.com/ljynlp/w2ner
We need to integrate this into the package as another NER pipeline. W2NER can handle various annotation types and has achieved SOTA results on various NER tasks.
Since PyTorch 1.6.0, the native torch.cuda.amp package has been available for fp16 training. We will update the code to use PyTorch AMP instead of Apex when possible.
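A sketch of what the replacement could look like in the training loop; model, optimizer, and loader are placeholders for the package's existing objects:

import torch

scaler = torch.cuda.amp.GradScaler()

for batch in loader:
    optimizer.zero_grad()
    with torch.cuda.amp.autocast():    # forward pass runs in mixed precision
        loss = model(**batch).loss
    scaler.scale(loss).backward()      # scale the loss to avoid fp16 gradient underflow
    scaler.step(optimizer)             # unscales gradients, then calls optimizer.step()
    scaler.update()                    # adjusts the loss scale for the next iteration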
Hi,
Trying to run a batch prediction as such:
python ./src/run_transformer_batch_prediction.py \
--model_type bert \
--pretrained_model models/mimiciii_bert_10e_128b/ \
--raw_text_dir ./raw-mimic/ \
--preprocessed_text_dir ./iob-mimic/ \
--output_dir ./prediction-results \
--max_seq_length 512 \
--do_lower_case \
--eval_batch_size 8 \
--log_file ./log.txt \
--do_format 0 \
--do_copy
Running into this error:
Traceback (most recent call last):
File "./src/run_transformer_batch_prediction.py", line 123, in <module>
main(global_args)
File "./src/run_transformer_batch_prediction.py", line 31, in main
label2idx = json_load(os.path.join(args.pretrained_model, "label2idx.json"))
File "/home/ubuntu/mimic2iob/ClinicalTransformerNER/src/common_utils/common_io.py", line 32, in json_load
with open(ifn, "r") as f:
FileNotFoundError: [Errno 2] No such file or directory: 'models/mimiciii_bert_10e_128b/label2idx.json'
I've downloaded the pre-trained BERT base + MIMIC model from here:
https://transformer-models.s3.amazonaws.com/mimiciii_bert_10e_128b.zip
I don't see label2idx.json present after extracting the archive:
$ ls -ltr models/mimiciii_bert_10e_128b/
total 430396
-rw-r--r-- 1 ubuntu ubuntu 231508 Dec 11 2019 vocab.txt
-rw-r--r-- 1 ubuntu ubuntu 170 Dec 11 2019 tokenizer_config.json
-rw-r--r-- 1 ubuntu ubuntu 112 Dec 11 2019 special_tokens_map.json
-rw-r--r-- 1 ubuntu ubuntu 2 Dec 11 2019 added_tokens.json
-rw-r--r-- 1 ubuntu ubuntu 440470760 Dec 11 2019 pytorch_model.bin
-rw-r--r-- 1 ubuntu ubuntu 566 Dec 11 2019 config.json
Any help would be much appreciated. Thanks for your project!
Since v2.11.0, transformers has altered several API names, which can break the current package. We will work on this issue to make the package more compatible with various versions of transformers.
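One common way to bridge renamed APIs is a try/except import shim. The scheduler rename below is a real example from the transformers 2.x line, though the exact set of renames affecting this package may differ:

try:
    # newer transformers releases
    from transformers import get_linear_schedule_with_warmup
except ImportError:
    # older releases exposed the same scheduler under a different name
    from transformers import WarmupLinearSchedule as _WarmupLinearSchedule

    def get_linear_schedule_with_warmup(optimizer, num_warmup_steps, num_training_steps):
        return _WarmupLinearSchedule(optimizer, warmup_steps=num_warmup_steps,
                                     t_total=num_training_steps)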
python src/run_transformer_ner.py \
--model_type xlnet \
--pretrained_model xlnet-base-cased \
--data_dir ./test_data/conll-2003 \
--new_model_dir ./new_bert_ner_model \
--overwrite_model_dir \
--predict_output_file ./bert_pred.txt \
--max_seq_length 256 \
--save_model_core \
--do_train \
--do_predict \
--model_selection_scoring strict-f_score-1 \
--do_lower_case \
--train_batch_size 8 \
--eval_batch_size 8 \
--train_steps 500 \
--learning_rate 1e-5 \
--num_train_epochs 1 \
--gradient_accumulation_steps 1 \
--do_warmup \
--seed 13 \
--warmup_ratio 0.1 \
--max_num_checkpoints 3 \
--log_file ./log.txt \
--progress_bar \
--early_stop 3
Traceback (most recent call last):
File "/data/datasets/yonghui/project/ClinicalTransformerNER/src/run_transformer_ner.py", line 169, in main
run_task(global_args)
File "/data/datasets/yonghui/project/ClinicalTransformerNER/src/transformer_ner/task.py", line 604, in run_task
model = model_model.from_pretrained(args.pretrained_model, config=config)
File "/home/yonghui.wu/.pyenv/versions/anaconda3-2021.11/lib/python3.9/site-packages/transformers/modeling_utils.py", line 2024, in from_pretrained
model = cls(config, *model_args, **model_kwargs)
File "/data/datasets/yonghui/project/ClinicalTransformerNER/src/transformer_ner/model.py", line 308, in init
if config.use_biaffine:
File "/home/yonghui.wu/.pyenv/versions/anaconda3-2021.11/lib/python3.9/site-packages/transformers/configuration_utils.py", line 253, in getattribute
return super().getattribute(key)
AttributeError: 'XLNetConfig' object has no attribute 'use_biaffine'
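The error suggests model.py reads config.use_biaffine unconditionally, while stock configs such as XLNetConfig never define that attribute. A minimal defensive fix (a sketch, not necessarily how the maintainers resolved it) is to read optional flags with a default:

# in model.py, instead of: if config.use_biaffine:
if getattr(config, "use_biaffine", False):   # falls back to False when the flag is absent
    ...  # build the biaffine layer only when explicitly enabled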
Add support for multi-GPU training and DeepSpeed.
Thanks again for providing this repository and actively maintaining it. Do you have performance numbers for XLNet and Longformer on the 2010 i2b2, 2012 i2b2, and/or 2018 n2c2 test sets readily available and shareable?
Add DeBERTa support alongside BERT.
If a model name is biobert-large-cased-v1.1, the formatted output name becomes biobert-large-cased-v1 (the trailing '.1' is stripped as if it were a file extension).
We need to either add a warning or replace "." with another character inside the prediction script.
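The truncation is consistent with an extension-stripping call such as pathlib's Path.stem (or os.path.splitext), which treats the trailing ".1" as a file extension; a sketch of the symptom and the suggested character replacement:

from pathlib import Path

name = "biobert-large-cased-v1.1"
print(Path(name).stem)        # 'biobert-large-cased-v1' -- '.1' stripped as an extension

# suggested workaround: replace '.' before building the output name
safe = name.replace(".", "_")
print(Path(safe).stem)        # 'biobert-large-cased-v1_1'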
Currently, we do not have a "resume training from where it left off" function; every training run starts from a new model (at least a new linear classification layer).
We need to implement a continued-training function to support use cases such as training for more epochs on the same data, or training on new data with exactly the same labels (no new labels allowed).
This will improve both inference and training efficiency.
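A sketch of what continued training could look like, assuming the previously trained model directory is reloaded whole (classifier head included) so that no layers are re-initialized; the function and argument names here are hypothetical:

from transformers import AutoModelForTokenClassification

def continue_training(prev_model_dir, train_fn, **train_kwargs):
    # reload the fine-tuned model with its classification layer intact;
    # valid only when the new data uses exactly the same label set
    model = AutoModelForTokenClassification.from_pretrained(prev_model_dir)
    return train_fn(model, **train_kwargs)   # hand off to the existing training loop (hypothetical)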