uf-hobi-informatics-lab / ClinicalTransformerNER

A library for named entity recognition developed by the UF HOBI NLP lab, featuring SOTA algorithms.
License: MIT License
See the TODO in data_utils.py: consider generating labels/masks on the fly via the DataLoader's collate_fn (see the sketch below).
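A minimal sketch of the idea with a PyTorch DataLoader; the (token_ids, label_ids) example format and the -100 padding label are assumptions for illustration, not the package's actual data format:

import torch
from torch.utils.data import DataLoader

def collate_fn(batch, pad_token_id=0, pad_label_id=-100):
    # batch: list of (token_ids, label_ids) pairs of varying length (assumed format)
    max_len = max(len(ids) for ids, _ in batch)
    input_ids, labels, masks = [], [], []
    for ids, lbs in batch:
        pad = max_len - len(ids)
        input_ids.append(ids + [pad_token_id] * pad)
        labels.append(lbs + [pad_label_id] * pad)  # labels padded on the fly
        masks.append([1] * len(ids) + [0] * pad)   # attention mask built on the fly
    return torch.tensor(input_ids), torch.tensor(masks), torch.tensor(labels)

# usage: loader = DataLoader(dataset, batch_size=8, collate_fn=collate_fn)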
Per huggingface/transformers#6882, upgrading to transformers v3.1.0 may break model loading.
A temporary patch is:
model.load_state_dict(state_dict, strict=False)
We will keep following this issue to see whether we need to update our code accordingly.
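In context, the patch sits right after deserializing the checkpoint; this sketch assumes the model object has already been constructed:

import torch

# strict=False skips missing/unexpected keys introduced by the transformers v3.1.0 upgrade
state_dict = torch.load("pytorch_model.bin", map_location="cpu")
model.load_state_dict(state_dict, strict=False)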
We only have MIMIC models based on transformer-base architectures; we need to pretrain the large versions as well.
Hello, thanks for creating this library. I am trying to reproduce the results for BERT on i2b2 2010, i2b2 2012, and n2c2 2018. However, I am having trouble converting these datasets into the CoNLL-2003 txt format shown in test_data. I assume the preprocessing scripts differ for each dataset, because i2b2 2010 (txt, con) and 2012 (txt, extent, tlink) use different file extensions.
Is it possible to release the preprocessing scripts for easier reproducibility?
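For reference, a minimal sketch of the kind of conversion involved for i2b2 2010, assuming the standard .con annotation format (c="concept text" line:token line:token||t="type", with 1-based lines, 0-based tokens, whitespace tokenization, and concepts that do not span lines); this is not the lab's actual preprocessing script:

import re
from collections import defaultdict

CON_RE = re.compile(r'c="(.*?)" (\d+):(\d+) (\d+):(\d+)\|\|t="(.*?)"')

def txt_con_to_bio(txt_path, con_path):
    lines = [l.split() for l in open(txt_path)]   # whitespace tokenization (assumed)
    tags = defaultdict(dict)                      # line index -> {token index: BIO tag}
    for raw in open(con_path):
        m = CON_RE.match(raw.strip())
        if not m:
            continue
        _, ls, ts, _, te, ctype = m.groups()
        line, start, end = int(ls) - 1, int(ts), int(te)
        for tok in range(start, end + 1):
            tags[line][tok] = ("B-" if tok == start else "I-") + ctype
    for i, toks in enumerate(lines):
        for j, tok in enumerate(toks):
            yield tok, tags[i].get(j, "O")
        yield "", ""                              # blank line between sentences (CoNLL style)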
Follow up on huggingface/transformers#10911 to include MegatronBERT for NER (the PR has not been merged yet).
Add a Python UI to wrap the app.
With the tokenizers release by HuggingFace (https://huggingface.co/docs/tokenizers/python/latest/), the Rust-based tokenizer is faster than the current one. We need to update the old tokenizer to the newest one.
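The switch could be as small as requesting the fast (Rust-backed) variant through the standard transformers API; a small sketch, with the offset mapping shown as one feature only the fast tokenizers provide:

from transformers import AutoTokenizer

# use_fast=True selects the Rust-based tokenizer from the tokenizers library
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased", use_fast=True)

enc = tokenizer("Patient denies chest pain.", return_offsets_mapping=True)
print(enc["input_ids"])
print(enc["offset_mapping"])  # character offsets, only available with fast tokenizers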
Hi, thank you for releasing this excellent resource. I'm wondering whether you have released BERT-Large (MIMIC)? The model here only has 12 layers, so it must be BERT-base? Am I missing something?
https://arxiv.org/pdf/2112.10070.pdf
https://github.com/ljynlp/w2ner
We need to integrate this into the package as another NER pipeline. W2NER can handle various annotation types and has achieved SOTA results on various NER tasks.
Since PyTorch 1.6.0, the native torch.cuda.amp package has been available for fp16 training. We will update the code to use PyTorch AMP instead of Apex when possible.
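A sketch of what the replacement could look like in the training loop; model, optimizer, and loader are placeholders for the package's existing objects:

import torch

scaler = torch.cuda.amp.GradScaler()

for batch in loader:
    optimizer.zero_grad()
    with torch.cuda.amp.autocast():    # forward pass runs in mixed precision
        loss = model(**batch).loss
    scaler.scale(loss).backward()      # scale the loss to avoid fp16 gradient underflow
    scaler.step(optimizer)             # unscales gradients, then calls optimizer.step()
    scaler.update()                    # adjusts the loss scale for the next iteration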
Hi,
Trying to run a batch prediction as such:
python ./src/run_transformer_batch_prediction.py \
--model_type bert \
--pretrained_model models/mimiciii_bert_10e_128b/ \
--raw_text_dir ./raw-mimic/ \
--preprocessed_text_dir ./iob-mimic/ \
--output_dir ./prediction-results \
--max_seq_length 512 \
--do_lower_case \
--eval_batch_size 8 \
--log_file ./log.txt \
--do_format 0 \
--do_copy
Running into this error:
Traceback (most recent call last):
File "./src/run_transformer_batch_prediction.py", line 123, in <module>
main(global_args)
File "./src/run_transformer_batch_prediction.py", line 31, in main
label2idx = json_load(os.path.join(args.pretrained_model, "label2idx.json"))
File "/home/ubuntu/mimic2iob/ClinicalTransformerNER/src/common_utils/common_io.py", line 32, in json_load
with open(ifn, "r") as f:
FileNotFoundError: [Errno 2] No such file or directory: 'models/mimiciii_bert_10e_128b/label2idx.json'
I've downloaded the pre-trained BERT base + MIMIC model from here:
https://transformer-models.s3.amazonaws.com/mimiciii_bert_10e_128b.zip
I don't see label2idx.json present after extracting the archive:
$ ls -ltr models/mimiciii_bert_10e_128b/
total 430396
-rw-r--r-- 1 ubuntu ubuntu 231508 Dec 11 2019 vocab.txt
-rw-r--r-- 1 ubuntu ubuntu 170 Dec 11 2019 tokenizer_config.json
-rw-r--r-- 1 ubuntu ubuntu 112 Dec 11 2019 special_tokens_map.json
-rw-r--r-- 1 ubuntu ubuntu 2 Dec 11 2019 added_tokens.json
-rw-r--r-- 1 ubuntu ubuntu 440470760 Dec 11 2019 pytorch_model.bin
-rw-r--r-- 1 ubuntu ubuntu 566 Dec 11 2019 config.json
Any help would be much appreciated. Thanks for your project!
Since v2.11.0, transformers has altered several API names, which can break the current package. We will work on this issue to make the package more compatible with various versions of transformers.
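One common way to bridge renamed APIs is a try/except import shim. The scheduler rename below is a real example from the transformers 2.x line, though the exact set of renames affecting this package may differ:

try:
    # newer transformers releases
    from transformers import get_linear_schedule_with_warmup
except ImportError:
    # older releases exposed the same scheduler under a different name
    from transformers import WarmupLinearSchedule as _WarmupLinearSchedule

    def get_linear_schedule_with_warmup(optimizer, num_warmup_steps, num_training_steps):
        return _WarmupLinearSchedule(optimizer, warmup_steps=num_warmup_steps,
                                     t_total=num_training_steps)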
python src/run_transformer_ner.py \
--model_type xlnet \
--pretrained_model xlnet-base-cased \
--data_dir ./test_data/conll-2003 \
--new_model_dir ./new_bert_ner_model \
--overwrite_model_dir \
--predict_output_file ./bert_pred.txt \
--max_seq_length 256 \
--save_model_core \
--do_train \
--do_predict \
--model_selection_scoring strict-f_score-1 \
--do_lower_case \
--train_batch_size 8 \
--eval_batch_size 8 \
--train_steps 500 \
--learning_rate 1e-5 \
--num_train_epochs 1 \
--gradient_accumulation_steps 1 \
--do_warmup \
--seed 13 \
--warmup_ratio 0.1 \
--max_num_checkpoints 3 \
--log_file ./log.txt \
--progress_bar \
--early_stop 3
Traceback (most recent call last):
File "/data/datasets/yonghui/project/ClinicalTransformerNER/src/run_transformer_ner.py", line 169, in main
run_task(global_args)
File "/data/datasets/yonghui/project/ClinicalTransformerNER/src/transformer_ner/task.py", line 604, in run_task
model = model_model.from_pretrained(args.pretrained_model, config=config)
File "/home/yonghui.wu/.pyenv/versions/anaconda3-2021.11/lib/python3.9/site-packages/transformers/modeling_utils.py", line 2024, in from_pretrained
model = cls(config, *model_args, **model_kwargs)
File "/data/datasets/yonghui/project/ClinicalTransformerNER/src/transformer_ner/model.py", line 308, in init
if config.use_biaffine:
File "/home/yonghui.wu/.pyenv/versions/anaconda3-2021.11/lib/python3.9/site-packages/transformers/configuration_utils.py", line 253, in getattribute
return super().getattribute(key)
AttributeError: 'XLNetConfig' object has no attribute 'use_biaffine'
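The error suggests model.py reads config.use_biaffine unconditionally, while stock configs such as XLNetConfig never define that attribute. A minimal defensive fix (a sketch, not necessarily how the maintainers resolved it) is to read optional flags with a default:

# in model.py, instead of: if config.use_biaffine:
if getattr(config, "use_biaffine", False):   # falls back to False when the flag is absent
    ...  # build the biaffine layer only when explicitly enabled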
Add support for multi-GPU training and DeepSpeed.
Thanks again for providing this repository and actively maintaining it. Do you have performance numbers for XLNet and Longformer on the 2010 i2b2, 2012 i2b2, and/or 2018 n2c2 test sets readily available and shareable?
Add DeBERTa support alongside BERT.
If a model name is biobert-large-cased-v1.1, the formatted output name becomes biobert-large-cased-v1 (the trailing '.1' is stripped as if it were a file extension).
We need to either add a warning or replace "." with another character inside the prediction script.
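The truncation is consistent with an extension-stripping call such as pathlib's Path.stem (or os.path.splitext), which treats the trailing ".1" as a file extension; a sketch of the symptom and the suggested character replacement:

from pathlib import Path

name = "biobert-large-cased-v1.1"
print(Path(name).stem)        # 'biobert-large-cased-v1' -- '.1' stripped as an extension

# suggested workaround: replace '.' before building the output name
safe = name.replace(".", "_")
print(Path(safe).stem)        # 'biobert-large-cased-v1_1'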
Currently, we do not have a "resume training from where it left off" function; every training run starts from a new model (at least a new linear classification layer).
We need to implement a continued-training function to support use cases such as training for more epochs on the same data, or training on new data with exactly the same labels (no new labels allowed).
This will improve both inference and training efficiency.
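A sketch of what continued training could look like, assuming the previously trained model directory is reloaded whole (classifier head included) so that no layers are re-initialized; the function and argument names here are hypothetical:

from transformers import AutoModelForTokenClassification

def continue_training(prev_model_dir, train_fn, **train_kwargs):
    # reload the fine-tuned model with its classification layer intact;
    # valid only when the new data uses exactly the same label set
    model = AutoModelForTokenClassification.from_pretrained(prev_model_dir)
    return train_fn(model, **train_kwargs)   # hand off to the existing training loop (hypothetical)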