K-RET: Knowledgeable Biomedical Relation Extraction System

K-RET is a flexible biomedical RE system that works with any pre-trained BERT-based model (e.g., SciBERT or BioBERT) and injects knowledge in the form of knowledge graphs, from a single source or from multiple sources simultaneously. This knowledge can be applied to all contextualizing tokens or only to the tokens of the candidate relation, for both single- and multi-token entities.
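
Each knowledge graph is read from a .spo file of tab-separated subject-predicate-object triples, one per line (the convention K-RET inherits from K-BERT; see Getting Started below). The triples here are illustrative only, not taken from the shipped ChEBI file:

caffeine	is_a	trimethylxanthine
aspirin	has_role	analgesic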

Our academic paper, which describes K-RET in detail, can be found here.

The uer folder contains an updated version of the toolkit developed by Zhao et al. (2019), available here.

Downloading Pre-Trained Models

To make predictions on new data, you need both a baseline model and one of our pre-trained models. To train a new model on your own data, you only need a baseline model, which can be either of the models referenced in our academic paper.

Baseline Models

After downloading a baseline model, for instance SciBERT, the model needs to be converted using the uer toolkit. To do so, run the following example, adapting the paths as needed for a different baseline model:

cd K-RET/uer/
python3 convert_bert_from_huggingface_to_uer.py --input_model_path ../models/pre_trained_model_scibert/scibert_scivocab_uncased/pytorch_model.bin --output_model_path ../models/pre_trained_model_scibert/output_model.bin
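
As a hedged sketch of what the conversion does (inferred from the fragment of convert_bert_from_huggingface_to_uer.py quoted in the issues below; only the word-embedding mapping is visible there, and the remaining layer mappings are analogous):

import torch

# Load the Hugging Face checkpoint (a plain PyTorch state dict).
input_model = torch.load("pytorch_model.bin", map_location="cpu")

output_model = {}
# Rename each tensor to the key the uer toolkit expects; this particular
# mapping appears verbatim in the conversion script.
output_model["embedding.word_embedding.weight"] = \
    input_model["bert.embeddings.word_embeddings.weight"]
# ... analogous mappings for position embeddings, encoder layers, etc.

torch.save(output_model, "output_model.bin")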

Our Models

Versions of the best-performing pre-trained models are available for download; the training details are described in our academic paper.

Getting Started

Our project includes a code adaptation of the K-BERT model available here. Use the K-RET image available on Docker Hub to set up the rest of the experimental environment.
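
For example, assuming the dpavot/kret:update tag that appears in the issue reports below (check Docker Hub for the current tag; --gpus all requires the NVIDIA Container Toolkit):

docker pull dpavot/kret:update
docker run --gpus all -it -v "$(pwd)":/workspaces/K-RET dpavot/kret:update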

Usage Example

CUDA_VISIBLE_DEVICES='1,2,3' python3 -u run_classification.py \
    --pretrained_model_path ./models/pre_trained_model_scibert/output_model.bin \
    --config_path ./models/pre_trained_model_scibert/scibert_scivocab_uncased/config.json \
    --vocab_path ./models/pre_trained_model_scibert/scibert_scivocab_uncased/vocab.txt \
    --train_path ./datasets/ddi_corpus/train.tsv \
    --dev_path ./datasets/ddi_corpus/dev.tsv \
    --test_path ./datasets/ddi_corpus/test.tsv \
    --class_weights True \
    --weights "[0.234, 3.377, 4.234, 6.535, 24.613]" \
    --epochs_num 30 \
    --batch_size 32 \
    --kg_name "['ChEBI']" \
    --output_model_path ./outputs/scibert_ddi.bin | tee ./outputs/scibert_ddi.log &

For more options, check run.sh; for additional configuration settings (e.g., max_number_entities and contextual_knowledge), check brain/config.py.
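
The exact layout of the train/dev/test TSV files is defined in run_classification.py; as a working assumption (K-RET adapts the K-BERT classification code, which reads tab-separated files with a header row of label and text_a), each line pairs an integer class label with one candidate-relation sentence. The entity-marking convention and labels below are purely illustrative; note that the five class weights in the example above match the five DDI relation classes:

label	text_a
0	@DRUG$ did not alter the pharmacokinetics of @DRUG$ in healthy subjects .
2	Coadministration of @DRUG$ may increase plasma concentrations of @DRUG$ .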

Predict New Data Example

CUDA_VISIBLE_DEVICES='0' python3 -u run_classification.py \
    --pretrained_model_path ./models/pre_trained_model_scibert/output_model.bin \
    --config_path ./models/pre_trained_model_scibert/scibert_scivocab_uncased/config.json \
    --vocab_path ./models/pre_trained_model_scibert/scibert_scivocab_uncased/vocab.txt \
    --train_path ./datasets/ddi_corpus/train.tsv \
    --dev_path ./datasets/ddi_corpus/dev.tsv \
    --class_weights True \
    --weights "[0.234, 3.377, 4.234, 6.535, 24.613]" \
    --test_path ./datasets/ddi_corpus/test.tsv \
    --epochs_num 30 \
    --batch_size 32 \
    --kg_name "[]" \
    --testing True \
    --to_test_model ./outputs/scibert_ddi.bin \
    | tee ./outputs/ddi_results.log &

Process Results Example

python3 src/process_results.py ./outputs/ddi_results.log ./datasets/ddi_corpus/test.tsv ddi_results.tsv
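
The actual parsing lives in src/process_results.py; the sketch below only illustrates the kind of post-processing this step performs, under the hypothetical assumption that the log carries one predicted label id per test instance:

import csv
import sys

log_path, test_path, out_path = sys.argv[1:4]

# Hypothetical parsing: collect log lines that carry bare label ids.
preds = []
with open(log_path) as f:
    for line in f:
        line = line.strip()
        if line.isdigit():
            preds.append(line)

# Pair each prediction with its test sentence and write a results TSV.
with open(test_path) as f, open(out_path, "w", newline="") as out:
    reader = csv.reader(f, delimiter="\t")
    writer = csv.writer(out, delimiter="\t")
    next(reader)  # skip header row (assumed, per the format sketch above)
    for row, pred in zip(reader, preds):
        writer.writerow(row + [pred])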

Reference

  • Diana Sousa and Francisco M. Couto. 2022. K-RET: Knowledgeable Biomedical Relation Extraction System. Bioinformatics.

Contributors

dpavot, sirconceicao

Issues

Cannot use the code you provide to convert roberta

When converting a RoBERTa model with the provided script, the following error occurs:

Traceback (most recent call last):
  File "convert_bert_from_huggingface_to_uer.py", line 73, in <module>
    main()
  File "convert_bert_from_huggingface_to_uer.py", line 49, in main
    output_model["embedding.word_embedding.weight"] = input_model["bert.embeddings.word_embeddings.weight"]
KeyError: 'bert.embeddings.word_embeddings.weight'
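
A likely explanation (not confirmed in this thread): the converter hard-codes BERT key names, while RoBERTa checkpoints prefix their tensors with roberta. rather than bert., so the lookup raises KeyError. Inspecting the checkpoint's keys shows which prefix a given model uses:

import torch

# Print the first few state-dict keys to reveal the naming scheme.
state_dict = torch.load("pytorch_model.bin", map_location="cpu")
for key in list(state_dict)[:5]:
    # A RoBERTa checkpoint prints e.g. roberta.embeddings.word_embeddings.weight
    print(key)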

Don't know dataset format

Thank you for providing this code, but I don't know the expected data format. Could you provide the datasets directly? I also don't know how to split the data, and among the dataset URLs you provide I am not sure which one to use (e.g., for DDI).

CUDA error

Hello. Using your Docker image dpavot/kret:update, I get an error:

root@jm-Z490-AORUS-ULTRA:/workspaces/K-RET#  CUDA_VISIBLE_DEVICES='0,1' python3 -u run_classification.py \
>     --pretrained_model_path /workspaces/K-RET/models/pre_trained_model_scibert/output_model.bin \
>     --config_path /workspaces/K-RET/models/pre_trained_model_scibert/scibert_scivocab_uncased/config.json \
>     --vocab_path /workspaces/K-RET/models/pre_trained_model_scibert/scibert_scivocab_uncased/vocab.txt \
>     --train_path /workspaces/K-RET/datasets/pgr_corpus/train.tsv \
>     --dev_path /workspaces/K-RET/datasets/pgr_corpus/dev.tsv \
>     --test_path /workspaces/K-RET/datasets/pgr_corpus/test.tsv \
>     --class_weights True \
>     --weights "[0.234, 3.377, 4.234, 6.535, 24.613]" \
>     --epochs_num 30 \
>     --batch_size 32 \
>     --kg_name "['ChEBI']" \
>     --output_model_path /workspaces/K-RET/outputs/scibert_ddi.bin | tee /workspaces/K-RET/outputs/scibert_ddi.log &
[1] 421
root@jm-Z490-AORUS-ULTRA:/workspaces/K-RET# 
root@jm-Z490-AORUS-ULTRA:/workspaces/K-RET# [nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
Vocabulary file line 30107 has bad format token
Vocabulary Size:  31090
Namespace(batch_size=32, bidirectional=False, block_size=2, class_weights='True', config_path='/workspaces/K-RET/models/pre_trained_model_scibert/scibert_scivocab_uncased/config.json', dev_path='/workspaces/K-RET/datasets/pgr_corpus/dev.tsv', dropout=0.1, emb_size=768, encoder='bert', epochs_num=30, feedforward_size=3072, heads_num=12, hidden_size=768, kernel_size=3, kg_name="['ChEBI']", labels_num=2, layers_num=12, learning_rate=2e-05, mean_reciprocal_rank=False, no_vm=False, output_model_path='/workspaces/K-RET/outputs/scibert_ddi.bin', pooling='first', pretrained_model_path='/workspaces/K-RET/models/pre_trained_model_scibert/output_model.bin', report_steps=100, seed=7, seq_length=256, sub_layers_num=2, sub_vocab_path='models/sub_vocab.txt', subencoder='avg', subword_type='none', target='bert', test_path='/workspaces/K-RET/datasets/pgr_corpus/test.tsv', testing=False, to_test_model=None, tokenizer='bert', train_path='/workspaces/K-RET/datasets/pgr_corpus/train.tsv', vocab=<uer.utils.vocab.Vocab object at 0x7f001c4a5160>, vocab_path='/workspaces/K-RET/models/pre_trained_model_scibert/scibert_scivocab_uncased/vocab.txt', warmup=0.1, weights='[0.234, 3.377, 4.234, 6.535, 24.613]', workers_num=1)
[BertClassifier] use visible_matrix: True
2 GPUs are available. Let's use them.
[KnowledgeGraph] Loading spo from /workspaces/K-RET/brain/kgs/chebi.spo
Start training.
Loading sentences from /workspaces/K-RET/datasets/pgr_corpus/train.tsv
There are 4050 sentence in total. We use 1 processes to inject knowledge into sentences.
Progress of process 0: 0/4050
Shuffling dataset
Trans data to tensor.
input_ids
label_ids
mask_ids
pos_ids
vms
Batch size:  32
The number of training instances: 4050
terminate called after throwing an instance of 'std::runtime_error'
  what():  NCCL Error 1: unhandled cuda error
Fatal Python error: Aborted

Current thread 0x00007f0108ce5740 (most recent call first):
  File "/usr/local/lib/python3.6/dist-packages/torch/cuda/comm.py", line 40 in broadcast_coalesced
  File "/usr/local/lib/python3.6/dist-packages/torch/nn/parallel/_functions.py", line 21 in forward
  File "/usr/local/lib/python3.6/dist-packages/torch/nn/parallel/replicate.py", line 13 in replicate
  File "/usr/local/lib/python3.6/dist-packages/torch/nn/parallel/data_parallel.py", line 147 in replicate
  File "/usr/local/lib/python3.6/dist-packages/torch/nn/parallel/data_parallel.py", line 142 in forward
  File "/usr/local/lib/python3.6/dist-packages/torch/nn/modules/module.py", line 489 in __call__
  File "run_classification.py", line 581 in main
  File "run_classification.py", line 622 in <module>
