
[NAACL'21 & ACL'21] SapBERT: Self-alignment pretraining for BERT & XL-BEL: Cross-Lingual Biomedical Entity Linking.

Home Page: https://www.aclweb.org/anthology/2021.naacl-main.334

License: MIT License


sapbert's Introduction

SapBERT: Self-alignment pretraining for BERT

[news | 22 Aug 2021] SapBERT is integrated into NVIDIA's deep learning toolkit NeMo as its entity linking module (thank you, NVIDIA!). You can play with it in this Google Colab notebook.


This repo holds code, data, and pretrained weights for (1) the SapBERT model presented in our NAACL 2021 paper: Self-Alignment Pretraining for Biomedical Entity Representations; (2) the cross-lingual SapBERT and a cross-lingual biomedical entity linking benchmark (XL-BEL) proposed in our ACL 2021 paper: Learning Domain-Specialised Representations for Cross-Lingual Biomedical Entity Linking.

[Figure: front-page graph]

Hugging Face Models

English Models: [SapBERT] and [SapBERT-mean-token]

Standard SapBERT as described in [Liu et al., NAACL 2021]. Trained with UMLS 2020AA (English only), using microsoft/BiomedNLP-PubMedBERT-base-uncased-abstract-fulltext as the base model. For [SapBERT], use [CLS] (before pooler) as the representation of the input; for [SapBERT-mean-token], use mean-pooling across all tokens.
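For illustration, here is a minimal sketch of the mean-pooling variant (the checkpoint name is assumed from the repo's naming scheme, and masked averaging over non-padding tokens is our reading of "mean-pooling across all tokens"):

import torch
from transformers import AutoTokenizer, AutoModel

# checkpoint name assumed from the repo's naming scheme
tokenizer = AutoTokenizer.from_pretrained("cambridgeltl/SapBERT-from-PubMedBERT-fulltext-mean-token")
model = AutoModel.from_pretrained("cambridgeltl/SapBERT-from-PubMedBERT-fulltext-mean-token")

toks = tokenizer(["covid-19", "high fever"], padding=True, return_tensors="pt")
with torch.no_grad():
    hidden = model(**toks)[0]                     # (batch, seq_len, hidden_dim)
mask = toks["attention_mask"].unsqueeze(-1)       # (batch, seq_len, 1)
mean_embs = (hidden * mask).sum(1) / mask.sum(1)  # average over non-padding tokens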

Cross-Lingual Models: [SapBERT-XLMR] and [SapBERT-XLMR-large]

Cross-lingual SapBERT as described in [Liu et al., ACL 2021]. Trained with UMLS 2020AB (all languages), using xlm-roberta-base/xlm-roberta-large as the base model. Use [CLS] (before pooler) as the representation of the input.

Environment

The code is tested with python 3.8, torch 1.7.0 and huggingface transformers 4.4.2. Please view requirements.txt for more details.
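For example, a minimal environment matching these versions could be set up as follows (the exact package list is an assumption; requirements.txt is authoritative):

>> pip install torch==1.7.0 transformers==4.4.2 numpy tqdm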

Embedding Extraction with SapBERT

The following script converts a list of strings (entity names) into embeddings.

import numpy as np
import torch
from tqdm.auto import tqdm
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("cambridgeltl/SapBERT-from-PubMedBERT-fulltext")
model = AutoModel.from_pretrained("cambridgeltl/SapBERT-from-PubMedBERT-fulltext").cuda()

# replace with your own list of entity names
all_names = ["covid-19", "Coronavirus infection", "high fever", "Tumor of posterior wall of oropharynx"]

bs = 128  # batch size during inference
all_embs = []
for i in tqdm(np.arange(0, len(all_names), bs)):
    toks = tokenizer.batch_encode_plus(all_names[i:i+bs],
                                       padding="max_length",
                                       max_length=25,
                                       truncation=True,
                                       return_tensors="pt")
    toks_cuda = {k: v.cuda() for k, v in toks.items()}  # move the batch to GPU
    with torch.no_grad():  # no gradients needed at inference time
        cls_rep = model(**toks_cuda)[0][:, 0, :]  # use the [CLS] representation as the embedding
    all_embs.append(cls_rep.cpu().numpy())

all_embs = np.concatenate(all_embs, axis=0)
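With the embeddings computed, linking reduces to nearest-neighbour search against a dictionary of embedded names. A minimal sketch (dict_cuis is a hypothetical placeholder; in practice you would embed your own UMLS or custom dictionary, and may want to L2-normalise embeddings for cosine similarity):

# `dict_cuis` is a hypothetical list of concept IDs, one per dictionary name
dict_embs = all_embs                   # in practice: embeddings of all dictionary names
dict_cuis = ["CUI_1", "CUI_2", "CUI_3", "CUI_4"]

query_embs = all_embs[:1]              # toy query: link the first name against the dictionary
scores = query_embs @ dict_embs.T      # dot-product similarity, shape (n_queries, n_dict)
nearest = scores.argmax(axis=1)        # closest dictionary entry per query
predictions = [dict_cuis[j] for j in nearest]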

Please see inference/inference_on_snomed.ipynb for a more extensive inference example.

Train SapBERT

Extract training data from UMLS as instructed in training_data/generate_pretraining_data.ipynb (we cannot directly release the training file due to licensing issues).

Run:

>> cd train/
>> ./pretrain.sh 0,1 

where 0,1 specifies the GPU devices.

For finetuning on your customised dataset, generate data in the format of

concept_id || entity_name_1 || entity_name_2
...

where entity_name_1 and entity_name_2 are a synonym pair (belonging to the same concept concept_id) sampled from a given labelled dataset. If a concept is associated with multiple entity names in the dataset, you can enumerate all pairwise combinations, as sketched below.
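For illustration, a minimal sketch of that pairwise expansion (concept2names is a hypothetical stand-in for your own labelled dataset; the exact delimiter spacing should match what training_data/generate_pretraining_data.ipynb produces):

from itertools import combinations

# hypothetical mapping from concept IDs to their entity names in your dataset
concept2names = {
    "C0000001": ["covid-19", "Coronavirus infection"],
    "C0000002": ["high fever", "pyrexia", "elevated temperature"],
}

with open("training_file.txt", "w") as f:
    for cui, names in concept2names.items():
        for name1, name2 in combinations(names, 2):  # all pairwise combinations
            f.write(f"{cui} || {name1} || {name2}\n")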

For cross-lingual SAP-tuning with general-domain parallel data (MUSE, Wikipedia titles, or both), the data can be found in training_data/general_domain_parallel_data/. An example script: train/xling_train.sh.

Evaluate SapBERT

For evaluation (both monolingual and cross-lingual), please view evaluation/README.md for details. evaluation/xl_bel/ contains the XL-BEL benchmark proposed in [Liu et al., ACL 2021].

Citations

SapBERT:

@inproceedings{liu2021self,
	title={Self-Alignment Pretraining for Biomedical Entity Representations},
	author={Liu, Fangyu and Shareghi, Ehsan and Meng, Zaiqiao and Basaldella, Marco and Collier, Nigel},
	booktitle={Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies},
	pages={4228--4238},
	month = jun,
	year={2021}
}

Cross-lingual SapBERT and XL-BEL:

@inproceedings{liu2021learning,
	title={Learning Domain-Specialised Representations for Cross-Lingual Biomedical Entity Linking},
	author={Liu, Fangyu and Vuli{\'c}, Ivan and Korhonen, Anna and Collier, Nigel},
	booktitle={Proceedings of ACL-IJCNLP 2021},
	pages = {565--574},
	month = aug,
	year={2021}
}

Acknowledgement

Parts of the code are modified from BioSyn. We thank the BioSyn authors for open-sourcing their code.

License

SapBERT is MIT licensed. See the LICENSE file for details.

sapbert's People

Contributors

aenglebert, hardyqr


sapbert's Issues

Statistical significance test

Dear authors,

Could you please share the Python script for the statistical significance computation? I believe you used a t-test.

Thank you in advance

Entity Span --> CUI

Hi, what's the simplest way to go from an entity span to a CUI? It seems the HuggingFace model just gets you the [CLS] hidden-state representation, and you need to use that to find the nearest neighbour in UMLS. But it wasn't clear how to build that UMLS index.

dictionary for custom dataset

Hello and thank you for sharing this project.

I want to evaluate this model on my custom dataset. How can I create a dictionary for that?

requirements.txt doesn't help resolve the requirements.

I cannot fulfil the correct requirements for executing your kindly prepared Google Colab sheet. Below I summarize my failures.

First cell ERROR:

tensorflow 2.9.2 requires protobuf<3.20,>=3.9.2, but you have protobuf 3.20.3 which is incompatible.
tensorflow 2.9.2 requires tensorboard<2.10,>=2.9, but you have tensorboard 2.11.2 which is incompatible.

I don't know if the above error poses a problem; however, I get another error (something like):

ModuleNotFoundError: No module named 'torchtext.legacy'.

Subsequently I downgraded (per a Stack Overflow recommendation) using !pip install torchtext==0.10.0, which gives me this error:
torchvision 0.14.1+cu116 requires torch==1.13.1, but you have torch 1.9.0 which is incompatible.

So I upgraded torch to 1.13.1. Then I get:

     11 from torchtext.data.utils import RandomShuffler
     12 from .example import Example
---> 13 from torchtext.utils import download_from_url, unicode_csv_reader
     14 
     15 

ImportError: cannot import name 'unicode_csv_reader' from 'torchtext.utils' (/usr/local/lib/python3.8/dist-packages/torchtext/utils.py)

At some point I also get

torchvision 0.14.1+cu116 requires torch==1.13.1, but you have torch 1.7.0 which is incompatible.
torchtext 0.10.0 requires torch==1.9.0, but you have torch 1.7.0 which is incompatible.
torchaudio 0.13.1+cu116 requires torch==1.13.1, but you have torch 1.7.0 which is incompatible.

mention detection (how to?)

Hi SapBERT authors, this code is so valuable!

With this code, I can link biomedical entities to UMLS entities. In my case, I want to give a sentence query, as needed for the evaluations of the unsupervised trained model. I wonder how the mention detection was performed (it is not described in the paper)?

1. ko_1k_test_query_with_context.txt, 2. finetuning, and 3. inference api

Hi, I have several questions.

First, in xl_bel, I found ko_1k_test_query_with_context.txt. Many of the sentences in the file are unrelated to diseases. I want to know why and how you used this file.

Second, we are trying to give input text in Korean and get output in English; the main medical area will be rheumatology. Can you tell me the procedure to implement this?

Finally, I asked on Hugging Face (cambridgeltl/SapBERT-UMLS-2020AB-all-lang-from-XLMR) about the output values of the Inference API.
My assumption is that the row values relate to disease IDs and the column values to the input text. Is that right?
I want to know what the rows and columns mean, and, if the row values are disease IDs as I thought, where I can find the exact disease names matching those IDs.

Thanks

MedMentions Dictionary file created

Hello Team
Thank you for your great contribution. Can you please explain how the MedMentions dictionary file was created for running the evaluations?

Thanks
Saranya

Details on fine-tuning data

Hello and thanks for sharing this great project.

Regarding fine-tuning of SapBERT, the README states

For finetuning on your customised dataset, generate data in the format of [...]
where entity_name_1 and entity_name_2 are synonym pairs (belonging to the same concept concept_id) sampled from a given labelled dataset.

Are there any examples on how this looks exactly for the datasets (NCBI Disease, Cometa, etc.) used in the evaluation?

Tokenizer

Is there any information on the tokenizer from HuggingFace?

tokenizer = AutoTokenizer.from_pretrained("cambridgeltl/SapBERT-from-PubMedBERT-fulltext")

I assume it's the same as PubMedBERT, which I presume uses an 'in-domain' vocabulary. Would just love confirmation, thanks!
