character-bert's Introduction

CharacterBERT

This is the code repository for the paper "CharacterBERT: Reconciling ELMo and BERT for Word-Level Open-Vocabulary Representations From Characters", which was published at COLING 2020.

2021-02-25: Code for pre-training BERT and CharacterBERT is now available here!

Paper summary

TL;DR

CharacterBERT is a variant of BERT that produces contextual representations at the word level.

This is achieved by attending to the characters of each input token and dynamically building token representations from them. Contrary to standard BERT, which relies on a matrix of pre-defined wordpiece embeddings, this approach uses a CharacterCNN module, similar to ELMo's, that can generate representations for arbitrary input tokens.


The figure above shows how context-independent representations are built in BERT vs. how they are built in CharacterBERT. Here, we assume that "Apple" is an unknown token: BERT splits it into two wordpieces, "Ap" and "##ple", and embeds each unit. CharacterBERT, on the other hand, processes the token "Apple" as is, then attends to its characters to produce a single token embedding.
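
To make the difference concrete, here is a minimal sketch in the spirit of the figure. It assumes the repository's CharacterIndexer and a standard BERT tokenizer; the exact wordpiece split depends on the vocabulary, so the comments are only indicative:

from transformers import BertTokenizer
from utils.character_cnn import CharacterIndexer

# Standard BERT: tokens outside the wordpiece vocabulary are split into several subword units
wordpiece_tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
print(wordpiece_tokenizer.tokenize("characterbert"))  # e.g. ['character', '##bert'], depending on the vocabulary

# CharacterBERT: every token is kept whole and mapped to a fixed number of character indices
indexer = CharacterIndexer()
char_ids = indexer.as_padded_tensor([['characterbert']])
print(char_ids.shape)  # torch.Size([1, 1, 50]): one sequence, one token, 50 character slots per token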

Motivations

CharacterBERT has two main motivations:

  • It is common to adapt the original BERT to new specialized domains (e.g. the medical, legal, or scientific domain) by simply re-training it on a set of specialized corpora. As a result, the original (general-domain) wordpiece vocabulary is re-used even though the final model is targeted at a different, potentially highly specialized domain, which is arguably suboptimal (see Section 2 of the paper).

    A straightforward solution would be to train a new BERT from scratch with a specialized wordpiece vocabulary. However, training a single BERT is already costly enough, let alone training one for each and every domain of interest.

  • BERT uses a wordpiece system to strike a good balance between the specificity of tokens and the flexibility of characters. However, working with subwords is not the most convenient in practice (should we average the wordpiece representations to recover the original token embedding for word-similarity tasks? should we only use the first wordpiece of each token in sequence-labelling tasks? ...), and most people would simply prefer to work with tokens.

Inspired by ELMo, we use a CharacterCNN module and obtain a variant of BERT that produces word-level contextual representations and can be re-adapted as many times as necessary, on any domain, without ever needing to worry about the suitability of any wordpieces. As a cherry on top, attending to the characters of each input token also makes the model more robust to typos and misspellings (see Section 5.5 of the paper).

How do I use CharacterBERT?

Installation

We recommend using a virtual environment dedicated to CharacterBERT.

If you do not already have conda installed, you can install Miniconda from this link. Then, check that your conda is up to date:

conda update -n base -c defaults conda

Create a fresh conda environment:

conda create python=3.10 --name=character-bert

If not already activated, activate the new conda environment using:

conda activate character-bert

Then install the following packages:

conda install pytorch cudatoolkit=11.8 -c pytorch
pip install transformers==4.34.0 scikit-learn==1.3.1 gdown==4.7.1

Note 1: If you will not be running experiments on a GPU, install PyTorch via this command instead:
conda install pytorch cpuonly -c pytorch

Note 2: If you just want to be able to load pre-trained CharacterBERT weights, you do not have to install scikit-learn, which is only used for computing Precision/Recall/F1 metrics during evaluation.
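
To quickly check that the environment is set up correctly, you can run a short sanity check (the versions shown in the comments are the ones installed by the commands above; yours may differ):

import torch
import transformers

print(torch.__version__)          # PyTorch version installed via conda
print(transformers.__version__)   # e.g. 4.34.0
print(torch.cuda.is_available())  # True only if you installed a GPU build that matches your drivers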

Pre-trained models

You can use the download.py script to download any of the models below:

Keyword | Model description
general_character_bert | General-domain CharacterBERT pre-trained from scratch on English Wikipedia and OpenWebText.
medical_character_bert | Medical-domain CharacterBERT initialized from general_character_bert, then further pre-trained on MIMIC-III clinical notes and PMC OA biomedical paper abstracts.
general_bert | General-domain BERT pre-trained from scratch on English Wikipedia and OpenWebText. [1]
medical_bert | Medical-domain BERT initialized from general_bert, then further pre-trained on MIMIC-III clinical notes and PMC OA biomedical paper abstracts. [2]
bert-base-uncased | The original general-domain BERT (base, uncased).

[1], [2] We offer BERT models as well as CharacterBERT models since we pre-trained both architectures in an effort to compare them fairly. Our BERT models use the same architecture and starting wordpiece vocabulary as bert-base-uncased.

For instance, let's download the medical version of CharacterBERT:

python download.py --model='medical_character_bert'

We can also download all models in a single command:

python download.py --model='all'
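
After downloading, each model should end up in its own folder under pretrained-models/, named after the corresponding keyword. A quick way to check, assuming the default download location:

import os
print(sorted(os.listdir('./pretrained-models/')))
# e.g. ['bert-base-uncased', 'general_character_bert', 'medical_character_bert', ...]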

Using CharacterBERT in practice

CharacterBERT's architecture is almost identical to BERT's, so you can easily adapt any code that uses the Transformers library.

Example 1: getting word embeddings from CharacterBERT

"""Basic example: getting word embeddings from CharacterBERT"""
from transformers import BertTokenizer
from modeling.character_bert import CharacterBertModel
from utils.character_cnn import CharacterIndexer

# Example text
x = "Hello World!"

# Tokenize the text
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
x = tokenizer.basic_tokenizer.tokenize(x)

# Add [CLS] and [SEP]
x = ['[CLS]', *x, '[SEP]']

# Convert token sequence into character indices
indexer = CharacterIndexer()
batch = [x]  # This is a batch with a single token sequence x
batch_ids = indexer.as_padded_tensor(batch)

# Load some pre-trained CharacterBERT
model = CharacterBertModel.from_pretrained(
    './pretrained-models/medical_character_bert/')

# Feed batch to CharacterBERT & get the embeddings
embeddings_for_batch = model(batch_ids, return_dict=False)[0]
embeddings_for_x = embeddings_for_batch[0]
print('These are the embeddings produced by CharacterBERT (last transformer layer)')
for token, embedding in zip(x, embeddings_for_x):
    print(token, embedding)
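
If you need a single vector per sequence (e.g. for sentence-level tasks), one simple option is to pool the word-level embeddings. This is only a sketch of one possible choice (mean pooling over tokens), not a method prescribed by the paper:

import torch

# Average the word-level embeddings of the sequence into a single sentence vector
sentence_embedding = torch.mean(embeddings_for_x, dim=0)
print(sentence_embedding.shape)  # torch.Size([768])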

Example 2: using CharacterBERT for binary classification

""" Basic example: using CharacterBERT for binary classification """
from transformers import BertForSequenceClassification, BertConfig
from modeling.character_bert import CharacterBertModel

#### LOADING BERT FOR CLASSIFICATION ####

config = BertConfig.from_pretrained('bert-base-uncased', num_labels=2)  # binary classification
model = BertForSequenceClassification(config=config)

model.bert.embeddings.word_embeddings  # wordpiece embeddings
>>> Embedding(30522, 768, padding_idx=0)

#### REPLACING BERT WITH CHARACTER_BERT ####

character_bert_model = CharacterBertModel.from_pretrained(
    './pretrained-models/medical_character_bert/')
model.bert = character_bert_model

model.bert.embeddings.word_embeddings  # wordpieces are replaced with a CharacterCNN
>>> CharacterCNN(
        (char_conv_0): Conv1d(16, 32, kernel_size=(1,), stride=(1,))
        (char_conv_1): Conv1d(16, 32, kernel_size=(2,), stride=(1,))
        (char_conv_2): Conv1d(16, 64, kernel_size=(3,), stride=(1,))
        (char_conv_3): Conv1d(16, 128, kernel_size=(4,), stride=(1,))
        (char_conv_4): Conv1d(16, 256, kernel_size=(5,), stride=(1,))
        (char_conv_5): Conv1d(16, 512, kernel_size=(6,), stride=(1,))
        (char_conv_6): Conv1d(16, 1024, kernel_size=(7,), stride=(1,))
        (_highways): Highway(
        (_layers): ModuleList(
            (0): Linear(in_features=2048, out_features=4096, bias=True)
            (1): Linear(in_features=2048, out_features=4096, bias=True)
        )
        )
        (_projection): Linear(in_features=2048, out_features=768, bias=True)
    )

#### PREPARING RAW TEXT ####

from transformers import BertTokenizer
from utils.character_cnn import CharacterIndexer

text = "CharacterBERT attends to each token's characters"
bert_tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
tokenized_text = bert_tokenizer.basic_tokenizer.tokenize(text) # this is NOT wordpiece tokenization

tokenized_text
>>> ['characterbert', 'attends', 'to', 'each', 'token', "'", 's', 'characters']

indexer = CharacterIndexer()  # This converts each token into a list of character indices
input_tensor = indexer.as_padded_tensor([tokenized_text])  # we build a batch of only one sequence
input_tensor.shape
>>> torch.Size([1, 8, 50])  # (batch_size, sequence_length, max_characters_per_token)

#### USING CHARACTER_BERT FOR INFERENCE ####

output = model(input_tensor, return_dict=False)[0]
>>> tensor([[-0.3378, -0.2772]], grad_fn=<AddmmBackward>)  # class logits
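
Since model is a regular BertForSequenceClassification, you can also obtain a classification loss by passing labels, which is enough for a basic training step. Below is a minimal sketch with a made-up label and learning rate:

import torch

optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)  # hypothetical learning rate

labels = torch.tensor([1])  # hypothetical gold label for the single sequence in the batch
loss, logits = model(input_tensor, labels=labels, return_dict=False)[:2]

loss.backward()
optimizer.step()
optimizer.zero_grad()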

For more complete (but still illustrative) examples, you can refer to the run_experiments.sh script, which runs a few Classification/SequenceLabelling experiments using BERT/CharacterBERT.

bash run_experiments.sh

You can adapt the run_experiments.sh script to try out any of the available models. You should also be able to add real classification and sequence-labelling tasks by adapting the data.py script.

Running experiments on GPUs

In order to use GPUs, you will need to make sure that the PyTorch version in your conda environment matches your machine's configuration. To check this, you may want to run a few tests.

Let's assume you want to use the GPU n°0 on your machine. Then set:

export CUDA_VISIBLE_DEVICES=0

Then run these commands in a Python shell to check whether PyTorch can detect your GPU:

import torch
print(torch.cuda.is_available())  # Should return `True`

If the last command returns False, there is probably a mismatch between the installed PyTorch version and your machine's configuration. To fix that, run nvidia-smi in your terminal and check your driver version:

(Figure: example nvidia-smi output showing the installed driver version.)

Then compare this version with the numbers given in the NVIDIA CUDA Toolkit Release Notes:

(Figure: table from the NVIDIA release notes mapping each CUDA Toolkit version to the minimum required driver version.)

In this example, the displayed driver version is 390.116, which corresponds to CUDA 9.0. This means that the appropriate command for installing PyTorch would be:

conda install pytorch cudatoolkit=9.0 -c pytorch

Now, everything should work fine!
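
Once torch.cuda.is_available() returns True, remember to move both the model and its inputs to the GPU before running inference. A minimal sketch, reusing the variable names from Example 1:

import torch

device = torch.device('cuda')  # assumes CUDA_VISIBLE_DEVICES is set as above
model = model.to(device)
batch_ids = batch_ids.to(device)

with torch.no_grad():
    embeddings_for_batch = model(batch_ids, return_dict=False)[0]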

References

Please cite our paper if you use CharacterBERT in your work:

@inproceedings{el-boukkouri-etal-2020-characterbert,
    title = "{C}haracter{BERT}: Reconciling {ELM}o and {BERT} for Word-Level Open-Vocabulary Representations From Characters",
    author = "El Boukkouri, Hicham  and
      Ferret, Olivier  and
      Lavergne, Thomas  and
      Noji, Hiroshi  and
      Zweigenbaum, Pierre  and
      Tsujii, Jun{'}ichi",
    booktitle = "Proceedings of the 28th International Conference on Computational Linguistics",
    month = dec,
    year = "2020",
    address = "Barcelona, Spain (Online)",
    publisher = "International Committee on Computational Linguistics",
    url = "https://www.aclweb.org/anthology/2020.coling-main.609",
    doi = "10.18653/v1/2020.coling-main.609",
    pages = "6903--6915",
    abstract = "Due to the compelling improvements brought by BERT, many recent representation models adopted the Transformer architecture as their main building block, consequently inheriting the wordpiece tokenization system despite it not being intrinsically linked to the notion of Transformers. While this system is thought to achieve a good balance between the flexibility of characters and the efficiency of full words, using predefined wordpiece vocabularies from the general domain is not always suitable, especially when building models for specialized domains (e.g., the medical domain). Moreover, adopting a wordpiece tokenization shifts the focus from the word level to the subword level, making the models conceptually more complex and arguably less convenient in practice. For these reasons, we propose CharacterBERT, a new variant of BERT that drops the wordpiece system altogether and uses a Character-CNN module instead to represent entire words by consulting their characters. We show that this new model improves the performance of BERT on a variety of medical domain tasks while at the same time producing robust, word-level, and open-vocabulary representations.",
}


character-bert's Issues

CharBERT out of date with latest Transformers? + using charBERT to output probability of sequence of characters?

Hello, thank you very much for the package!

So as I began to use it, I noticed that it seems to only run with transformers v3.5.0; the package is now on v4.4.2, so it's been moving fast. The new changes mean that the core modeling functions don't work because of the reorganization of the BERT internals. I managed to get it to work by installing an older version of Transformers, but I thought I'd give you a heads-up.

Also, using the CharacterBERT model, can I use it as a language model to tell me the probability of a particular sequence? In the sense that correct sequences (actual words) are scored with a high probability, and garble or nonsense is scored with a low probability? How exactly would I poke around the layers to get what I want?

Thank you again!

Word-level padding vs Character-level padding

Hi @helboukkouri
The maximum number of characters in a word is set to 50, so a word with 5 characters gets padded to 50 character slots. For this character-level padding, a value of 260 is used for each character.
Then, to make every sentence in a batch the same length, padding is also done with whole words: to pad a sentence of 5 words to a length of 8, three PAD tokens are added. Each PAD token is also 50 characters long, but each of its characters gets a padding value of ZERO.
Why are you using two different types of padding?
Another thing: after converting each word to ids, you add 1 to each id (in the file character-bert/utils/character_cnn.py, line 125, in the function def convert_word_to_char_ids(self, word: str) -> List[int]:), and the comment says # +1 one for masking.
What is the reason for adding 1?
Thanks in advance.
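
(For reference, here is a small sketch that makes the two padding levels described above visible; it only uses the CharacterIndexer from this repository, and the exact padding values are best checked directly in utils/character_cnn.py.)

from utils.character_cnn import CharacterIndexer

indexer = CharacterIndexer()
batch = [['hello', 'world'], ['hi']]  # two sequences of different lengths
batch_ids = indexer.as_padded_tensor(batch)

print(batch_ids.shape)  # torch.Size([2, 2, 50]): every token gets 50 character slots
print(batch_ids[0, 0])  # character ids (plus character-level padding) for the word "hello"
print(batch_ids[1, 1])  # the word-level PAD token appended to the shorter sequence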

Unrolling of the word representations for the loss function

I was reading the paper and wondering how exactly you handle the unrolling of the word representations for the loss function. I found the paper lacking a bit in detail, and I also wasn't able to easily find it in your implementation.
Could you tell me how CharacterBERT handles that?

Thanks
XMaster

Small error in the Readme

Hello, thanks a lot for developing character-bert.

I was trying character-bert but when I execute the following command:

tokenizer = BertTokenizer.from_pretrained('./pretrained-models/bert-base-uncased/')

I had the following error:

OSError: Model name './pretrained-models/bert-base-uncased/' was not found in tokenizers model name list (bert-base-uncased, bert-large-uncased, bert-base-cased, bert-large-cased, bert-base-multilingual-uncased, bert-base-multilingual-cased, bert-base-chinese, bert-base-german-cased, bert-large-uncased-whole-word-masking, bert-large-cased-whole-word-masking, bert-large-uncased-whole-word-masking-finetuned-squad, bert-large-cased-whole-word-masking-finetuned-squad, bert-base-cased-finetuned-mrpc, bert-base-german-dbmdz-cased, bert-base-german-dbmdz-uncased, TurkuNLP/bert-base-finnish-cased-v1, TurkuNLP/bert-base-finnish-uncased-v1, wietsedv/bert-base-dutch-cased). We assumed './pretrained-models/bert-base-uncased/' was a path, a model identifier, or url to a directory containing vocabulary files named ['vocab.txt'] but couldn't find such vocabulary files at this path or url.

The problem is pretty trivial to solve, just replace './pretrained-models/bert-base-uncased/' by 'bert-base-uncased'

Regards,

BUG in the downloader

The links it contain (google drive) do not lead to the end tar file but to an HTML page

So it downloads them and tries to untar them, which of course leads to an error

There is some code to correct for this but it does not work

CharacterCNN mask not used?

Hello,

Thank you for sharing the code and making your experiments reproducible.
I have a few questions regarding the CharacterCNN class:

  • the variable mask is not used in the forward pass
  • the same goes for these two variables:
      self._beginning_of_sentence_characters = torch.from_numpy(
           numpy.array(CharacterMapper.beginning_of_sentence_characters) + 1
       )
       self._end_of_sentence_characters = torch.from_numpy(
           numpy.array(CharacterMapper.end_of_sentence_characters) + 1
       )

Is this intended?

MEDNLI noisy text

Hello, thank you for your work. Could you please provide the code used to create the noisy text from the MEDNLI dataset?

Thank you in advance.

Error when Downloading Pretrained models

Hi,

This library is such a great idea!

I am trying to download the general_character_bert and medical_character_bert pre-trained models. When I run download.py, I get some errors. When I downloaded the files manually, I discovered that both files are the same... so maybe there is an error in the links? You may also need to change the permission to 'Anyone with the link'.

Thanks!

I get these errors:

26/04/2024 12:06:50 - INFO - download.py - Downloading general_character_bert model (~200MB tar.xz archive)
Access denied with the following error:

Cannot retrieve the public link of the file. You may need to change
the permission to 'Anyone with the link', or have had many accesses. 

You may still be able to access the file from the browser:

 https://drive.google.com/uc?id=11-kSfIwSWrPno6A4VuNFWuQVYD8Bg_aZ 

26/04/2024 12:06:51 - INFO - download.py - Extracting model from archive (~420MB folder)
Traceback (most recent call last):
  File "/Users/mauriciotoro/miniconda/envs/character-bert/lib/python3.10/runpy.py", line 196, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/Users/mauriciotoro/miniconda/envs/character-bert/lib/python3.10/runpy.py", line 86, in _run_code
    exec(code, run_globals)
  File "/Users/mauriciotoro/Desktop/git/ml-categorization/playground/character_bert/download.py", line 80, in <module>
    main()
  File "/Users/mauriciotoro/Desktop/git/ml-categorization/playground/character_bert/download.py", line 77, in main
    download_model(name=args.model)
  File "/Users/mauriciotoro/Desktop/git/ml-categorization/playground/character_bert/download.py", line 54, in download_model
    tar = tarfile.open(file_destination, "r:xz")
  File "/Users/mauriciotoro/miniconda/envs/character-bert/lib/python3.10/tarfile.py", line 1824, in open
    return func(name, filemode, fileobj, **kwargs)
  File "/Users/mauriciotoro/miniconda/envs/character-bert/lib/python3.10/tarfile.py", line 1930, in xzopen
    fileobj = LZMAFile(fileobj or name, mode, preset=preset)
  File "/Users/mauriciotoro/miniconda/envs/character-bert/lib/python3.10/lzma.py", line 120, in __init__
    self._fp = builtins.open(filename, mode)
FileNotFoundError: [Errno 2] No such file or directory: 'pretrained-models/model.tar.xz'

CharacterBertTokenizer support for latin characters

Hi, I was using the code from https://github.com/helboukkouri/transformers/tree/add-character-bert to generate the input_ids for the model that I trained for the Portuguese language when I noticed the following situation:

I was using this code to generate the input for the model:

text = "Você e maçã"
tokenizer = CharacterBertTokenizer()
tokens = tokenizer.tokenize(text) # this returns ['voce', 'e', 'maca']
input_ids = tokenizer.encode(text, return_tensors='pt')

and the value of the tokens variable made me wonder whether CharacterBertTokenizer supports Latin characters such as ç, ê and ã (since they were replaced by c, e and a respectively), or whether I am using that class in the wrong way. Could you please clarify this for me?

Printing character level vectors

Hi,

You're printing words and their embeddings using:

for token, embedding in zip(x, embeddings_for_x):
    print(token, embedding)

How can I see each letter's vector?

Access hidden layers of character-bert

Hi @helboukkouri,

Thanks for sharing this amazing work !

I wanted to extract the activations of the 13 hidden layers as well. To do that, I pass output_hidden_states=True to the encoder in the forward function. Now the model returns a tuple with 3 entries, where the 3rd entry contains a list of activations for the 13 layers. Please let me know if this is the correct way.

The outputs of each layer are of shape [Num_words, 11, 768]. Would it be alright to take the average across the 11 characters to get the intermediate word representation, or would you recommend something else? Actually, I was expecting the output of each layer to have the form [Num_words*11, 768] (similar to wordpieces) or simply [Num_words, 768].

Kindly help me in understanding this aspect.
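
(A hedged sketch of what this typically looks like when the model returns plain tuples, matching the 3-entry output described above; variable names follow Example 1 of the README.)

# Request all hidden states; the third output entry then holds one tensor per layer
outputs = model(batch_ids, output_hidden_states=True, return_dict=False)
hidden_states = outputs[2]  # embedding output + one tensor per transformer layer

print(len(hidden_states))       # 13
print(hidden_states[-1].shape)  # (batch_size, number_of_word_level_tokens, 768)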

How do I use word embeddings?

Hi @helboukkouri, from the example I get embeddings for each token. I am thinking of getting a representation of the whole sentence for downstream tasks such as sentence similarity, classification, etc. My idea is to feed the word embeddings directly into an LSTM layer to represent the whole sentence. Do you think this is the right way? Thanks!

MLM head pre-trained weights of character-bert

Hi,

I just found that the input embeddings of the CharacterBERT encoder are not tied to the output MLM weights. I would appreciate it if you could also release the pre-trained general_character_bert model with the MLM head's pre-trained weights, as our group would like to explore further pre-training with CharacterBERT.

Many Thanks

Pretrain character bert with new data

Hi @helboukkouri, one more question: if I want to pre-train CharacterBERT on a new, large dataset, starting from general_character_bert, what should I do? In Hugging Face Transformers, there is a run_language_modeling script that can continue training from bert-base or other models.

tarfile.ReadError: not an lzma file

When downloading a model using download.py, there is some trouble. To avoid it, I had to download the archive directly using the URL at the beginning of the script and decompress it myself into the pretrained-models folder.

Traceback (most recent call last):
  File "download.py", line 105, in <module>
    main()
  File "download.py", line 102, in main
    download_model(name=args.model)
  File "download.py", line 79, in download_model
    tar = tarfile.open(file_destination, "r:xz")
  File "/users/troux/.conda/envs/character-bert/lib/python3.8/tarfile.py", line 1621, in open
    return func(name, filemode, fileobj, **kwargs)
  File "/users/troux/.conda/envs/character-bert/lib/python3.8/tarfile.py", line 1734, in xzopen
    raise ReadError("not an lzma file")
tarfile.ReadError: not an lzma file

Regards,
