I'm encountering a RuntimeError — "mat1 and mat2 shapes cannot be multiplied (2x512 and 768x1)" — when testing the fit and predict methods of a belt_nlp BertClassifierWithPooling model built on a pretrained model. From the shapes in the message, it looks like the pretrained model produces 512-dimensional embeddings while the classifier's linear head expects 768-dimensional input. Has anyone encountered this issue before, and if so, do you have any suggestions on how to resolve it?
Full error log:
---------------------------------------------------------------------------
RuntimeError Traceback (most recent call last)
Cell In[22], line 1
----> 1 model.fit(X_train, y_train, epochs=1)
File /opt/conda/lib/python3.10/site-packages/belt_nlp/bert.py:80, in BertClassifier.fit(self, x_train, y_train, epochs)
76 dataloader = DataLoader(
77 dataset, sampler=RandomSampler(dataset), batch_size=self.batch_size, collate_fn=self.collate_fn
78 )
79 for epoch in range(epochs):
---> 80 self._train_single_epoch(dataloader, optimizer)
File /opt/conda/lib/python3.10/site-packages/belt_nlp/bert.py:126, in BertClassifier._train_single_epoch(self, dataloader, optimizer)
123 for step, batch in enumerate(dataloader):
125 labels = batch[-1].float().cpu()
--> 126 predictions = self._evaluate_single_batch(batch)
127 loss = cross_entropy(predictions, labels) / self.accumulation_steps
128 loss.backward()
File /opt/conda/lib/python3.10/site-packages/belt_nlp/bert_with_pooling.py:124, in BertClassifierWithPooling._evaluate_single_batch(self, batch)
119 attention_mask_combined_tensors = torch.stack(
120 [torch.tensor(x).to(self.device) for x in attention_mask_combined]
121 )
123 # get model predictions for the combined batch
--> 124 preds = self.neural_network(input_ids_combined_tensors, attention_mask_combined_tensors)
126 preds = preds.flatten().cpu()
128 # split result preds into chunks
File /opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py:1518, in Module._wrapped_call_impl(self, *args, **kwargs)
1516 return self._compiled_call_impl(*args, **kwargs) # type: ignore[misc]
1517 else:
-> 1518 return self._call_impl(*args, **kwargs)
File /opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py:1527, in Module._call_impl(self, *args, **kwargs)
1522 # If we don't have any hooks, we want to skip the rest of the logic in
1523 # this function, and just call forward.
1524 if not (self._backward_hooks or self._backward_pre_hooks or self._forward_hooks or self._forward_pre_hooks
1525 or _global_backward_pre_hooks or _global_backward_hooks
1526 or _global_forward_hooks or _global_forward_pre_hooks):
-> 1527 return forward_call(*args, **kwargs)
1529 try:
1530 result = None
File /opt/conda/lib/python3.10/site-packages/belt_nlp/bert.py:180, in BertClassifierNN.forward(self, input_ids, attention_mask)
177 x = x[0][:, 0, :] # take <s> token (equiv. to [CLS])
179 # classification head
--> 180 x = self.linear(x)
181 x = self.sigmoid(x)
182 return x
File /opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py:1518, in Module._wrapped_call_impl(self, *args, **kwargs)
1516 return self._compiled_call_impl(*args, **kwargs) # type: ignore[misc]
1517 else:
-> 1518 return self._call_impl(*args, **kwargs)
File /opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py:1527, in Module._call_impl(self, *args, **kwargs)
1522 # If we don't have any hooks, we want to skip the rest of the logic in
1523 # this function, and just call forward.
1524 if not (self._backward_hooks or self._backward_pre_hooks or self._forward_hooks or self._forward_pre_hooks
1525 or _global_backward_pre_hooks or _global_backward_hooks
1526 or _global_forward_hooks or _global_forward_pre_hooks):
-> 1527 return forward_call(*args, **kwargs)
1529 try:
1530 result = None
File /opt/conda/lib/python3.10/site-packages/torch/nn/modules/linear.py:114, in Linear.forward(self, input)
113 def forward(self, input: Tensor) -> Tensor:
--> 114 return F.linear(input, self.weight, self.bias)
RuntimeError: mat1 and mat2 shapes cannot be multiplied (2x512 and 768x1)