
BanglaBERT

This repository contains the official release of the model "BanglaBERT" and associated downstream fine-tuning code and datasets introduced in the paper titled "BanglaBERT: Language Model Pretraining and Benchmarks for Low-Resource Language Understanding Evaluation in Bangla" published in Findings of the Association for Computational Linguistics: NAACL 2022.

Updates

  • We have released BanglaBERT (small). It can be fine-tuned with as little as 4 GB VRAM!
  • We have released a large variant of BanglaBERT! Have a look here.
  • The Bangla2B+ pretraining corpus is now available upon request! See here.

Table of Contents

  • Models
  • Datasets
  • Setup
  • Training & Evaluation
  • Benchmarks
  • Acknowledgements
  • License
  • Citation

Models

The pretrained model checkpoints are available on the Hugging Face model hub.

To use these models for the supported downstream tasks in this repository, see Training & Evaluation.

Note: These models were pretrained using a specific normalization pipeline available here. All fine-tuning scripts in this repository use this normalization by default. If you need to adapt the pretrained model for a different task, make sure the text units are normalized using this pipeline before tokenizing, to get the best results. A basic example is available at the model page; a minimal sketch is also shown below.
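
For instance, the following sketch (based on the basic example, assuming the normalizer package from https://github.com/csebuetnlp/normalizer is installed) shows the intended order of operations: normalize first, then tokenize.

from transformers import AutoModelForPreTraining, AutoTokenizer
from normalizer import normalize  # pip install git+https://github.com/csebuetnlp/normalizer

model = AutoModelForPreTraining.from_pretrained("csebuetnlp/banglabert")
tokenizer = AutoTokenizer.from_pretrained("csebuetnlp/banglabert")

text = normalize("আমি বিদ্যালয়ে যাই ।")  # apply the pretraining normalization first
print(tokenizer.tokenize(text))          # then tokenize the normalized text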

Datasets

We are also releasing the Bangla Natural Language Inference (NLI) and Bangla Question Answering (QA) datasets introduced in the paper.

Please fill out this Google Form to request access to the Bangla2B+ pretraining corpus.

Setup

To install the necessary requirements, use the following bash snippet:

$ git clone https://github.com/csebuetnlp/banglabert
$ cd banglabert/
$ conda create python==3.7.9 pytorch==1.8.1 torchvision==0.9.1 torchaudio==0.8.0 cudatoolkit=10.2 -c pytorch -p ./env
$ conda activate ./env # or source activate ./env (for older versions of anaconda)
$ bash setup.sh 
  • Use the newly created environment for running the scripts in this repository.

Training & Evaluation

To use the pretrained models for fine-tuning / inference on different downstream tasks, see the following sections (an example invocation is sketched after the list):

  • Sequence Classification.
    • For single sequence classification such as
      • Document classification
      • Sentiment classification
      • Emotion classification etc.
    • For double sequence classification such as
      • Natural Language Inference (NLI)
      • Paraphrase detection etc.
  • Token Classification.
    • For token tagging / classification tasks such as
      • Named Entity Recognition (NER)
      • Parts of Speech Tagging (PoS) etc.
  • Question Answering.
    • For tasks such as
      • Extractive Question Answering
      • Open-domain Question Answering
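
As an illustration, a sequence classification fine-tuning run on the bundled sample inputs can be launched as follows (a representative invocation; adjust paths and hyperparameters to your task and hardware):

$ python ./sequence_classification/sequence_classification.py \
    --overwrite_output_dir --model_name_or_path "csebuetnlp/banglabert" \
    --dataset_dir "./sequence_classification/sample_inputs/single_sequence/jsonl" \
    --output_dir "./sequence_classification/outputs/" \
    --learning_rate=2e-5 --warmup_ratio 0.1 --gradient_accumulation_steps 2 \
    --weight_decay 0.1 --lr_scheduler_type "linear" \
    --per_device_train_batch_size=16 --per_device_eval_batch_size=16 \
    --max_seq_length 512 --logging_strategy "epoch" --save_strategy "epoch" \
    --evaluation_strategy "epoch" --num_train_epochs=3 --do_train --do_eval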

Benchmarks

  • Zero-shot cross-lingual transfer-learning
| Model | Params | SC (macro-F1) | NLI (accuracy) | NER (micro-F1) | QA (EM/F1) | BangLUE score |
|---|---|---|---|---|---|---|
| mBERT | 180M | 27.05 | 62.22 | 39.27 | 59.01/64.18 | 50.35 |
| XLM-R (base) | 270M | 42.03 | 72.18 | 45.37 | 55.03/61.83 | 55.29 |
| XLM-R (large) | 550M | 49.49 | 78.13 | 56.48 | 71.13/77.70 | 66.59 |
| BanglishBERT | 110M | 48.39 | 75.26 | 55.56 | 72.87/78.63 | 66.14 |
  • Supervised fine-tuning
| Model | Params | SC (macro-F1) | NLI (accuracy) | NER (micro-F1) | QA (EM/F1) | BangLUE score |
|---|---|---|---|---|---|---|
| mBERT | 180M | 67.59 | 75.13 | 68.97 | 67.12/72.64 | 70.29 |
| XLM-R (base) | 270M | 69.54 | 78.46 | 73.32 | 68.09/74.27 | 72.82 |
| XLM-R (large) | 550M | 70.97 | 82.40 | 78.39 | 73.15/79.06 | 76.79 |
| sahajBERT | 18M | 71.12 | 76.92 | 70.94 | 65.48/70.69 | 71.03 |
| BanglishBERT | 110M | 70.61 | 80.95 | 76.28 | 72.43/78.40 | 75.73 |
| BanglaBERT (small) | 13M | 69.29 | 76.75 | 73.41 | 63.30/69.65 | 70.38 |
| BanglaBERT | 110M | 72.89 | 82.80 | 77.78 | 72.63/79.34 | 77.09 |
| BanglaBERT (large) | 335M | 71.94 | 83.41 | 79.20 | 76.10/81.50 | 78.43 |

The benchmarking datasets cover four tasks: sequence classification (SC), natural language inference (NLI), named entity recognition (NER), and question answering (QA); the NLI and QA datasets are the ones introduced in the paper (see Datasets above).

Acknowledgements

We would like to thank Intelligent Machines and the Google TFRC Program for providing cloud support for pretraining the models.

License

Contents of this repository are restricted to non-commercial research purposes only under the Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License (CC BY-NC-SA 4.0).


Citation

If you use any of the datasets, models or code modules, please cite the following paper:

@inproceedings{bhattacharjee-etal-2022-banglabert,
    title = "{B}angla{BERT}: Language Model Pretraining and Benchmarks for Low-Resource Language Understanding Evaluation in {B}angla",
    author = "Bhattacharjee, Abhik  and
      Hasan, Tahmid  and
      Ahmad, Wasi  and
      Mubasshir, Kazi Samin  and
      Islam, Md Saiful  and
      Iqbal, Anindya  and
      Rahman, M. Sohel  and
      Shahriyar, Rifat",
    booktitle = "Findings of the Association for Computational Linguistics: NAACL 2022",
    month = jul,
    year = "2022",
    address = "Seattle, United States",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2022.findings-naacl.98",
    pages = "1318--1327",
    abstract = "In this work, we introduce BanglaBERT, a BERT-based Natural Language Understanding (NLU) model pretrained in Bangla, a widely spoken yet low-resource language in the NLP literature. To pretrain BanglaBERT, we collect 27.5 GB of Bangla pretraining data (dubbed {`}Bangla2B+{'}) by crawling 110 popular Bangla sites. We introduce two downstream task datasets on natural language inference and question answering and benchmark on four diverse NLU tasks covering text classification, sequence labeling, and span prediction. In the process, we bring them under the first-ever Bangla Language Understanding Benchmark (BLUB). BanglaBERT achieves state-of-the-art results outperforming multilingual and monolingual models. We are making the models, datasets, and a leaderboard publicly available at \url{https://github.com/csebuetnlp/banglabert} to advance Bangla NLP.",
}

Issues

BanglaBERT is not loading due to a problem in the config.json file

Hi,
I have tried to use your BanglaBERT model from Hugging Face, using the code below. The error message indicates that there is some problem in your config.json file. Can you please fix this issue?

!pip install git+https://github.com/csebuetnlp/normalizer
from transformers import AutoModelForPreTraining, AutoTokenizer
from normalizer import normalize
import torch

model = AutoModelForPreTraining.from_pretrained("csebuetnlp/banglabert")
tokenizer_bbert = AutoTokenizer.from_pretrained("csebuetnlp/banglabert")

text = "আমি বিদ্যালয়ে যাই ।"
text = normalize(text)

tokenizer_bbert.tokenize(text)

(screenshot of the error attached)

The websites the dataset was scraped from?

As the Alexa Web rankings (https://www.alexa.com/topsites/countries/BD) shut down in May 2022, it is not possible to retrieve the names of the Bangladeshi websites used.

It would be really useful if the names of the fifty Bangladeshi websites used to scrape the dataset could be released. It would help in understanding the nature of the dataset used to train the model, and would also help with model interpretability experiments.

UnicodeEncodeError: 'charmap' codec can't encode characters in position 1210-1213: character maps to <undefined>

I am trying to run the example evaluation given in this repo on my local machine, but this is the error I am getting. What can I do? Below is the whole log.

$python ./question_answering/question_answering.py --model_name_or_path "csebuetnlp/banglabert" --dataset_dir "./question_answering/sample_inputs/" --output_dir "./question_answering/outputs/" --per_device_eval_batch_size=24 --overwrite_output_dir --do_predict
D:\Work\anaconda3\envs\banglabert\lib\site-packages\torch\cuda\__init__.py:104: UserWarning:
NVIDIA GeForce RTX 3050 Laptop GPU with CUDA capability sm_86 is not compatible with the current PyTorch installation.
The current PyTorch install supports CUDA capabilities sm_37 sm_50 sm_60 sm_61 sm_70 sm_75 compute_37.
If you want to use the NVIDIA GeForce RTX 3050 Laptop GPU GPU with PyTorch, please check the instructions at https://pytorch.org/get-started/locally/

warnings.warn(incompatible_device_warn.format(device_name, capability, " ".join(arch_list), device_name))
10/02/2023 23:42:15 - WARNING - main - Process rank: -1, device: cuda:0, n_gpu: 1distributed training: False, 16-bits training: False
10/02/2023 23:42:15 - INFO - main - Training/evaluation parameters TrainingArguments(
_n_gpu=1,
adafactor=False,
adam_beta1=0.9,
adam_beta2=0.999,
adam_epsilon=1e-08,
dataloader_drop_last=False,
dataloader_num_workers=0,
dataloader_pin_memory=True,
ddp_find_unused_parameters=None,
debug=[],
deepspeed=None,
disable_tqdm=False,
do_eval=False,
do_predict=True,
do_train=False,
eval_accumulation_steps=None,
eval_steps=None,
evaluation_strategy=IntervalStrategy.NO,
fp16=False,
fp16_backend=auto,
fp16_full_eval=False,
fp16_opt_level=O1,
gradient_accumulation_steps=1,
greater_is_better=None,
group_by_length=False,
ignore_data_skip=False,
label_names=None,
label_smoothing_factor=0.0,
learning_rate=5e-05,
length_column_name=length,
load_best_model_at_end=False,
local_rank=-1,
log_level=-1,
log_level_replica=-1,
log_on_each_node=True,
logging_dir=./question_answering/outputs/runs\Oct02_23-42-14_WIN-849LR3KF8F4,
logging_first_step=False,
logging_steps=500,
logging_strategy=IntervalStrategy.STEPS,
lr_scheduler_type=SchedulerType.LINEAR,
max_grad_norm=1.0,
max_steps=-1,
metric_for_best_model=None,
mp_parameters=,
no_cuda=False,
num_train_epochs=3.0,
output_dir=./question_answering/outputs/,
overwrite_output_dir=True,
past_index=-1,
per_device_eval_batch_size=24,
per_device_train_batch_size=8,
prediction_loss_only=False,
push_to_hub=False,
push_to_hub_model_id=outputs,
push_to_hub_organization=None,
push_to_hub_token=None,
remove_unused_columns=True,
report_to=[],
resume_from_checkpoint=None,
run_name=./question_answering/outputs/,
save_on_each_node=False,
save_steps=500,
save_strategy=IntervalStrategy.STEPS,
save_total_limit=None,
seed=42,
sharded_ddp=[],
skip_memory_metrics=True,
tpu_metrics_debug=False,
tpu_num_cores=None,
use_legacy_prediction_loop=False,
warmup_ratio=0.0,
warmup_steps=0,
weight_decay=0.0,
)
10/02/2023 23:42:15 - WARNING - datasets.builder - Using custom data configuration default-431d55e0f1c961a4
10/02/2023 23:42:15 - INFO - datasets.builder - Overwrite dataset info from restored data version.
10/02/2023 23:42:15 - INFO - datasets.info - Loading Dataset info from C:\Users\Administrator.cache\huggingface\datasets\qa_dataset_builder\default-431d55e0f1c961a4\0.0.0
10/02/2023 23:42:15 - WARNING - datasets.builder - Reusing dataset qa_dataset_builder (C:\Users\Administrator.cache\huggingface\datasets\qa_dataset_builder\default-431d55e0f1c961a4\0.0.0)
10/02/2023 23:42:15 - INFO - datasets.info - Loading Dataset info from C:\Users\Administrator.cache\huggingface\datasets\qa_dataset_builder\default-431d55e0f1c961a4\0.0.0
100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 3/3 [00:00<00:00, 57.23it/s]
[INFO|configuration_utils.py:561] 2023-10-02 23:42:16,323 >> loading configuration file https://huggingface.co/csebuetnlp/banglabert/resolve/main/config.json from cache at C:\Users\Administrator/.cache\huggingface\transformers\60928dc4b87f5881692890e6541e6538f91588d2ea40cbbbdc04cfb2cb83a6b1.2388211ba94f448fcf40aef3c9526142a8c2f2a8fb4fce8a3801462f51b2bab5
[INFO|configuration_utils.py:598] 2023-10-02 23:42:16,326 >> Model config ElectraConfig {
"architectures": [
"ElectraForPreTraining"
],
"attention_probs_dropout_prob": 0.1,
"classifier_dropout": null,
"embedding_size": 768,
"hidden_act": "gelu",
"hidden_dropout_prob": 0.1,
"hidden_size": 768,
"initializer_range": 0.02,
"intermediate_size": 3072,
"layer_norm_eps": 1e-12,
"max_position_embeddings": 512,
"model_type": "electra",
"num_attention_heads": 12,
"num_hidden_layers": 12,
"pad_token_id": 0,
"position_embedding_type": "absolute",
"summary_activation": "gelu",
"summary_last_dropout": 0.1,
"summary_type": "first",
"summary_use_proj": true,
"transformers_version": "4.11.0.dev0",
"type_vocab_size": 2,
"vocab_size": 32000
}

[INFO|configuration_utils.py:561] 2023-10-02 23:42:17,541 >> loading configuration file https://huggingface.co/csebuetnlp/banglabert/resolve/main/config.json from cache at C:\Users\Administrator/.cache\huggingface\transformers\60928dc4b87f5881692890e6541e6538f91588d2ea40cbbbdc04cfb2cb83a6b1.2388211ba94f448fcf40aef3c9526142a8c2f2a8fb4fce8a3801462f51b2bab5
[INFO|configuration_utils.py:598] 2023-10-02 23:42:17,544 >> Model config ElectraConfig {
"architectures": [
"ElectraForPreTraining"
],
"attention_probs_dropout_prob": 0.1,
"classifier_dropout": null,
"embedding_size": 768,
"hidden_act": "gelu",
"hidden_dropout_prob": 0.1,
"hidden_size": 768,
"initializer_range": 0.02,
"intermediate_size": 3072,
"layer_norm_eps": 1e-12,
"max_position_embeddings": 512,
"model_type": "electra",
"num_attention_heads": 12,
"num_hidden_layers": 12,
"pad_token_id": 0,
"position_embedding_type": "absolute",
"summary_activation": "gelu",
"summary_last_dropout": 0.1,
"summary_type": "first",
"summary_use_proj": true,
"transformers_version": "4.11.0.dev0",
"type_vocab_size": 2,
"vocab_size": 32000
}

[INFO|tokenization_utils_base.py:1739] 2023-10-02 23:42:23,760 >> loading file https://huggingface.co/csebuetnlp/banglabert/resolve/main/vocab.txt from cache at C:\Users\Administrator/.cache\huggingface\transformers\65e95b847336b6bf69b37fdb8682a97e822799adcd9745dcf9bf44cfe4db1b9a.8f92ca2cf7e2eaa550b10c40331ae9bf0f2e40abe3b549f66a3d7f13bfc6de47
[INFO|tokenization_utils_base.py:1739] 2023-10-02 23:42:23,760 >> loading file https://huggingface.co/csebuetnlp/banglabert/resolve/main/tokenizer.json from cache at None
[INFO|tokenization_utils_base.py:1739] 2023-10-02 23:42:23,760 >> loading file https://huggingface.co/csebuetnlp/banglabert/resolve/main/added_tokens.json from cache at None
[INFO|tokenization_utils_base.py:1739] 2023-10-02 23:42:23,761 >> loading file https://huggingface.co/csebuetnlp/banglabert/resolve/main/special_tokens_map.json from cache at C:\Users\Administrator/.cache\huggingface\transformers\7820dfc553e8dfb8a1e82042b7d0d691c7a7cd1e30ed2974218f696e81c5f3b1.dd8bd9bfd3664b530ea4e645105f557769387b3da9f79bdb55ed556bdd80611d
[INFO|tokenization_utils_base.py:1739] 2023-10-02 23:42:23,761 >> loading file https://huggingface.co/csebuetnlp/banglabert/resolve/main/tokenizer_config.json from cache at C:\Users\Administrator/.cache\huggingface\transformers\76fa87a0ec9c34c9b15732bf7e06bced447feff46287b8e7d246a55d301784d7.b4f59cefeba4296760d2cf1037142788b96f2be40230bf6393d2fba714562485
[INFO|configuration_utils.py:561] 2023-10-02 23:42:25,458 >> loading configuration file https://huggingface.co/csebuetnlp/banglabert/resolve/main/config.json from cache at C:\Users\Administrator/.cache\huggingface\transformers\60928dc4b87f5881692890e6541e6538f91588d2ea40cbbbdc04cfb2cb83a6b1.2388211ba94f448fcf40aef3c9526142a8c2f2a8fb4fce8a3801462f51b2bab5
[INFO|configuration_utils.py:598] 2023-10-02 23:42:25,460 >> Model config ElectraConfig {
"architectures": [
"ElectraForPreTraining"
],
"attention_probs_dropout_prob": 0.1,
"classifier_dropout": null,
"embedding_size": 768,
"hidden_act": "gelu",
"hidden_dropout_prob": 0.1,
"hidden_size": 768,
"initializer_range": 0.02,
"intermediate_size": 3072,
"layer_norm_eps": 1e-12,
"max_position_embeddings": 512,
"model_type": "electra",
"num_attention_heads": 12,
"num_hidden_layers": 12,
"pad_token_id": 0,
"position_embedding_type": "absolute",
"summary_activation": "gelu",
"summary_last_dropout": 0.1,
"summary_type": "first",
"summary_use_proj": true,
"transformers_version": "4.11.0.dev0",
"type_vocab_size": 2,
"vocab_size": 32000
}

[INFO|configuration_utils.py:561] 2023-10-02 23:42:26,158 >> loading configuration file https://huggingface.co/csebuetnlp/banglabert/resolve/main/config.json from cache at C:\Users\Administrator/.cache\huggingface\transformers\60928dc4b87f5881692890e6541e6538f91588d2ea40cbbbdc04cfb2cb83a6b1.2388211ba94f448fcf40aef3c9526142a8c2f2a8fb4fce8a3801462f51b2bab5
[INFO|configuration_utils.py:598] 2023-10-02 23:42:26,161 >> Model config ElectraConfig {
"architectures": [
"ElectraForPreTraining"
],
"attention_probs_dropout_prob": 0.1,
"classifier_dropout": null,
"embedding_size": 768,
"hidden_act": "gelu",
"hidden_dropout_prob": 0.1,
"hidden_size": 768,
"initializer_range": 0.02,
"intermediate_size": 3072,
"layer_norm_eps": 1e-12,
"max_position_embeddings": 512,
"model_type": "electra",
"num_attention_heads": 12,
"num_hidden_layers": 12,
"pad_token_id": 0,
"position_embedding_type": "absolute",
"summary_activation": "gelu",
"summary_last_dropout": 0.1,
"summary_type": "first",
"summary_use_proj": true,
"transformers_version": "4.11.0.dev0",
"type_vocab_size": 2,
"vocab_size": 32000
}

[INFO|file_utils.py:1665] 2023-10-02 23:42:26,817 >> https://huggingface.co/csebuetnlp/banglabert/resolve/main/pytorch_model.bin not found in cache or force_download set to True, downloading to C:\Users\Administrator.cache\huggingface\transformers\tmpe23b9o4n
Downloading: 100%|██████████████████████████████████████████████████████████████████████████████████████████████| 443M/443M [05:26<00:00, 1.36MB/s]
[INFO|file_utils.py:1669] 2023-10-02 23:47:54,041 >> storing https://huggingface.co/csebuetnlp/banglabert/resolve/main/pytorch_model.bin in cache at C:\Users\Administrator/.cache\huggingface\transformers\913ea71768a80ccdde3a9ab9a88cf2a93f37a52008896997655d0f63b0d0743a.8aaedac281b72dbb5296319c53be5a4e4a52339eded3f68d49201e140e221615
[INFO|file_utils.py:1677] 2023-10-02 23:47:54,042 >> creating metadata file for C:\Users\Administrator/.cache\huggingface\transformers\913ea71768a80ccdde3a9ab9a88cf2a93f37a52008896997655d0f63b0d0743a.8aaedac281b72dbb5296319c53be5a4e4a52339eded3f68d49201e140e221615
[INFO|modeling_utils.py:1279] 2023-10-02 23:47:54,044 >> loading weights file https://huggingface.co/csebuetnlp/banglabert/resolve/main/pytorch_model.bin from cache at C:\Users\Administrator/.cache\huggingface\transformers\913ea71768a80ccdde3a9ab9a88cf2a93f37a52008896997655d0f63b0d0743a.8aaedac281b72dbb5296319c53be5a4e4a52339eded3f68d49201e140e221615
[WARNING|modeling_utils.py:1516] 2023-10-02 23:47:54,822 >> Some weights of the model checkpoint at csebuetnlp/banglabert were not used when initializing ElectraForQuestionAnswering: ['discriminator_predictions.dense_prediction.weight', 'discriminator_predictions.dense.weight', 'discriminator_predictions.dense.bias', 'discriminator_predictions.dense_prediction.bias']

  • This IS expected if you are initializing ElectraForQuestionAnswering from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
  • This IS NOT expected if you are initializing ElectraForQuestionAnswering from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
    [WARNING|modeling_utils.py:1527] 2023-10-02 23:47:54,823 >> Some weights of ElectraForQuestionAnswering were not initialized from the model checkpoint at csebuetnlp/banglabert and are newly initialized: ['qa_outputs.bias', 'qa_outputs.weight']
    You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
    10/02/2023 23:47:54 - WARNING - datasets.fingerprint - Parameter 'function'=<function main..normalize_example at 0x00000239F12A9EE8> of the transform datasets.arrow_dataset.Dataset._map_single couldn't be hashed properly, a random hash was used instead. Make sure your transforms and parameters are serializable with pickle or dill for the dataset fingerprinting and caching to work. If you reuse this transform, the caching mechanism will consider it to be different from the previous calls and recompute everything. This warning is only showed once. Subsequent hashing failures won't be showed.
    10/02/2023 23:47:54 - WARNING - datasets.arrow_dataset - Loading cached processed dataset at C:\Users\Administrator.cache\huggingface\datasets\qa_dataset_builder\default-431d55e0f1c961a4\0.0.0\cache-1c80317fa3b1799d.arrow
    10/02/2023 23:47:54 - INFO - datasets.fingerprint - Parameter 'function'=<function main..normalize_example at 0x00000239F12A9EE8> of the transform datasets.arrow_dataset.Dataset._map_single couldn't be hashed properly, a random hash was used instead.
    10/02/2023 23:47:54 - WARNING - datasets.arrow_dataset - Loading cached processed dataset at C:\Users\Administrator.cache\huggingface\datasets\qa_dataset_builder\default-431d55e0f1c961a4\0.0.0\cache-bdd640fb06671ad1.arrow
    10/02/2023 23:47:54 - INFO - datasets.fingerprint - Parameter 'function'=<function main..normalize_example at 0x00000239F12A9EE8> of the transform datasets.arrow_dataset.Dataset._map_single couldn't be hashed properly, a random hash was used instead.
    10/02/2023 23:47:54 - WARNING - datasets.arrow_dataset - Loading cached processed dataset at C:\Users\Administrator.cache\huggingface\datasets\qa_dataset_builder\default-431d55e0f1c961a4\0.0.0\cache-3eb13b9046685257.arrow
    10/02/2023 23:47:54 - INFO - datasets.fingerprint - Parameter 'function'=<function main..prepare_validation_features at 0x00000239F11B7828> of the transform datasets.arrow_dataset.Dataset._map_single couldn't be hashed properly, a random hash was used instead.
    10/02/2023 23:47:54 - WARNING - datasets.arrow_dataset - Loading cached processed dataset at C:\Users\Administrator.cache\huggingface\datasets\qa_dataset_builder\default-431d55e0f1c961a4\0.0.0\cache-23b8c1e9392456de.arrow
    10/02/2023 23:47:58 - INFO - datasets.load - Found main folder for metric https://raw.githubusercontent.com/huggingface/datasets/1.11.0/metrics/squad/squad.py at C:\Users\Administrator.cache\huggingface\modules\datasets_modules\metrics\squad
    10/02/2023 23:47:58 - INFO - datasets.load - Found specific version folder for metric https://raw.githubusercontent.com/huggingface/datasets/1.11.0/metrics/squad/squad.py at C:\Users\Administrator.cache\huggingface\modules\datasets_modules\metrics\squad\513bf9facd7f12b0871a3d74c6999c866ce28196c9cdb151dcf934848655d77e
    10/02/2023 23:47:58 - INFO - datasets.load - Found script file from https://raw.githubusercontent.com/huggingface/datasets/1.11.0/metrics/squad/squad.py to C:\Users\Administrator.cache\huggingface\modules\datasets_modules\metrics\squad\513bf9facd7f12b0871a3d74c6999c866ce28196c9cdb151dcf934848655d77e\squad.py
    10/02/2023 23:47:58 - INFO - datasets.load - Couldn't find dataset infos file at https://raw.githubusercontent.com/huggingface/datasets/1.11.0/metrics/squad/dataset_infos.json
    10/02/2023 23:47:58 - INFO - datasets.load - Found metadata file for metric https://raw.githubusercontent.com/huggingface/datasets/1.11.0/metrics/squad/squad.py at C:\Users\Administrator.cache\huggingface\modules\datasets_modules\metrics\squad\513bf9facd7f12b0871a3d74c6999c866ce28196c9cdb151dcf934848655d77e\squad.json
    10/02/2023 23:47:58 - INFO - datasets.load - Found local import file from C:\Users\Administrator.cache\huggingface\datasets\downloads\188f3c7a325773b47d41f3e0f7ab9fd3cb20e597010b3a9d780c878dacc10ce3.6f69c3ff9e10aa1cbdc6e91d27e158ea86a785f54a36a9e964ef8b3b78cf3cd6.py at C:\Users\Administrator.cache\huggingface\modules\datasets_modules\metrics\squad\513bf9facd7f12b0871a3d74c6999c866ce28196c9cdb151dcf934848655d77e\evaluate.py
    10/03/2023 00:11:50 - INFO - main - *** Predict ***
    [INFO|trainer.py:521] 2023-10-03 00:11:50,847 >> The following columns in the test set don't have a corresponding argument in ElectraForQuestionAnswering.forward and have been ignored: offset_mapping, example_id.
    [INFO|trainer.py:2181] 2023-10-03 00:11:50,932 >> ***** Running Prediction *****
    [INFO|trainer.py:2183] 2023-10-03 00:11:50,932 >> Num examples = 2614
    [INFO|trainer.py:2186] 2023-10-03 00:11:50,933 >> Batch size = 24
    100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████| 109/109 [05:48<00:00, 3.16s/it]10/03/2023 00:19:02 - INFO - utils - Post-processing 2504 example predictions split into 2614 features.
    100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████| 2504/2504 [00:04<00:00, 504.38it/s]
    10/03/2023 00:19:07 - INFO - utils - Saving predictions to ./question_answering/outputs/predict_predictions.json.493/2504 [00:04<00:00, 541.44it/s]
    Traceback (most recent call last):
    File "./question_answering/question_answering.py", line 617, in <module>
    main()
    File "./question_answering/question_answering.py", line 599, in main
    results = trainer.predict(predict_dataset, predict_examples)
    File "D:\Work\anaconda3\envs\banglabert\question_answering\utils.py", line 426, in predict
    predictions = self.post_process_function(predict_examples, predict_dataset, output.predictions, "predict")
    File "./question_answering/question_answering.py", line 534, in post_processing_function
    prefix=stage,
    File "D:\Work\anaconda3\envs\banglabert\question_answering\utils.py", line 345, in postprocess_qa_predictions
    writer.write(json.dumps(all_predictions, ensure_ascii=False, indent=4) + "\n")
    File "D:\Work\anaconda3\envs\banglabert\lib\encodings\cp1252.py", line 19, in encode
    return codecs.charmap_encode(input,self.errors,encoding_table)[0]
    UnicodeEncodeError: 'charmap' codec can't encode characters in position 1210-1213: character maps to <undefined>
    100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████| 109/109 [05:55<00:00, 3.26s/it]
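
My guess (unverified) is that the predictions file is being written with the Windows default cp1252 codec. A hypothetical workaround I am considering is forcing UTF-8 when the file is opened, along these lines (the variable names below stand in for whatever the script actually uses):

import json

# hypothetical sketch (my assumption, not tested): open the predictions file as UTF-8
all_predictions = {"sample-id": "উত্তর"}        # stands in for the script's predictions dict
prediction_file = "predict_predictions.json"   # stands in for the path used in utils.py

with open(prediction_file, "w", encoding="utf-8") as writer:
    writer.write(json.dumps(all_predictions, ensure_ascii=False, indent=4) + "\n")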

Vocab File

Can you provide the vocab file used for your tokenizers? Thanks.

Error while finetuning in google colab using GPU

Hi,

I want to finetune BanglaBERT for sequence classification.

  • I face the following error while running on google colab using GPU.
  • I don't face this error while running on google colab using CPU.

This error occurred while running the following command (the sequence classification example from GitHub):

python ./sequence_classification/sequence_classification.py --overwrite_output_dir --model_name_or_path "csebuetnlp/banglabert" --dataset_dir "./sequence_classification/sample_inputs/single_sequence/jsonl" --output_dir "./sequence_classification/outputs/" --learning_rate=2e-5 --warmup_ratio 0.1 --gradient_accumulation_steps 2 --weight_decay 0.1 --lr_scheduler_type "linear" --per_device_train_batch_size=16 --per_device_eval_batch_size=16 --max_seq_length 512 --logging_strategy "epoch" --save_strategy "epoch" --evaluation_strategy "epoch" --num_train_epochs=3 --do_train --do_eval

Error Traceback:

05/08/2022 09:24:21 - WARNING - __main__ - Process rank: -1, device: cuda:0, n_gpu: 1distributed training: False, 16-bits training: False
05/08/2022 09:24:21 - INFO - __main__ - Training/evaluation parameters TrainingArguments(
_n_gpu=1,
adafactor=False,
adam_beta1=0.9,
adam_beta2=0.999,
adam_epsilon=1e-08,
dataloader_drop_last=False,
dataloader_num_workers=0,
dataloader_pin_memory=True,
ddp_find_unused_parameters=None,
debug=[],
deepspeed=None,
disable_tqdm=False,
do_eval=True,
do_predict=False,
do_train=True,
eval_accumulation_steps=None,
eval_steps=None,
evaluation_strategy=IntervalStrategy.EPOCH,
fp16=False,
fp16_backend=auto,
fp16_full_eval=False,
fp16_opt_level=O1,
gradient_accumulation_steps=2,
greater_is_better=None,
group_by_length=False,
ignore_data_skip=False,
label_names=None,
label_smoothing_factor=0.0,
learning_rate=2e-05,
length_column_name=length,
load_best_model_at_end=False,
local_rank=-1,
log_level=-1,
log_level_replica=-1,
log_on_each_node=True,
logging_dir=./sequence_classification/outputs/runs/May08_09-24-21_0da7ed02e26d,
logging_first_step=False,
logging_steps=500,
logging_strategy=IntervalStrategy.EPOCH,
lr_scheduler_type=SchedulerType.LINEAR,
max_grad_norm=1.0,
max_steps=-1,
metric_for_best_model=None,
mp_parameters=,
no_cuda=False,
num_train_epochs=3.0,
output_dir=./sequence_classification/outputs/,
overwrite_output_dir=True,
past_index=-1,
per_device_eval_batch_size=16,
per_device_train_batch_size=16,
prediction_loss_only=False,
push_to_hub=False,
push_to_hub_model_id=outputs,
push_to_hub_organization=None,
push_to_hub_token=None,
remove_unused_columns=True,
report_to=['tensorboard'],
resume_from_checkpoint=None,
run_name=./sequence_classification/outputs/,
save_on_each_node=False,
save_steps=500,
save_strategy=IntervalStrategy.EPOCH,
save_total_limit=None,
seed=42,
sharded_ddp=[],
skip_memory_metrics=True,
tpu_metrics_debug=False,
tpu_num_cores=None,
use_legacy_prediction_loop=False,
warmup_ratio=0.1,
warmup_steps=0,
weight_decay=0.1,
)
05/08/2022 09:24:21 - WARNING - datasets.builder - Using custom data configuration default-1e09c73b0f004fd6
05/08/2022 09:24:21 - INFO - datasets.builder - Overwrite dataset info from restored data version.
05/08/2022 09:24:21 - INFO - datasets.info - Loading Dataset info from /root/.cache/huggingface/datasets/json/default-1e09c73b0f004fd6/0.0.0
05/08/2022 09:24:21 - WARNING - datasets.builder - Reusing dataset json (/root/.cache/huggingface/datasets/json/default-1e09c73b0f004fd6/0.0.0)
05/08/2022 09:24:21 - INFO - datasets.info - Loading Dataset info from /root/.cache/huggingface/datasets/json/default-1e09c73b0f004fd6/0.0.0
100% 3/3 [00:00<00:00, 886.31it/s]
[INFO|configuration_utils.py:561] 2022-05-08 09:24:22,163 >> loading configuration file https://huggingface.co/csebuetnlp/banglabert/resolve/main/config.json from cache at /root/.cache/huggingface/transformers/60928dc4b87f5881692890e6541e6538f91588d2ea40cbbbdc04cfb2cb83a6b1.2388211ba94f448fcf40aef3c9526142a8c2f2a8fb4fce8a3801462f51b2bab5
[INFO|configuration_utils.py:598] 2022-05-08 09:24:22,164 >> Model config ElectraConfig {
  "architectures": [
    "ElectraForPreTraining"
  ],
  "attention_probs_dropout_prob": 0.1,
  "classifier_dropout": null,
  "embedding_size": 768,
  "hidden_act": "gelu",
  "hidden_dropout_prob": 0.1,
  "hidden_size": 768,
  "initializer_range": 0.02,
  "intermediate_size": 3072,
  "layer_norm_eps": 1e-12,
  "max_position_embeddings": 512,
  "model_type": "electra",
  "num_attention_heads": 12,
  "num_hidden_layers": 12,
  "pad_token_id": 0,
  "position_embedding_type": "absolute",
  "summary_activation": "gelu",
  "summary_last_dropout": 0.1,
  "summary_type": "first",
  "summary_use_proj": true,
  "transformers_version": "4.11.0.dev0",
  "type_vocab_size": 2,
  "vocab_size": 32000
}

[INFO|configuration_utils.py:561] 2022-05-08 09:24:23,954 >> loading configuration file https://huggingface.co/csebuetnlp/banglabert/resolve/main/config.json from cache at /root/.cache/huggingface/transformers/60928dc4b87f5881692890e6541e6538f91588d2ea40cbbbdc04cfb2cb83a6b1.2388211ba94f448fcf40aef3c9526142a8c2f2a8fb4fce8a3801462f51b2bab5
[INFO|configuration_utils.py:598] 2022-05-08 09:24:23,955 >> Model config ElectraConfig {
  "architectures": [
    "ElectraForPreTraining"
  ],
  "attention_probs_dropout_prob": 0.1,
  "classifier_dropout": null,
  "embedding_size": 768,
  "hidden_act": "gelu",
  "hidden_dropout_prob": 0.1,
  "hidden_size": 768,
  "initializer_range": 0.02,
  "intermediate_size": 3072,
  "layer_norm_eps": 1e-12,
  "max_position_embeddings": 512,
  "model_type": "electra",
  "num_attention_heads": 12,
  "num_hidden_layers": 12,
  "pad_token_id": 0,
  "position_embedding_type": "absolute",
  "summary_activation": "gelu",
  "summary_last_dropout": 0.1,
  "summary_type": "first",
  "summary_use_proj": true,
  "transformers_version": "4.11.0.dev0",
  "type_vocab_size": 2,
  "vocab_size": 32000
}

[INFO|tokenization_utils_base.py:1739] 2022-05-08 09:24:29,230 >> loading file https://huggingface.co/csebuetnlp/banglabert/resolve/main/vocab.txt from cache at /root/.cache/huggingface/transformers/65e95b847336b6bf69b37fdb8682a97e822799adcd9745dcf9bf44cfe4db1b9a.8f92ca2cf7e2eaa550b10c40331ae9bf0f2e40abe3b549f66a3d7f13bfc6de47
[INFO|tokenization_utils_base.py:1739] 2022-05-08 09:24:29,230 >> loading file https://huggingface.co/csebuetnlp/banglabert/resolve/main/added_tokens.json from cache at None
[INFO|tokenization_utils_base.py:1739] 2022-05-08 09:24:29,230 >> loading file https://huggingface.co/csebuetnlp/banglabert/resolve/main/special_tokens_map.json from cache at /root/.cache/huggingface/transformers/7820dfc553e8dfb8a1e82042b7d0d691c7a7cd1e30ed2974218f696e81c5f3b1.dd8bd9bfd3664b530ea4e645105f557769387b3da9f79bdb55ed556bdd80611d
[INFO|tokenization_utils_base.py:1739] 2022-05-08 09:24:29,230 >> loading file https://huggingface.co/csebuetnlp/banglabert/resolve/main/tokenizer_config.json from cache at /root/.cache/huggingface/transformers/76fa87a0ec9c34c9b15732bf7e06bced447feff46287b8e7d246a55d301784d7.b4f59cefeba4296760d2cf1037142788b96f2be40230bf6393d2fba714562485
[INFO|tokenization_utils_base.py:1739] 2022-05-08 09:24:29,230 >> loading file https://huggingface.co/csebuetnlp/banglabert/resolve/main/tokenizer.json from cache at None
[INFO|configuration_utils.py:561] 2022-05-08 09:24:30,126 >> loading configuration file https://huggingface.co/csebuetnlp/banglabert/resolve/main/config.json from cache at /root/.cache/huggingface/transformers/60928dc4b87f5881692890e6541e6538f91588d2ea40cbbbdc04cfb2cb83a6b1.2388211ba94f448fcf40aef3c9526142a8c2f2a8fb4fce8a3801462f51b2bab5
[INFO|configuration_utils.py:598] 2022-05-08 09:24:30,126 >> Model config ElectraConfig {
  "architectures": [
    "ElectraForPreTraining"
  ],
  "attention_probs_dropout_prob": 0.1,
  "classifier_dropout": null,
  "embedding_size": 768,
  "hidden_act": "gelu",
  "hidden_dropout_prob": 0.1,
  "hidden_size": 768,
  "initializer_range": 0.02,
  "intermediate_size": 3072,
  "layer_norm_eps": 1e-12,
  "max_position_embeddings": 512,
  "model_type": "electra",
  "num_attention_heads": 12,
  "num_hidden_layers": 12,
  "pad_token_id": 0,
  "position_embedding_type": "absolute",
  "summary_activation": "gelu",
  "summary_last_dropout": 0.1,
  "summary_type": "first",
  "summary_use_proj": true,
  "transformers_version": "4.11.0.dev0",
  "type_vocab_size": 2,
  "vocab_size": 32000
}

[INFO|modeling_utils.py:1279] 2022-05-08 09:24:31,075 >> loading weights file https://huggingface.co/csebuetnlp/banglabert/resolve/main/pytorch_model.bin from cache at /root/.cache/huggingface/transformers/913ea71768a80ccdde3a9ab9a88cf2a93f37a52008896997655d0f63b0d0743a.8aaedac281b72dbb5296319c53be5a4e4a52339eded3f68d49201e140e221615
[WARNING|modeling_utils.py:1516] 2022-05-08 09:24:32,600 >> Some weights of the model checkpoint at csebuetnlp/banglabert were not used when initializing ElectraForSequenceClassification: ['discriminator_predictions.dense.weight', 'discriminator_predictions.dense.bias', 'discriminator_predictions.dense_prediction.weight', 'discriminator_predictions.dense_prediction.bias']
- This IS expected if you are initializing ElectraForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing ElectraForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
[WARNING|modeling_utils.py:1527] 2022-05-08 09:24:32,600 >> Some weights of ElectraForSequenceClassification were not initialized from the model checkpoint at csebuetnlp/banglabert and are newly initialized: ['classifier.out_proj.bias', 'classifier.dense.bias', 'classifier.dense.weight', 'classifier.out_proj.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
05/08/2022 09:24:32 - WARNING - datasets.arrow_dataset - Loading cached processed dataset at /root/.cache/huggingface/datasets/json/default-1e09c73b0f004fd6/0.0.0/cache-c8c752bb15628b86.arrow
05/08/2022 09:24:32 - WARNING - datasets.arrow_dataset - Loading cached processed dataset at /root/.cache/huggingface/datasets/json/default-1e09c73b0f004fd6/0.0.0/cache-1d7e8a13339dd538.arrow
05/08/2022 09:24:32 - WARNING - datasets.arrow_dataset - Loading cached processed dataset at /root/.cache/huggingface/datasets/json/default-1e09c73b0f004fd6/0.0.0/cache-5b734993f8fa5b18.arrow
05/08/2022 09:24:33 - WARNING - datasets.arrow_dataset - Loading cached processed dataset at /root/.cache/huggingface/datasets/json/default-1e09c73b0f004fd6/0.0.0/cache-ae957e77cc0e01d1.arrow
05/08/2022 09:24:33 - WARNING - datasets.arrow_dataset - Loading cached processed dataset at /root/.cache/huggingface/datasets/json/default-1e09c73b0f004fd6/0.0.0/cache-ad37b78f61cc4fc6.arrow
05/08/2022 09:24:33 - WARNING - datasets.arrow_dataset - Loading cached processed dataset at /root/.cache/huggingface/datasets/json/default-1e09c73b0f004fd6/0.0.0/cache-efbe758578e42091.arrow
05/08/2022 09:24:33 - INFO - __main__ - Sample 0 of the training set: {'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1], 'input_ids': [2, 4992, 10267, 784, 27147, 415, 830, 7761, 1333, 16, 983, 12484, 825, 5083, 2893, 426, 2636, 16493, 415, 815, 2068, 795, 205, 3], 'label': 0, 'sentence1': 'যেই মাদারির পোলারা এই কাজটি করেছে, সেই সালারা অবৈধ জারপ সন্তান ছারা আর কিছুই না।', 'token_type_ids': [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]}.
05/08/2022 09:24:33 - INFO - __main__ - Sample 3 of the training set: {'attention_mask': [1, 1, 1, 1, 1, 1, 1], 'input_ids': [2, 10634, 5452, 817, 972, 6037, 3], 'label': 0, 'sentence1': 'মুসা কপা\u200cলে কি আ\u200cছে জা\u200cনিনা', 'token_type_ids': [0, 0, 0, 0, 0, 0, 0]}.
05/08/2022 09:24:33 - INFO - __main__ - Sample 1 of the training set: {'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1], 'input_ids': [2, 2157, 18812, 16332, 12062, 16135, 1292, 3], 'label': 0, 'sentence1': 'ভারতের কুখ্যাত ষড়যন্ত্রের মুখোশ উন্মোচন হলো', 'token_type_ids': [0, 0, 0, 0, 0, 0, 0, 0]}.
05/08/2022 09:24:35 - INFO - datasets.load - Found main folder for metric https://raw.githubusercontent.com/huggingface/datasets/1.11.0/metrics/accuracy/accuracy.py at /root/.cache/huggingface/modules/datasets_modules/metrics/accuracy
05/08/2022 09:24:35 - INFO - datasets.load - Found specific version folder for metric https://raw.githubusercontent.com/huggingface/datasets/1.11.0/metrics/accuracy/accuracy.py at /root/.cache/huggingface/modules/datasets_modules/metrics/accuracy/6dba4616f6b2bbd19659d1db3a48cc001c8f13a10cdc73a2641a55f7a60b7b5b
05/08/2022 09:24:35 - INFO - datasets.load - Found script file from https://raw.githubusercontent.com/huggingface/datasets/1.11.0/metrics/accuracy/accuracy.py to /root/.cache/huggingface/modules/datasets_modules/metrics/accuracy/6dba4616f6b2bbd19659d1db3a48cc001c8f13a10cdc73a2641a55f7a60b7b5b/accuracy.py
05/08/2022 09:24:35 - INFO - datasets.load - Couldn't find dataset infos file at https://raw.githubusercontent.com/huggingface/datasets/1.11.0/metrics/accuracy/dataset_infos.json
05/08/2022 09:24:35 - INFO - datasets.load - Found metadata file for metric https://raw.githubusercontent.com/huggingface/datasets/1.11.0/metrics/accuracy/accuracy.py at /root/.cache/huggingface/modules/datasets_modules/metrics/accuracy/6dba4616f6b2bbd19659d1db3a48cc001c8f13a10cdc73a2641a55f7a60b7b5b/accuracy.json
05/08/2022 09:24:36 - INFO - datasets.load - Found main folder for metric https://raw.githubusercontent.com/huggingface/datasets/1.11.0/metrics/precision/precision.py at /root/.cache/huggingface/modules/datasets_modules/metrics/precision
05/08/2022 09:24:36 - INFO - datasets.load - Found specific version folder for metric https://raw.githubusercontent.com/huggingface/datasets/1.11.0/metrics/precision/precision.py at /root/.cache/huggingface/modules/datasets_modules/metrics/precision/94709a71c6fe37171ef49d3466fec24dee9a79846c9f176dff66a649e9811690
05/08/2022 09:24:36 - INFO - datasets.load - Found script file from https://raw.githubusercontent.com/huggingface/datasets/1.11.0/metrics/precision/precision.py to /root/.cache/huggingface/modules/datasets_modules/metrics/precision/94709a71c6fe37171ef49d3466fec24dee9a79846c9f176dff66a649e9811690/precision.py
05/08/2022 09:24:36 - INFO - datasets.load - Couldn't find dataset infos file at https://raw.githubusercontent.com/huggingface/datasets/1.11.0/metrics/precision/dataset_infos.json
05/08/2022 09:24:36 - INFO - datasets.load - Found metadata file for metric https://raw.githubusercontent.com/huggingface/datasets/1.11.0/metrics/precision/precision.py at /root/.cache/huggingface/modules/datasets_modules/metrics/precision/94709a71c6fe37171ef49d3466fec24dee9a79846c9f176dff66a649e9811690/precision.json
05/08/2022 09:24:38 - INFO - datasets.load - Found main folder for metric https://raw.githubusercontent.com/huggingface/datasets/1.11.0/metrics/recall/recall.py at /root/.cache/huggingface/modules/datasets_modules/metrics/recall
05/08/2022 09:24:38 - INFO - datasets.load - Found specific version folder for metric https://raw.githubusercontent.com/huggingface/datasets/1.11.0/metrics/recall/recall.py at /root/.cache/huggingface/modules/datasets_modules/metrics/recall/1e3b93e2bed42e1498e628f161d79ee019dd8e78d36985d3c7ecbc018adf35e8
05/08/2022 09:24:38 - INFO - datasets.load - Found script file from https://raw.githubusercontent.com/huggingface/datasets/1.11.0/metrics/recall/recall.py to /root/.cache/huggingface/modules/datasets_modules/metrics/recall/1e3b93e2bed42e1498e628f161d79ee019dd8e78d36985d3c7ecbc018adf35e8/recall.py
05/08/2022 09:24:38 - INFO - datasets.load - Couldn't find dataset infos file at https://raw.githubusercontent.com/huggingface/datasets/1.11.0/metrics/recall/dataset_infos.json
05/08/2022 09:24:38 - INFO - datasets.load - Found metadata file for metric https://raw.githubusercontent.com/huggingface/datasets/1.11.0/metrics/recall/recall.py at /root/.cache/huggingface/modules/datasets_modules/metrics/recall/1e3b93e2bed42e1498e628f161d79ee019dd8e78d36985d3c7ecbc018adf35e8/recall.json
05/08/2022 09:24:39 - INFO - datasets.load - Found main folder for metric https://raw.githubusercontent.com/huggingface/datasets/1.11.0/metrics/f1/f1.py at /root/.cache/huggingface/modules/datasets_modules/metrics/f1
05/08/2022 09:24:39 - INFO - datasets.load - Found specific version folder for metric https://raw.githubusercontent.com/huggingface/datasets/1.11.0/metrics/f1/f1.py at /root/.cache/huggingface/modules/datasets_modules/metrics/f1/6c86fddbf90432b9c43a8c38c62a0dd9de63bad2ef0a896f9ae20273d6d6f6e9
05/08/2022 09:24:39 - INFO - datasets.load - Found script file from https://raw.githubusercontent.com/huggingface/datasets/1.11.0/metrics/f1/f1.py to /root/.cache/huggingface/modules/datasets_modules/metrics/f1/6c86fddbf90432b9c43a8c38c62a0dd9de63bad2ef0a896f9ae20273d6d6f6e9/f1.py
05/08/2022 09:24:39 - INFO - datasets.load - Couldn't find dataset infos file at https://raw.githubusercontent.com/huggingface/datasets/1.11.0/metrics/f1/dataset_infos.json
05/08/2022 09:24:39 - INFO - datasets.load - Found metadata file for metric https://raw.githubusercontent.com/huggingface/datasets/1.11.0/metrics/f1/f1.py at /root/.cache/huggingface/modules/datasets_modules/metrics/f1/6c86fddbf90432b9c43a8c38c62a0dd9de63bad2ef0a896f9ae20273d6d6f6e9/f1.json
[INFO|trainer.py:521] 2022-05-08 09:24:43,888 >> The following columns in the training set  don't have a corresponding argument in `ElectraForSequenceClassification.forward` and have been ignored: sentence1.
[INFO|trainer.py:1168] 2022-05-08 09:24:43,900 >> ***** Running training *****
[INFO|trainer.py:1169] 2022-05-08 09:24:43,900 >>   Num examples = 4
[INFO|trainer.py:1170] 2022-05-08 09:24:43,900 >>   Num Epochs = 3
[INFO|trainer.py:1171] 2022-05-08 09:24:43,900 >>   Instantaneous batch size per device = 16
[INFO|trainer.py:1172] 2022-05-08 09:24:43,900 >>   Total train batch size (w. parallel, distributed & accumulation) = 32
[INFO|trainer.py:1173] 2022-05-08 09:24:43,900 >>   Gradient Accumulation steps = 2
[INFO|trainer.py:1174] 2022-05-08 09:24:43,900 >>   Total optimization steps = 3
  0% 0/3 [00:00<?, ?it/s]Traceback (most recent call last):
  File "./sequence_classification/sequence_classification.py", line 479, in <module>
    main()
  File "./sequence_classification/sequence_classification.py", line 426, in main
    train_result = trainer.train(resume_from_checkpoint=checkpoint)
  File "/usr/local/lib/python3.7/dist-packages/transformers/trainer.py", line 1284, in train
    tr_loss += self.training_step(model, inputs)
  File "/usr/local/lib/python3.7/dist-packages/transformers/trainer.py", line 1789, in training_step
    loss = self.compute_loss(model, inputs)
  File "/usr/local/lib/python3.7/dist-packages/transformers/trainer.py", line 1821, in compute_loss
    outputs = model(**inputs)
  File "/usr/local/lib/python3.7/dist-packages/torch/nn/modules/module.py", line 1110, in _call_impl
    return forward_call(*input, **kwargs)
  File "/usr/local/lib/python3.7/dist-packages/transformers/models/electra/modeling_electra.py", line 973, in forward
    return_dict,
  File "/usr/local/lib/python3.7/dist-packages/torch/nn/modules/module.py", line 1110, in _call_impl
    return forward_call(*input, **kwargs)
  File "/usr/local/lib/python3.7/dist-packages/transformers/models/electra/modeling_electra.py", line 879, in forward
    input_ids=input_ids, position_ids=position_ids, token_type_ids=token_type_ids, inputs_embeds=inputs_embeds
  File "/usr/local/lib/python3.7/dist-packages/torch/nn/modules/module.py", line 1110, in _call_impl
    return forward_call(*input, **kwargs)
  File "/usr/local/lib/python3.7/dist-packages/transformers/models/electra/modeling_electra.py", line 206, in forward
    inputs_embeds = self.word_embeddings(input_ids)
  File "/usr/local/lib/python3.7/dist-packages/torch/nn/modules/module.py", line 1110, in _call_impl
    return forward_call(*input, **kwargs)
  File "/usr/local/lib/python3.7/dist-packages/torch/nn/modules/sparse.py", line 160, in forward
    self.norm_type, self.scale_grad_by_freq, self.sparse)
  File "/usr/local/lib/python3.7/dist-packages/torch/nn/functional.py", line 2183, in embedding
    return torch.embedding(weight, input, padding_idx, scale_grad_by_freq, sparse)

RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cuda:0 and cpu! (when checking argument for argument index in method wrapper__index_select)
  0% 0/3 [00:00<?, ?it/s]

A probable solution from the PyTorch discussion forum, which I can't figure out how to apply: https://discuss.pytorch.org/t/code-that-loads-sgd-fails-to-load-adam-state-to-gpu/61783/3?u=shaibagon

Thanks.

Can I extract word embeddings using BanglaBERT ?

Hi,
Is it possible to extract/generate word embeddings using BanglaBERT?
I have tokenized my Bangla sentence using BanglaBERT. Now I want to generate Word Embeddings from my tokenized sentence.

!pip install transformers
!pip install git+https://github.com/csebuetnlp/normalizer

from transformers import AutoModelForPreTraining, AutoTokenizer
from normalizer import normalize
import torch

model = AutoModelForPreTraining.from_pretrained("csebuetnlp/banglabert")
tokenizer_bbert = AutoTokenizer.from_pretrained("csebuetnlp/banglabert")


text = 'দেশদ্রোহিতার মামলা স্বর্ণ মন্দিরের ভিতর ও বৈশাখী উৎসবের মিছিলে খলিস্তানপন্থী স্লোগান দেওয়ার জন্য কয়েকজন বিশ্ব যুবকের বিরুদ্ধে দেশদ্রোহিতার মামলা দায়ের করা হয়েছে ।'

text = normalize(text)

text = tokenizer_bbert.tokenize(text)

print(text)

# >>  ['দেশদ্রোহ', '##িতার', 'মামলা', 'স্বর্ণ', 'মন্দিরের', 'ভিতর', 'ও', 'বৈশাখী', 'উৎসবের', 'মিছিলে', 'খলি', '##স্তান', '##পন্থী', 'স্লোগান', 'দেওয়ার','জন্য', 'কয়েকজন', 'বিশ্ব', 'যুবকের', 'বিরুদ্ধে', 'দেশদ্রোহ', '##িতার', 'মামলা', 'দায়ের', 'করা', 'হয়েছে', '।']

I have found out how to generate word embeddings using BERT; here is the link: https://discuss.huggingface.co/t/generate-raw-word-embeddings-using-transformer-models-like-bert-for-downstream-process/2958.
Will it be the same for BanglaBERT / the Bangla language, or would it be better to use a different, Bangla-specific approach?
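
For reference, a minimal sketch of what I have in mind (my assumption: load the bare encoder with AutoModel and use its last hidden state as contextual token embeddings):

from transformers import AutoModel, AutoTokenizer
from normalizer import normalize
import torch

# AutoModel loads the ELECTRA encoder without the pretraining head
model = AutoModel.from_pretrained("csebuetnlp/banglabert")
tokenizer = AutoTokenizer.from_pretrained("csebuetnlp/banglabert")

text = normalize("আমি বিদ্যালয়ে যাই ।")
inputs = tokenizer(text, return_tensors="pt")

with torch.no_grad():
    outputs = model(**inputs)

# one vector per WordPiece token (including [CLS] and [SEP]), shape: (seq_len, hidden_size)
token_embeddings = outputs.last_hidden_state[0]
print(token_embeddings.shape)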

Any kind of suggestion or advice will be helpful for me. Thanks in advance.

add web demo/models/datasets to NAACL 2022 organization on Hugging Face

Hi, congrats on the acceptance at NAACL 2022. We are having an event on Hugging Face for NAACL 2022, where you can submit Spaces (web demos), models, and datasets for papers for a chance to win prizes. The Hugging Face Hub works similarly to GitHub: you can push to user profiles or organization accounts. You can add the models/datasets and Spaces to this organization: https://huggingface.co/NAACL2022, after joining the organization using this link: https://huggingface.co/organizations/NAACL2022/share/FnuCfwNhiIRWAlngiEkLcwuUrMDMTCPbje. Let me know if you need any help with the above steps, thanks.

Also, I see you already have models and datasets here: https://huggingface.co/csebuetnlp. It would be great to have them as a submission to the event; you can simply clone and push them to the NAACL 2022 organization.
