Hi, I'm reproducing your training run exactly as the README says, with the `sh train_bart_model.sh` command, and this error appears at the end.
{'loss': 0.0193, 'grad_norm': 0.06763239204883575, 'learning_rate': 2.983362019506598e-06, 'epoch': 2.98}
{'loss': 0.0197, 'grad_norm': 0.07400441914796829, 'learning_rate': 1.835915088927137e-06, 'epoch': 2.99}
{'loss': 0.0201, 'grad_norm': 0.07796286791563034, 'learning_rate': 6.884681583476765e-07, 'epoch': 2.99}
{'train_runtime': 8481.4381, 'train_samples_per_second': 105.243, 'train_steps_per_second': 0.411, 'train_loss': 0.0540917304825899, 'epoch': 3.0}
100%|██████████████████████████████████████████████████████████████████████████████| 3486/3486 [2:21:21<00:00, 2.43s/it]
[WARNING|configuration_utils.py:447] 2024-03-30 18:17:42,674 >> Some non-default generation parameters are set in the model config. These should go into a GenerationConfig file (https://huggingface.co/docs/transformers/generation_strategies#save-a-custom-decoding-strategy-with-your-model) instead. This warning will be raised to an exception in v4.41.
Non-default generation parameters: {'early_stopping': True, 'num_beams': 4, 'no_repeat_ngram_size': 3, 'forced_bos_token_id': 0, 'forced_eos_token_id': 2}
***** train metrics *****
epoch = 3.0
train_loss = 0.0541
train_runtime = 2:21:21.43
train_samples = 297536
train_samples_per_second = 105.243
train_steps_per_second = 0.411
100%|████████████████████████████████████████████████████████████████████████████████████| 63/63 [26:08<00:00, 21.08s/it]Traceback (most recent call last):
File "/var/www/nlp/spelling/run_summarization.py", line 708, in
main()
File "/var/www/nlp/spelling/run_summarization.py", line 650, in main
metrics = trainer.evaluate(max_length=max_length, num_beams=num_beams, metric_key_prefix="eval")
File "/var/www/nlp/spelling/venv/lib/python3.10/site-packages/transformers/trainer_seq2seq.py", line 180, in evaluate
return super().evaluate(eval_dataset, ignore_keys=ignore_keys, metric_key_prefix=metric_key_prefix)
File "/var/www/nlp/spelling/venv/lib/python3.10/site-packages/transformers/trainer.py", line 3365, in evaluate
output = eval_loop(
File "/var/www/nlp/spelling/venv/lib/python3.10/site-packages/transformers/trainer.py", line 3656, in evaluation_loop
metrics = self.compute_metrics(EvalPrediction(predictions=all_preds, label_ids=all_labels))
File "/var/www/nlp/spelling/run_summarization.py", line 590, in compute_metrics
decoded_preds = tokenizer.batch_decode(preds, skip_special_tokens=True)
File "/var/www/nlp/spelling/venv/lib/python3.10/site-packages/transformers/tokenization_utils_base.py", line 3785, in batch_decode
return [
File "/var/www/nlp/spelling/venv/lib/python3.10/site-packages/transformers/tokenization_utils_base.py", line 3786, in
self.decode(
File "/var/www/nlp/spelling/venv/lib/python3.10/site-packages/transformers/tokenization_utils_base.py", line 3825, in decode
return self._decode(
File "/var/www/nlp/spelling/venv/lib/python3.10/site-packages/transformers/tokenization_utils_fast.py", line 625, in _decode
text = self._tokenizer.decode(token_ids, skip_special_tokens=skip_special_tokens)
OverflowError: out of range integral type conversion attempted
100%|████████████████████████████████████████████████████████████████████████████████████| 63/63 [26:09<00:00, 24.91s/it]
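In case it helps with debugging: the crash happens inside `tokenizer.batch_decode` in `compute_metrics`, which I believe is the symptom of `-100` loss-masking sentinels being left in the prediction array — the fast tokenizer cannot cast a negative id to an unsigned integer and raises exactly this `OverflowError`. A minimal sketch of that hypothesis, with the sentinels replaced by a pad id before decoding (the helper name `sanitize_for_decode` and the hard-coded pad id `1` are my assumptions, not something from `run_summarization.py` — in the real script you would use `tokenizer.pad_token_id`):

```python
import numpy as np

# BART's pad token id is 1; in compute_metrics this should come from
# tokenizer.pad_token_id rather than being hard-coded. (Assumed helper,
# not part of the original run_summarization.py.)
PAD_TOKEN_ID = 1

def sanitize_for_decode(preds: np.ndarray, pad_token_id: int = PAD_TOKEN_ID) -> np.ndarray:
    """Replace the -100 sentinel used for loss masking with a real pad id,
    so every value is a valid non-negative token id for the tokenizer."""
    return np.where(preds != -100, preds, pad_token_id)

# Example: a batch of predicted token ids padded with -100, the shape that
# makes tokenizer.batch_decode raise the OverflowError above.
preds = np.array([[0, 42, 7, 2, -100, -100],
                  [0, 13, 2, -100, -100, -100]])
clean = sanitize_for_decode(preds)
# clean contains only non-negative ids and is safe to pass to
# tokenizer.batch_decode(clean, skip_special_tokens=True)
```

If the hypothesis is right, adding that one `np.where` line at the top of `compute_metrics` (before line 590's `batch_decode` call) should let the evaluation finish.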
My environment is Ubuntu 20.04, 32 GB RAM, 48 cores, RTX 4080.
Sat Mar 30 16:50:57 2024
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 525.147.05 Driver Version: 525.147.05 CUDA Version: 12.0 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|===============================+======================+======================|
| 0 NVIDIA GeForce ... Off | 00000000:81:00.0 On | N/A |
| 53% 68C P2 196W / 320W | 10981MiB / 16376MiB | 79% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=============================================================================|
| 0 N/A N/A 3901 G /usr/lib/xorg/Xorg 144MiB |
| 0 N/A N/A 4070 G /usr/bin/gnome-shell 66MiB |
| 0 N/A N/A 6775 G ...3/usr/lib/firefox/firefox 11MiB |
| 0 N/A N/A 15644 G ...on=20240329-134507.235000 58MiB |
| 0 N/A N/A 32210 C python 10694MiB |
+-----------------------------------------------------------------------------+
My last checkpoint was 3000. I don't know whether all 3 epochs actually finished.
Thanks in advance,
Martín