T5 Model for Farsi.
The training process, adapted largely from the Hugging Face transformers examples, is briefly as follows:
We demonstrate how to train a T5 model using the span-masked language model objective proposed in *Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer*. More specifically, we demonstrate how JAX/Flax can be leveraged to pre-train google/t5-v1_1-base
in Farsi on a single GPU (NVIDIA GeForce RTX 3060) for ? hours.
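The span-masked objective replaces contiguous spans of input tokens with sentinel tokens, and the model learns to reconstruct the hidden spans. As a toy illustration (the helper below is a pure-Python sketch for intuition, not part of the example scripts, which operate on token IDs with randomly sampled spans):

```python
def span_corrupt(tokens, spans):
    """Illustrative T5-style span corruption.

    `tokens` is a list of tokens; `spans` is a list of (start, end)
    index pairs (end exclusive) to mask. Each masked span is replaced
    in the input by a sentinel token <extra_id_i>; the target lists
    each sentinel followed by the tokens it hid.
    """
    inputs, targets = [], []
    cursor = 0
    for i, (start, end) in enumerate(spans):
        sentinel = f"<extra_id_{i}>"
        inputs.extend(tokens[cursor:start])
        inputs.append(sentinel)
        targets.append(sentinel)
        targets.extend(tokens[start:end])
        cursor = end
    inputs.extend(tokens[cursor:])
    targets.append(f"<extra_id_{len(spans)}>")  # closing sentinel
    return inputs, targets

tokens = "Thank you for inviting me to your party last week".split()
inp, tgt = span_corrupt(tokens, [(2, 4), (8, 9)])
# inp: ['Thank', 'you', '<extra_id_0>', 'me', 'to', 'your', 'party', '<extra_id_1>', 'week']
# tgt: ['<extra_id_0>', 'for', 'inviting', '<extra_id_1>', 'last', '<extra_id_2>']
```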
Let's start by creating a model repository to save the trained model and logs.
Here we call the model "t5-farsi", but you can change the model name as you like.
The default values will save the model in t5-farsi/
relative to the repository directory.
In the first step, we train a tokenizer to efficiently process the text input for the model. We make use of the tokenizers library to train a SentencePiece unigram tokenizer, as shown in t5_tokenizer_model.py, which is heavily inspired by yandex-research/DeDLOC's tokenizer model. The tokenizer is trained on the complete Persian portion of our datasets and then saved in the cloned model directory. The tokenizer training script is provided in t5_tokenizer_train.py.
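For intuition about what a unigram tokenizer does at inference time, it scores every possible segmentation of a string against a learned piece vocabulary and keeps the most probable one via a Viterbi search. A minimal pure-Python sketch (the vocabulary and probabilities below are made up for illustration; the actual training and segmentation are handled by the tokenizers library):

```python
import math

def unigram_segment(text, vocab):
    """Viterbi search for the highest-probability segmentation of
    `text` under a unigram model. `vocab` maps piece -> probability."""
    n = len(text)
    # best[i] = (best log-prob of text[:i], start index of its last piece)
    best = [(-math.inf, -1)] * (n + 1)
    best[0] = (0.0, -1)
    for end in range(1, n + 1):
        for start in range(end):
            piece = text[start:end]
            if piece in vocab:
                score = best[start][0] + math.log(vocab[piece])
                if score > best[end][0]:
                    best[end] = (score, start)
    if best[n][1] == -1 and n > 0:
        raise ValueError("text cannot be segmented with this vocabulary")
    # Backtrack from the end to recover the winning pieces
    pieces, pos = [], n
    while pos > 0:
        start = best[pos][1]
        pieces.append(text[start:pos])
        pos = start
    return pieces[::-1]

vocab = {"un": 0.1, "happy": 0.2, "u": 0.05, "n": 0.05, "hap": 0.05, "py": 0.05}
print(unigram_segment("unhappy", vocab))  # -> ['un', 'happy']
```

The search prefers the two-piece split because its total log-probability beats any segmentation built from the shorter, lower-probability pieces.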
You can simply run it with the command below if you want to use the OSCAR dataset:
python t5_tokenizer_train.py
or with the alternative command below if you want to use your own .txt
file:
python t5_tokenizer_train.py [TRAIN_TEXT_FILE] [CACHE_DIR]
Next, we create the model's configuration file. This is as simple as loading the **google/t5-v1_1-base**
configuration and storing it in the local model folder. You can run this step with
python t5_config.py
Next, we can run the example script to pretrain the model. For this step you may need to run the train.sh
file with:
bash train.sh
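train.sh wraps the launch command together with the chosen hyperparameters. As a hedged sketch of what such a script might look like, assuming it calls the run_t5_mlm_flax.py example script from transformers (all paths and hyperparameter values below are placeholders, not the settings actually used for this model):

```shell
#!/usr/bin/env bash
# Hypothetical launch script; every flag value here is a placeholder,
# not the configuration actually used to train t5-farsi.
python run_t5_mlm_flax.py \
    --output_dir="./t5-farsi" \
    --model_type="t5" \
    --config_name="./t5-farsi" \
    --tokenizer_name="./t5-farsi" \
    --dataset_name="oscar" \
    --dataset_config_name="unshuffled_deduplicated_fa" \
    --max_seq_length="512" \
    --per_device_train_batch_size="32" \
    --per_device_eval_batch_size="32" \
    --learning_rate="5e-3" \
    --num_train_epochs="10" \
    --logging_steps="500" \
    --save_steps="2500" \
    --eval_steps="2500"
```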
[Our results (accuracy and loss) should come here.]
For more details, check out here.