T5 Model for Farsi.
The training process, adapted largely from the Hugging Face transformers examples, is briefly as follows:
We demonstrate how to train a T5 model using the span-masked language model objective proposed in *Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer*. More specifically, we demonstrate how JAX/Flax can be leveraged to pre-train google/t5-v1_1-base
in Farsi on a single GPU (NVIDIA GeForce RTX 3060) for ? hours.
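The span-masked objective replaces contiguous spans of input tokens with sentinel tokens, and the model learns to reconstruct the hidden spans. As a toy illustration (the helper below is a pure-Python sketch for intuition, not part of the example scripts, which operate on token IDs with randomly sampled spans):

```python
def span_corrupt(tokens, spans):
    """Illustrative T5-style span corruption.

    `tokens` is a list of tokens; `spans` is a list of (start, end)
    index pairs (end exclusive) to mask. Each masked span is replaced
    in the input by a sentinel token <extra_id_i>; the target lists
    each sentinel followed by the tokens it hid.
    """
    inputs, targets = [], []
    cursor = 0
    for i, (start, end) in enumerate(spans):
        sentinel = f"<extra_id_{i}>"
        inputs.extend(tokens[cursor:start])
        inputs.append(sentinel)
        targets.append(sentinel)
        targets.extend(tokens[start:end])
        cursor = end
    inputs.extend(tokens[cursor:])
    targets.append(f"<extra_id_{len(spans)}>")  # closing sentinel
    return inputs, targets

tokens = "Thank you for inviting me to your party last week".split()
inp, tgt = span_corrupt(tokens, [(2, 4), (8, 9)])
# inp: ['Thank', 'you', '<extra_id_0>', 'me', 'to', 'your', 'party', '<extra_id_1>', 'week']
# tgt: ['<extra_id_0>', 'for', 'inviting', '<extra_id_1>', 'last', '<extra_id_2>']
```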
Let's start by creating a model repository to save the trained model and logs.
Here we call the model "t5-farsi", but you can change the model name as you like.
The default values will save the model in t5-farsi/
relative to the repository directory.
In the first step, we train a tokenizer to efficiently process the text input for the model. We make use of the tokenizers library to train a SentencePiece unigram tokenizer, as shown in t5_tokenizer_model.py, which is heavily inspired by yandex-research/DeDLOC's tokenizer model. The tokenizer is trained on the complete Persian portion of our datasets and then saved in the cloned model directory. The tokenizer training script is provided in t5_tokenizer_train.py.
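For intuition about what a unigram tokenizer does at inference time, it scores every possible segmentation of a string against a learned piece vocabulary and keeps the most probable one via a Viterbi search. A minimal pure-Python sketch (the vocabulary and probabilities below are made up for illustration; the actual training and segmentation are handled by the tokenizers library):

```python
import math

def unigram_segment(text, vocab):
    """Viterbi search for the highest-probability segmentation of
    `text` under a unigram model. `vocab` maps piece -> probability."""
    n = len(text)
    # best[i] = (best log-prob of text[:i], start index of its last piece)
    best = [(-math.inf, -1)] * (n + 1)
    best[0] = (0.0, -1)
    for end in range(1, n + 1):
        for start in range(end):
            piece = text[start:end]
            if piece in vocab:
                score = best[start][0] + math.log(vocab[piece])
                if score > best[end][0]:
                    best[end] = (score, start)
    if best[n][1] == -1 and n > 0:
        raise ValueError("text cannot be segmented with this vocabulary")
    # Backtrack from the end to recover the winning pieces
    pieces, pos = [], n
    while pos > 0:
        start = best[pos][1]
        pieces.append(text[start:pos])
        pos = start
    return pieces[::-1]

vocab = {"un": 0.1, "happy": 0.2, "u": 0.05, "n": 0.05, "hap": 0.05, "py": 0.05}
print(unigram_segment("unhappy", vocab))  # -> ['un', 'happy']
```

The search prefers the two-piece split because its total log-probability beats any segmentation built from the shorter, lower-probability pieces.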
You can simply run it with the command below if you want to use the OSCAR dataset:
python t5_tokenizer_train.py
or with the alternative command below if you want to use your own .txt
file:
python t5_tokenizer_train.py [TRAIN_TEXT_FILE] [CACHE_DIR]
Next, we create the model's configuration file. This is as simple as loading the **google/t5-v1_1-base**
configuration and storing it in the local model folder. You can run this step with
python t5_config.py
Next, we can run the example script to pretrain the model. For this step you may need to run the train.sh
file with:
bash train.sh
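train.sh wraps the launch command together with the chosen hyperparameters. As a hedged sketch of what such a script might look like, assuming it calls the run_t5_mlm_flax.py example script from transformers (all paths and hyperparameter values below are placeholders, not the settings actually used for this model):

```shell
#!/usr/bin/env bash
# Hypothetical launch script; every flag value here is a placeholder,
# not the configuration actually used to train t5-farsi.
python run_t5_mlm_flax.py \
    --output_dir="./t5-farsi" \
    --model_type="t5" \
    --config_name="./t5-farsi" \
    --tokenizer_name="./t5-farsi" \
    --dataset_name="oscar" \
    --dataset_config_name="unshuffled_deduplicated_fa" \
    --max_seq_length="512" \
    --per_device_train_batch_size="32" \
    --per_device_eval_batch_size="32" \
    --learning_rate="5e-3" \
    --num_train_epochs="10" \
    --logging_steps="500" \
    --save_steps="2500" \
    --eval_steps="2500"
```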
[Our results (accuracy and loss) should come here.]
For more details, check out here.