GPT, or Generative Pre-trained Transformer, is a state-of-the-art natural language processing model developed by OpenAI. It is a deep learning model trained on massive amounts of text and can generate human-like text with impressive coherence and fluency. This repository walks through training a GPT model on a tiny Bengali corpus.
## Directory Structure
```
├── artifact                    - Artifact directory (created automatically)
│   └── exp-1                   - Experiment directory
│       ├── model_step_41600_epoch_099_loss_0.024063_vloss_0.025048.pt - Saved model
│       ├── config.py           - Experiment config
│       ├── events.out.tfevents - TensorBoard event log
│       └── training.log        - Training logs
├── config.py                   - Config file
├── text_data
│   └── prothom_alo_opnion_section(2011-2019).zip - Text data, unzip it
├── datagen.py                  - Dataloader
├── main.py                     - Training script
├── modeling                    - Model directory
│   └── gpt.py                  - GPT model
├── readme.md                   - Documentation
├── requirements.txt            - Requirements file
├── sample_prediction.txt       - Sample prediction
├── text_generation_demo.py     - Text generation demo
├── text_processor.py           - Tokenizer
├── tokenizer                   - Vocabulary directory
│   └── vocab_complete
│       ├── tokenizer.8000.model
│       └── tokenizer.8000.vocab
├── trainer.py                  - Trainer script
└── utils.py                    - Utility functions
```
Requires `git`, `python`, and `conda`.
- Clone this project:

```shell
git clone https://github.com/Jahid006/BnTextGenerationGPT.git
```

- Install dependencies:

```shell
pip install -r requirements.txt
```
To train:

- Update the configuration in `config.py`
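The later steps read a handful of fields from `config.py`; the sketch below shows what it is expected to provide. Every value is a placeholder and the paths are hypothetical, not the repo's actual defaults:

```python
# config.py - illustrative values only; adjust paths and sizes to your setup.
DATA_PATH = "text_data/corpus.txt"        # hypothetical path to the unzipped corpus
TOKENIZER_PATH = "tokenizer/vocab_complete/tokenizer.8000"  # hypothetical tokenizer prefix
TOKENIZER_MAX_SEQ = 256                   # max sequence length for the tokenizer
TOKENIZER_VOCAB_SIZE = 8000               # matches tokenizer.8000.* in the repo tree
MAX_LEN = 256                             # max sequence length fed to the model
EPOCHS = 100                              # number of training epochs
MODEL_SAVE_DIR = "artifact/exp-1"         # where checkpoints and logs are written
```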
- Train a tokenizer:
```python
import config as cfg
from text_processor import Tokenizer

sp_tokenizer = Tokenizer(
    sentencepiece_path=cfg.TOKENIZER_PATH,
    max_len=cfg.TOKENIZER_MAX_SEQ,
    vocab_size=cfg.TOKENIZER_VOCAB_SIZE
)
sp_tokenizer.train(cfg.DATA_PATH)
```
- Or load a pretrained tokenizer:
```python
import config as cfg
from text_processor import Tokenizer

sp_tokenizer = Tokenizer(
    sentencepiece_path=cfg.TOKENIZER_PATH,
    max_len=cfg.TOKENIZER_MAX_SEQ,
    vocab_size=cfg.TOKENIZER_VOCAB_SIZE
)
sp_tokenizer.load()
```
- Define your PyTorch train/validation `Dataset` objects:
```python
import config as cfg
from datagen import (
    DataGenerator,
    text_preprocessor as preprocessor
)

generator = DataGenerator(
    data,
    tokenizer=sp_tokenizer,
    preprocessor=preprocessor,
    max_len=cfg.MAX_LEN,
)
```
- Define a `DataLoader` over each dataset
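This step can be sketched as follows. The `TensorDataset` is only a stand-in for the `DataGenerator` dataset from the previous step, and the batch size is an assumption, not a value from the repo:

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# Stand-in datasets: 100 train / 20 validation sequences of 16 token ids each.
train_dataset = TensorDataset(torch.randint(0, 8000, (100, 16)))
val_dataset = TensorDataset(torch.randint(0, 8000, (20, 16)))

train_dataloader = DataLoader(train_dataset, batch_size=32, shuffle=True)  # reshuffle each epoch
val_dataloader = DataLoader(val_dataset, batch_size=32, shuffle=False)     # keep eval order fixed

# 100 samples at batch size 32 yield 4 batches, the last one partial.
```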
- Define the optimizer, scheduler, and criterion
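A minimal sketch of this step, assuming standard PyTorch components: the `AdamW`/cosine-schedule choice and the hyperparameters are illustrative assumptions, not taken from the repo's config, and the linear layer stands in for the GPT model:

```python
import torch

model = torch.nn.Linear(16, 8000)  # stand-in for the GPT model from modeling/gpt.py

optimizer = torch.optim.AdamW(model.parameters(), lr=3e-4)                    # hypothetical lr
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=100)  # decay over 100 epochs
criterion = torch.nn.CrossEntropyLoss()  # standard language-modeling loss over the vocabulary
```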
- Train the model:
```python
import config as cfg
import trainer
from torch.utils.tensorboard import SummaryWriter

model = trainer.train(
    model=model,
    optimizer=optimizer,
    scheduler=scheduler,
    criterion=criterion,
    train_dataset=train_dataloader,
    test_dataset=val_dataloader,
    device=device,
    epochs=cfg.EPOCHS,
    summary_writter=SummaryWriter(cfg.MODEL_SAVE_DIR),
    logging=logging,
    saving_step=1500,
    steps_per_epoch=len(train_generator),
    model_saving_dir=cfg.MODEL_SAVE_DIR
)
```
- `main.py` has a complete walkthrough of the training process
- Follow `text_generation_demo.py` for inference
- Sample prediction:
```python
import config as cfg
import text_generation_demo

text = "জেনারেল এরশাদের সিদ্ধান্ত অমান্য করে তাঁর স্ত্রী রওশন এরশাদ যে আওয়ামী লীগের সঙ্গে সমঝোতা করে ফেলেছিলেন, সে কথাও সবার জানা। রওশন এরশাদের সিদ্ধান্তেই সেদিন জাতীয় পার্টির অন্য নেতারা মনোনয়নপত্র জমা দিয়েছেন এবং নির্বাচনে প্রতিদ্বন্দ্বিতা করেছেন। জেনারেল এরশাদের নামে যেসব আসনে মনোনয়নপত্র জমা পড়েছিল, সেগুলোর যথার্থতা নিয়েও তখন প্রশ্ন উঠেছিল।"

output = text_generation_demo.generate_text(
    model=model,
    tokenizer=sp_tokenizer,
    prefix_text=text,
    max_seq_len=cfg.MAX_LEN,
    max_generated_text=64,
    device=device
)
# output: "এরশাদের পতনের পর বিএনপি-জামায়াতের নেতারা একপর্যায়ে আওয়ামী লীগকে ক্ষমতার অংশী করেছিলেন। এরশাদ ক্ষমতার অংশীদার ছিলেন।"
```