GPT, or Generative Pre-trained Transformer, is a state-of-the-art natural language processing model developed by OpenAI. It is a deep learning model trained on massive amounts of text and can generate human-like text with impressive coherence and fluency. This repository walks through training a GPT model on a tiny Bengali corpus.
## Directory Structure
```
├── artifact                    - Artifact directory (created automatically)
│   └── exp-1                   - Experiment directory
│       ├── model_step_41600_epoch_099_loss_0.024063_vloss_0.025048.pt - Saved model
│       ├── config.py           - Experiment config
│       ├── events.out.tfevents - TensorBoard event log
│       └── training.log        - Training logs
├── config.py                   - Config file
├── text_data
│   └── prothom_alo_opnion_section(2011-2019).zip - Text data, unzip it
├── datagen.py                  - Dataloader
├── main.py                     - Training script
├── modeling                    - Model directory
│   └── gpt.py                  - GPT model
├── readme.md                   - Documentation
├── requirements.txt            - Requirements file
├── sample_prediction.txt       - Sample prediction
├── text_generation_demo.py     - Text generation demo
├── text_processor.py           - Tokenizer
├── tokenizer                   - Vocabulary directory
│   └── vocab_complete
│       ├── tokenizer.8000.model
│       └── tokenizer.8000.vocab
├── trainer.py                  - Trainer script
└── utils.py                    - Utility functions
```
Requires `git`, `python`, and `conda`.
- Clone this project:

```shell
git clone https://github.com/Jahid006/BnTextGenerationGPT.git
```

- Install dependencies:

```shell
pip install -r requirements.txt
```
To train:

- Update the configuration in `config.py`
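The later steps read a handful of fields from `config.py`; the sketch below shows what it is expected to provide. Every value is a placeholder and the paths are hypothetical, not the repo's actual defaults:

```python
# config.py - illustrative values only; adjust paths and sizes to your setup.
DATA_PATH = "text_data/corpus.txt"        # hypothetical path to the unzipped corpus
TOKENIZER_PATH = "tokenizer/vocab_complete/tokenizer.8000"  # hypothetical tokenizer prefix
TOKENIZER_MAX_SEQ = 256                   # max sequence length for the tokenizer
TOKENIZER_VOCAB_SIZE = 8000               # matches tokenizer.8000.* in the repo tree
MAX_LEN = 256                             # max sequence length fed to the model
EPOCHS = 100                              # number of training epochs
MODEL_SAVE_DIR = "artifact/exp-1"         # where checkpoints and logs are written
```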
- Train a tokenizer:
```python
import config as cfg
from text_processor import Tokenizer

sp_tokenizer = Tokenizer(
    sentencepiece_path=cfg.TOKENIZER_PATH,
    max_len=cfg.TOKENIZER_MAX_SEQ,
    vocab_size=cfg.TOKENIZER_VOCAB_SIZE
)
sp_tokenizer.train(cfg.DATA_PATH)
```
- Or load a pretrained tokenizer:
```python
import config as cfg
from text_processor import Tokenizer

sp_tokenizer = Tokenizer(
    sentencepiece_path=cfg.TOKENIZER_PATH,
    max_len=cfg.TOKENIZER_MAX_SEQ,
    vocab_size=cfg.TOKENIZER_VOCAB_SIZE
)
sp_tokenizer.load()
```
- Define your PyTorch train/validation `Dataset` objects:
```python
import config as cfg
from datagen import (
    DataGenerator,
    text_preprocessor as preprocessor
)

generator = DataGenerator(
    data,
    tokenizer=sp_tokenizer,
    preprocessor=preprocessor,
    max_len=cfg.MAX_LEN,
)
```
- Define a `DataLoader` over each dataset
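This step can be sketched as follows. The `TensorDataset` is only a stand-in for the `DataGenerator` dataset from the previous step, and the batch size is an assumption, not a value from the repo:

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# Stand-in datasets: 100 train / 20 validation sequences of 16 token ids each.
train_dataset = TensorDataset(torch.randint(0, 8000, (100, 16)))
val_dataset = TensorDataset(torch.randint(0, 8000, (20, 16)))

train_dataloader = DataLoader(train_dataset, batch_size=32, shuffle=True)  # reshuffle each epoch
val_dataloader = DataLoader(val_dataset, batch_size=32, shuffle=False)     # keep eval order fixed

# 100 samples at batch size 32 yield 4 batches, the last one partial.
```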
- Define the optimizer, scheduler, and criterion
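A minimal sketch of this step, assuming standard PyTorch components: the `AdamW`/cosine-schedule choice and the hyperparameters are illustrative assumptions, not taken from the repo's config, and the linear layer stands in for the GPT model:

```python
import torch

model = torch.nn.Linear(16, 8000)  # stand-in for the GPT model from modeling/gpt.py

optimizer = torch.optim.AdamW(model.parameters(), lr=3e-4)                    # hypothetical lr
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=100)  # decay over 100 epochs
criterion = torch.nn.CrossEntropyLoss()  # standard language-modeling loss over the vocabulary
```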
- Train the model:
```python
import config as cfg
import trainer
from torch.utils.tensorboard import SummaryWriter

model = trainer.train(
    model=model,
    optimizer=optimizer,
    scheduler=scheduler,
    criterion=criterion,
    train_dataset=train_dataloader,
    test_dataset=val_dataloader,
    device=device,
    epochs=cfg.EPOCHS,
    summary_writter=SummaryWriter(cfg.MODEL_SAVE_DIR),
    logging=logging,
    saving_step=1500,
    steps_per_epoch=len(train_generator),
    model_saving_dir=cfg.MODEL_SAVE_DIR
)
```
- `main.py` has a complete walkthrough of the training process
- Follow `text_generation_demo.py` for inference
- Sample prediction:
```python
import config as cfg
import text_generation_demo

text = "জেনারেল এরশাদের সিদ্ধান্ত অমান্য করে তাঁর স্ত্রী রওশন এরশাদ যে আওয়ামী লীগের সঙ্গে সমঝোতা করে ফেলেছিলেন, সে কথাও সবার জানা। রওশন এরশাদের সিদ্ধান্তেই সেদিন জাতীয় পার্টির অন্য নেতারা মনোনয়নপত্র জমা দিয়েছেন এবং নির্বাচনে প্রতিদ্বন্দ্বিতা করেছেন। জেনারেল এরশাদের নামে যেসব আসনে মনোনয়নপত্র জমা পড়েছিল, সেগুলোর যথার্থতা নিয়েও তখন প্রশ্ন উঠেছিল।"

output = text_generation_demo.generate_text(
    model=model,
    tokenizer=sp_tokenizer,
    prefix_text=text,
    max_seq_len=cfg.MAX_LEN,
    max_generated_text=64,
    device=device
)
# output: "এরশাদের পতনের পর বিএনপি-জামায়াতের নেতারা একপর্যায়ে আওয়ামী লীগকে ক্ষমতার অংশী করেছিলেন। এরশাদ ক্ষমতার অংশীদার ছিলেন।"
```