
Transformer Models for Domain-Specific Machine Translation

Example application for fine-tuning pre-trained machine translation models on highly domain-specific translated sentence pairs.

For this, likely translation pairs are first extracted from the original English versions and the German translations of the Harry Potter fantasy novel series using a Translated Sentence Mining approach. The extracted sentence pairs are then used to fine-tune two baseline machine translation models: the pre-trained MarianMT model for English-to-German translation and Google's Text-To-Text Transfer Transformer (T5).

Afterwards, BLEU, METEOR and BERTScore are calculated to evaluate the performance gain from fine-tuning the models.


Overview of the procedure

I. Parallel Sentence Extraction (Bitext Mining)
  1. Split the unaligned txt files for each book and its German translation into sentences using the Lingtrain Aligner splitter and preprocessor
  2. Calculate language-independent sentence-level embeddings for the split sentences using Google AI's Language-agnostic BERT Sentence Embedding model (LaBSE) in the Sentence Transformers framework
  3. Match the best-fitting translation pairs for all sentences using a k-nearest-neighbors search, mostly following the Sentence Transformers example application for Translated Sentence Mining (see the first sketch below this overview)
  4. Filter the sentence pairs by a minimum similarity score
  5. Remove sentence pairs containing sentences shorter than 20 or longer than 200 characters
  6. Randomly split the resulting corpus of ~54,000 likely parallel sentence pairs into training, validation and test sets (80%, 10%, 10%) (second sketch below)
II. Machine Translation Engine Training on Domain-Specific Corpus
  1. Load the pre-trained models Helsinki-NLP/opus-mt-en-de (MarianMTModel) and t5-base (T5ForConditionalGeneration) via Hugging Face Transformers
  2. Fine-tune the models on the extracted parallel sentences, using the training and validation sets, for 10 epochs each (training time: 3h 04m 45s for MarianMT and 9h 20m 10s for T5 on an NVIDIA GeForce GTX 1660 Ti); see the third sketch below
III. Machine Translation Quality Evaluation
  1. Use the non-fine-tuned MarianMT and T5 models to generate machine translations for a sample from the test set
  2. Use the fine-tuned models to generate machine translations for the same test sample
  3. Calculate BLEU, METEOR and BERTScore between the reference translations and the model outputs, for both the non-fine-tuned and the fine-tuned models (sketched after the results below)
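
The sketches below are illustrative, not the repository's exact code. The first one covers steps I.2-I.3: embedding the split sentences with LaBSE through Sentence Transformers and matching each English sentence to its nearest German candidate with a FAISS inner-product search. For simplicity it uses plain cosine similarity instead of the margin-based scoring used in the Sentence Transformers Translated Sentence Mining example, and the example sentences are placeholders.

```python
# Sketch of steps I.2-I.3 (illustrative): LaBSE sentence embeddings + FAISS
# nearest-neighbour matching. In the repository, en_sentences and de_sentences
# come from the Lingtrain Aligner splitter (step I.1); here they are placeholders.
import faiss
import numpy as np
from sentence_transformers import SentenceTransformer

en_sentences = ["The owl dropped a letter onto the table.",
                "He had never seen anything like it before."]
de_sentences = ["So etwas hatte er noch nie gesehen.",
                "Die Eule ließ einen Brief auf den Tisch fallen."]

labse = SentenceTransformer("sentence-transformers/LaBSE")

# With normalized embeddings, the inner product equals the cosine similarity.
en_emb = np.asarray(labse.encode(en_sentences, normalize_embeddings=True), dtype="float32")
de_emb = np.asarray(labse.encode(de_sentences, normalize_embeddings=True), dtype="float32")

# Index the German sentences and query them with the English ones (1 nearest neighbour each).
index = faiss.IndexFlatIP(de_emb.shape[1])
index.add(de_emb)
scores, neighbours = index.search(en_emb, 1)

pairs = [(en_sentences[i], de_sentences[neighbours[i][0]], float(scores[i][0]))
         for i in range(len(en_sentences))]
for en, de, score in pairs:
    print(f"{score:.3f}  {en}  ->  {de}")
```

The Sentence Transformers mining example additionally applies margin-based scoring over the k nearest neighbours in both translation directions, which is more robust than the plain cosine ranking shown here.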
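
The second sketch covers steps I.4-I.6: keeping only pairs above a minimum similarity score, dropping sentences shorter than 20 or longer than 200 characters, and splitting the result 80/10/10. The threshold value and the random seed are assumptions; only the length bounds and the split ratios come from the overview above.

```python
# Sketch of steps I.4-I.6 (illustrative): filter the mined pairs and split them
# into training, validation and test sets. `pairs` comes from the previous sketch.
import random

MIN_SIMILARITY = 0.7          # assumed threshold, not the repository's exact value
MIN_CHARS, MAX_CHARS = 20, 200

filtered = [(en, de) for en, de, score in pairs
            if score >= MIN_SIMILARITY
            and MIN_CHARS <= len(en) <= MAX_CHARS
            and MIN_CHARS <= len(de) <= MAX_CHARS]

random.seed(42)               # assumed seed
random.shuffle(filtered)

n = len(filtered)
train_pairs = filtered[:int(0.8 * n)]
val_pairs = filtered[int(0.8 * n):int(0.9 * n)]
test_pairs = filtered[int(0.9 * n):]
```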
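
The third sketch covers step II for the MarianMT model, using the Hugging Face Seq2SeqTrainer. Only the checkpoint name, the 10 epochs and the use of the training/validation split are taken from the description above; the batch size, maximum sequence length and output path are assumptions, and a reasonably recent transformers/datasets version is assumed. T5 is fine-tuned analogously, typically with the task prefix "translate English to German: " prepended to each source sentence.

```python
# Sketch of step II (illustrative): fine-tune MarianMT on the mined sentence pairs
# with the Hugging Face Seq2SeqTrainer. Hyperparameters are placeholders.
from datasets import Dataset
from transformers import (AutoModelForSeq2SeqLM, AutoTokenizer,
                          DataCollatorForSeq2Seq, Seq2SeqTrainer,
                          Seq2SeqTrainingArguments)

checkpoint = "Helsinki-NLP/opus-mt-en-de"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForSeq2SeqLM.from_pretrained(checkpoint)

# train_pairs / val_pairs are the (English, German) tuples from the previous sketch.
def to_dataset(sentence_pairs):
    return Dataset.from_dict({"en": [en for en, _ in sentence_pairs],
                              "de": [de for _, de in sentence_pairs]})

def tokenize(batch):
    # text_target tokenizes the German side as labels (transformers >= 4.22)
    return tokenizer(batch["en"], text_target=batch["de"],
                     truncation=True, max_length=256)

train_ds = to_dataset(train_pairs).map(tokenize, batched=True, remove_columns=["en", "de"])
val_ds = to_dataset(val_pairs).map(tokenize, batched=True, remove_columns=["en", "de"])

args = Seq2SeqTrainingArguments(
    output_dir="marianmt-harry-potter-en-de",   # assumed output path
    num_train_epochs=10,
    per_device_train_batch_size=16,
    save_total_limit=1,
    predict_with_generate=True,
)

trainer = Seq2SeqTrainer(
    model=model,
    args=args,
    train_dataset=train_ds,
    eval_dataset=val_ds,
    data_collator=DataCollatorForSeq2Seq(tokenizer, model=model),
)
trainer.train()
trainer.evaluate()
```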

Visualization of the procedure



Results

Model                   BLEU    METEOR   BERTScore¹
MarianMT (baseline)     0.256   0.433    0.597
MarianMT (fine-tuned)   0.388   0.552    0.717
T5-base (baseline)      0.166   0.307    0.309
T5-base (fine-tuned)    0.340   0.492    0.662

¹ BERTScore computed with the parameter rescale_with_baseline set to True
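
The scores above can be reproduced roughly along the following lines: translate the test sample with a baseline checkpoint and with its fine-tuned counterpart, then score the outputs against the references. This is a sketch under a few assumptions: the fine-tuned model path is the placeholder from the training sketch above, NLTK (not listed in the requirements) stands in for whichever METEOR implementation was actually used, and sacrebleu's 0-100 BLEU is divided by 100 to match the 0-1 scale of the table.

```python
# Sketch of step III (illustrative): translate the test sample with a baseline and a
# fine-tuned checkpoint, then compute BLEU (sacrebleu), METEOR (here via NLTK) and
# BERTScore with rescale_with_baseline=True.
import nltk
import sacrebleu
import torch
from bert_score import score as bert_score
from nltk.translate.meteor_score import meteor_score
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

nltk.download("wordnet", quiet=True)  # required by NLTK's METEOR implementation

def translate(checkpoint, sentences, batch_size=16):
    # For T5, prepend "translate English to German: " to every source sentence.
    tokenizer = AutoTokenizer.from_pretrained(checkpoint)
    model = AutoModelForSeq2SeqLM.from_pretrained(checkpoint).eval()
    translations = []
    for i in range(0, len(sentences), batch_size):
        batch = tokenizer(sentences[i:i + batch_size], return_tensors="pt",
                          padding=True, truncation=True)
        with torch.no_grad():
            generated = model.generate(**batch, max_length=256)
        translations.extend(tokenizer.batch_decode(generated, skip_special_tokens=True))
    return translations

sources = [en for en, _ in test_pairs]      # test_pairs from the mining sketch above
references = [de for _, de in test_pairs]

for checkpoint in ["Helsinki-NLP/opus-mt-en-de", "marianmt-harry-potter-en-de"]:
    hypotheses = translate(checkpoint, sources)
    bleu = sacrebleu.corpus_bleu(hypotheses, [references]).score / 100   # rescaled to 0-1
    meteor = sum(meteor_score([ref.split()], hyp.split())    # recent NLTK expects tokens
                 for ref, hyp in zip(references, hypotheses)) / len(references)
    _, _, f1 = bert_score(hypotheses, references, lang="de", rescale_with_baseline=True)
    print(f"{checkpoint}: BLEU={bleu:.3f}  METEOR={meteor:.3f}  BERTScore={f1.mean().item():.3f}")
```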


Requirements

- Python >= 3.8
- Conda
  • pytorch==1.7.1
  • cudatoolkit=10.1
  • pywin32
- pip
  • transformers
  • sentence_transformers
  • faiss-gpu
  • sacrebleu
  • datasets
  • bert-score
  • lingtrain-aligner
  • razdel
  • dateparser
  • python-dateutil
  • numpy
  • openpyxl

Notes

All files in this repository which contain text from the books are truncated after the first 50 lines. The trained model files pytorch_model.bin and optimizer.pt for each model are omitted from this repository.
