CMixAugment-MT: An Empirical Study of Leveraging Finetuning of Machine Translation Model To Produce Synthetic Code Mixing Data For Data Augmentation

This is my NLP804 (Deep Learning for Natural Language Generation) project. The goal is to evaluate different finetuning strategies for a machine translation model so that it produces better synthetic code-mixed data. Because of the limited availability of parallel corpora, I conducted the study only for translation from English to Hinglish (Hindi-English).

Setup

Set up the conda environment

conda create -n cmixaugment-mt python=3.10
conda activate cmixaugment-mt

Install all dependencies

pip install -r requirements.txt

How to run an experiment

This project mainly uses the Hydra library to manage its configuration. To run an experiment, run the following command with one of the listed config names:

python main.py --config-name <adapter_config/baseline_config/sft_config>

Evaluation

We use several evaluation metrics to assess the quality of the generated synthetic data. For downstream task evaluation, see eval/GLUECoS, which covers NLI and Sentiment Analysis; for training files other than the baseline, the sample ratio is added to the filename just before the extension. For code-mixing metrics, see eval/LID-tool; the main file is eval/LID-tool/measure_cmi.py. For LLM-as-a-judge evaluation, see eval/LLM-eval: llm_eval.py scores text quality with a specific LLM, and merge_llm_scores.py then merges the judgments from all LLMs. Lastly, for reference-free metrics, see eval/eval_comet_score.py and eval/eval_lbse_sim_score.py.
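To give a rough idea of what a code-mixing metric measures, here is a minimal sketch of the utterance-level Code-Mixing Index (CMI) in its commonly cited form, 100 * (1 - max(w_i) / (n - u)), where n is the number of tokens, u the number of language-independent tokens, and w_i the token count of language i. This is an illustration only: the function name, the tag labels, and the set of language-independent tags are assumptions, and eval/LID-tool/measure_cmi.py may compute the metric differently.

```python
from collections import Counter
from typing import Iterable

# Hypothetical tags for tokens that belong to no language
# (punctuation, named entities, symbols). The real LID tool may use other labels.
LANGUAGE_INDEPENDENT = {"other", "univ", "ne"}


def code_mixing_index(tags: Iterable[str]) -> float:
    """Return the utterance-level CMI (0 to 100) for per-token language tags."""
    normalized = [t.lower() for t in tags]
    # Count only tokens that carry a language label (this total is n - u).
    lang_counts = Counter(t for t in normalized if t not in LANGUAGE_INDEPENDENT)
    n_lang = sum(lang_counts.values())
    if n_lang == 0:
        # No language-tagged tokens, so there is no mixing to measure.
        return 0.0
    dominant = max(lang_counts.values())
    return 100.0 * (1.0 - dominant / n_lang)


if __name__ == "__main__":
    # Example: three Hindi tokens, two English tokens, one language-independent token.
    example_tags = ["hi", "hi", "en", "other", "en", "hi"]
    print(f"CMI = {code_mixing_index(example_tags):.1f}")
```

A monolingual sentence scores 0, and the score grows as the non-dominant languages take a larger share of the language-tagged tokens.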

Experiment Assets

You can check out this link
