hinglish's Introduction

Hinglish

Tools and Data

Data and Model files

Approach   | LM Perplexity | Classifier F1
BERT       | 8.2           | 0.63
DistilBERT | 6.5           | 0.63
ULMFIT     | 21            | 0.61
RoBERTa    | 7.54          | 0.64

Model Performance

Base LM                            | Dataset | Accuracy | Precision | Recall | F1    | LM Perplexity
bert-base-multilingual-cased       | Test    | 0.688    | 0.698     | 0.686  | 0.687 | 8.2
bert-base-multilingual-cased       | Valid   | 0.62     | 0.592     | 0.605  | 0.55  | 8.2
distilbert-base-uncased            | Test    | 0.693    | 0.694     | 0.703  | 0.698 | 6.51
distilbert-base-uncased            | Valid   | 0.607    | 0.614     | 0.600  | 0.592 | 6.51
distilbert-base-multilingual-cased | Test    | 0.612    | 0.615     | 0.616  | 0.616 | 8.1
distilbert-base-multilingual-cased | Valid   | 0.55     | 0.531     | 0.537  | 0.495 | 8.1
roberta-base                       | Test    | 0.630    | 0.629     | 0.644  | 0.635 | 7.54
roberta-base                       | Valid   | 0.60     | 0.617     | 0.607  | 0.595 | 7.54
Ensemble                           | Test    | 0.714    | 0.718     | 0.718  | 0.718 | -

Ensemble Performance

Model | Accuracy | Precision | Recall | F1 | Config | Link to Model and output files
BERT | 0.68866 | 0.69821 | 0.68608 | 0.6875 | Batch Size: 16; Attention Dropout: 0.4; Learning Rate: 5e-07; Adam epsilon: 1e-08; Hidden Dropout Probability: 0.3; Epochs: 3 | BERT
DistilBert | 0.69333 | 0.69496 | 0.70379 | 0.6982 | Batch Size: 16; Attention Dropout: 0.6; Learning Rate: 3e-05; Adam epsilon: 1e-08; Hidden Dropout Probability: 0.6; Epochs: 3 | DistilBert
EnsembleBert1 | 0.69233 | 0.70236 | 0.69064 | 0.68952 | Batch Size: 4; Attention Dropout: 0.7; Learning Rate: 5.01e-05; Adam epsilon: 4.79e-05; Hidden Dropout Probability: 0.1; Epochs: 3 | EnsembleBert1
EnsembleBert2 | 0.691 | 0.7009 | 0.6889 | 0.68872 | Batch Size: 4; Attention Dropout: 0.6; Learning Rate: 5.13e-05; Adam epsilon: 9.72e-05; Hidden Dropout Probability: 0.2; Epochs: 3 | EnsembleBert2
EnsembleDistilBert1 | 0.70166 | 0.70377 | 0.70976 | 0.7061 | Batch Size: 16; Attention Dropout: 0.8; Learning Rate: 3.02e-05; Adam epsilon: 9.35e-05; Hidden Dropout Probability: 0.4; Epochs: 3 | EnsembleDistilBert1
EnsembleDistilBert2 | 0.689 | 0.691 | 0.69666 | 0.69335 | Batch Size: 4; Attention Dropout: 0.6; Learning Rate: 5.13e-05; Adam epsilon: 9.72e-05; Hidden Dropout Probability: 0.2; Epochs: 3 | EnsembleDistilBert2
EnsembleDistilBert3 | 0.69366 | 0.69538 | 0.70557 | 0.69905 | Batch Size: 16; Attention Dropout: 0.4; Learning Rate: 4.74e-05; Adam epsilon: 4.09e-05; Hidden Dropout Probability: 0.6; Epochs: 3 | EnsembleDistilBert3
Ensemble | 0.71466 | 0.71867 | 0.71853 | 0.7182 | NA | Ensemble

hinglish's People

Contributors

meghanabhange, nirantk

hinglish's Issues

Annotation process on Hinglish dataset

Hello guys,
I am trying to understand how to annotate Hinglish data so that it is useful for fine-tuning or building a language model. In your train.json file, I see that text is the tweet and clean_text is the preprocessed text. What about the Hindi words in the tweet? Are they tagged? One article I read annotated Hinglish tweets with the corresponding English sentences. Another tagged each word in a Hinglish sentence with its language, and tagged the whole sentence with a sentiment label or a label for some other classification problem.

I just want to know what process was used to annotate the training data. Can you please tell me?

Find the top 200 samples most likely to be wrong using Cleanlab

We will use these results to manually verify how far the labels in this dataset can be trusted.

E.g. if the error rate is, say, more than 5%, we will pass this value to test-time prediction, which will hopefully have the same label error rate.
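The ranking step can be sketched without Cleanlab itself. The snippet below is a simplified stand-in, not Cleanlab's algorithm: it ranks samples by "self-confidence" (the model's probability for the given label) and keeps only samples where the model prefers a different class, whereas Cleanlab's confident learning additionally uses per-class thresholds. The function names and the `top_k` parameter are hypothetical, and `pred_probs` is assumed to hold out-of-sample (e.g. cross-validated) class probabilities.

```python
from typing import List, Sequence


def suspect_indices(labels: Sequence[int],
                    pred_probs: Sequence[Sequence[float]]) -> List[int]:
    """Indices where the model's top class disagrees with the given label,
    sorted so the least plausible given labels come first."""
    suspects = []
    for i, (y, probs) in enumerate(zip(labels, pred_probs)):
        predicted = max(range(len(probs)), key=probs.__getitem__)
        if predicted != y:
            suspects.append(i)
    # Self-confidence = probability the model assigns to the given label.
    suspects.sort(key=lambda i: pred_probs[i][labels[i]])
    return suspects


def top_suspects(labels, pred_probs, top_k: int = 200) -> List[int]:
    """The top_k samples to review by hand (200 in the issue above)."""
    return suspect_indices(labels, pred_probs)[:top_k]


def estimated_error_rate(labels, pred_probs) -> float:
    """Rough share of suspect labels; only a starting point for manual checks."""
    return len(suspect_indices(labels, pred_probs)) / len(labels)
```

The estimated rate is only as good as the model behind `pred_probs`, so the manual review of the top-ranked samples remains the actual check.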

Remove URL artifacts

The people who preprocessed this data don't know how to use regex to remove URLs completely.

They just removed the special characters without removing the URL itself, which leaves residue such as: https t co tsrsbu
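A minimal sketch of one way to handle both cases with Python's `re` module follows. It assumes the residue always has the shape `https t co <token>` (as in the example above, presumably from t.co short links); the pattern names and `strip_urls` are hypothetical, not part of the repository.

```python
import re

# Full URLs, for text whose punctuation is still intact.
URL_RE = re.compile(r"(?:https?://|www\.)\S+", re.IGNORECASE)

# Residue left when special characters were stripped before the URL,
# e.g. "https t co tsrsbu". Assumption: only t.co links appear this way.
URL_RESIDUE_RE = re.compile(r"\bhttps?\s+t\s+co\s+\w+", re.IGNORECASE)


def strip_urls(text: str) -> str:
    """Remove intact URLs and t.co residue, then collapse whitespace."""
    text = URL_RE.sub(" ", text)
    text = URL_RESIDUE_RE.sub(" ", text)
    return re.sub(r"\s+", " ", text).strip()
```

For example, `strip_urls("thank you are https t co jyeeskiyf")` gives `"thank you are"`. Running the full-URL pattern before any punctuation stripping avoids creating the residue in the first place.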

Use Macro F1 score with 5-fold CV instead of accuracy for all measurements

  • Use a randomized 5-fold CV with a fixed seed instead of a single test split
  • Measure both F1 and accuracy; accuracy alone has no meaning
  • Always print the confusion matrix via sklearn, or go all the way and seaborn it 🥇

Given the above, would it be more useful to do this within the notebook instead of a separate wonky CleanTwitter class?
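The measurement setup in the list above can be sketched in plain Python as follows. This is an illustration under assumptions, not the project's code: the helper names are hypothetical, and in practice scikit-learn's `KFold`, `f1_score(average="macro")` and `confusion_matrix` cover the same ground.

```python
import random
from typing import List, Sequence, Tuple


def kfold_indices(n: int, k: int = 5,
                  seed: int = 42) -> List[Tuple[List[int], List[int]]]:
    """Shuffle range(n) with a fixed seed and return k (train, test) splits."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::k] for i in range(k)]
    splits = []
    for i in range(k):
        train = [j for m, fold in enumerate(folds) if m != i for j in fold]
        splits.append((train, folds[i]))
    return splits


def confusion_counts(y_true: Sequence[int], y_pred: Sequence[int],
                     n_classes: int) -> List[List[int]]:
    """Rows are true classes, columns are predicted classes."""
    cm = [[0] * n_classes for _ in range(n_classes)]
    for t, p in zip(y_true, y_pred):
        cm[t][p] += 1
    return cm


def macro_f1(cm: List[List[int]]) -> float:
    """Unweighted mean of per-class F1, so every class counts equally."""
    scores = []
    for c in range(len(cm)):
        tp = cm[c][c]
        fp = sum(row[c] for row in cm) - tp
        fn = sum(cm[c]) - tp
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)
```

A typical loop would train on each `train` split, predict on the matching `test` split, and report the mean macro F1 across the five folds together with each fold's confusion matrix.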

Remove Filipino and non-Hinglish data

They have included Filipino tweets in a dataset of Hinglish tweets 🤦‍♂

I wonder how they assigned it a sentiment. Do they speak Filipino too?

The truncated tweets also indicate that these were longer, 280-character tweets that were truncated to 140 characters.

Example:

happy birthday seatmate thank you kasi masginanahan ako magaral kasi katabi ko kayo ni dom hahahhah thank you are https t co jyeeskiyf

which translates to:

happy birthday seatmate thank you for helping me study because i was with you hahahhah thank you are https t co jyeeskiyf
