nirantk / hinglish

Hinglish Text Classification

License: MIT License

Jupyter Notebook 57.53% Python 42.47%

natural-language-processing text-classification sentiment-analysis text-mining

hinglish's Introduction

Hinglish

Tools and Data

Data and Model files

Approach	LM Perplexity	Classifier F1
BERT	8.2	0.63
DistilBERT	6.5	0.63
ULMFIT	21	0.61
RoBERTa	7.54	0.64

Model Performance

Base LM	Dataset	Accuracy	Precision	Recall	F1	LM Perplexity
bert-base-multilingual-cased	Test	0.688	0.698	0.686	0.687	8.2
bert-base-multilingual-cased	Valid	0.62	0.592	0.605	0.55	8.2
distilbert-base-uncased	Test	0.693	0.694	0.703	0.698	6.51
distilbert-base-uncased	Valid	0.607	0.614	0.600	0.592	6.51
distilbert-base-multilingual-cased	Test	0.612	0.615	0.616	0.616	8.1
distilbert-base-multilingual-cased	Valid	0.55	0.531	0.537	0.495	8.1
roberta-base	Test	0.630	0.629	0.644	0.635	7.54
roberta-base	Valid	0.60	0.617	0.607	0.595	7.54
Ensemble	Test	0.714	0.718	0.718	0.718

Ensemble Performance

Model	Accuracy	Precision	Recall	F1	Config	Link to Model and output files
BERT	0.68866	0.69821	0.68608	0.6875	Batch Size: 16; Attention Dropout: 0.4; Learning Rate: 5e-07; Adam Epsilon: 1e-08; Hidden Dropout Probability: 0.3; Epochs: 3	BERT
DistilBert	0.69333	0.69496	0.70379	0.6982	Batch Size: 16; Attention Dropout: 0.6; Learning Rate: 3e-05; Adam Epsilon: 1e-08; Hidden Dropout Probability: 0.6; Epochs: 3	DistilBert
EnsembleBert1	0.69233	0.70236	0.69064	0.68952	Batch Size: 4; Attention Dropout: 0.7; Learning Rate: 5.01e-05; Adam Epsilon: 4.79e-05; Hidden Dropout Probability: 0.1; Epochs: 3	EnsembleBert1
EnsembleBert2	0.691	0.7009	0.6889	0.68872	Batch Size: 4; Attention Dropout: 0.6; Learning Rate: 5.13e-05; Adam Epsilon: 9.72e-05; Hidden Dropout Probability: 0.2; Epochs: 3	EnsembleBert2
EnsembleDistilBert1	0.70166	0.70377	0.70976	0.7061	Batch Size: 16; Attention Dropout: 0.8; Learning Rate: 3.02e-05; Adam Epsilon: 9.35e-05; Hidden Dropout Probability: 0.4; Epochs: 3	EnsembleDistilBert1
EnsembleDistilBert2	0.689	0.691	0.69666	0.69335	Batch Size: 4; Attention Dropout: 0.6; Learning Rate: 5.13e-05; Adam Epsilon: 9.72e-05; Hidden Dropout Probability: 0.2; Epochs: 3	EnsembleDistilBert2
EnsembleDistilBert3	0.69366	0.69538	0.70557	0.69905	Batch Size: 16; Attention Dropout: 0.4; Learning Rate: 4.74e-05; Adam Epsilon: 4.09e-05; Hidden Dropout Probability: 0.6; Epochs: 3	EnsembleDistilBert3
Ensemble	0.71466	0.71867	0.71853	0.7182	NA	Ensemble


hinglish's Issues

Data Augmentation using English to Hindi to Hinglish transliteration

Break down into smaller issues and then assign

Step 1: Find Sentiment data for English tweets
Step 2: Translate that to Hindi using something like TextBlob
Step 3: Transliterate that Hindi to Roman Script: https://github.com/libindic/indic-trans
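
The three steps above could be wired together roughly as follows. This is only a minimal sketch, not code from this repo: `augment`, `toy_translate` and `toy_transliterate` are hypothetical names, and the two toy lookup functions merely stand in for a real translation service and a transliterator such as indic-trans. The point it illustrates is that the sentiment label of the English tweet is carried over unchanged to the Romanized Hindi text.

```python
from typing import Callable, Iterable, List, Tuple

Example = Tuple[str, str]  # (text, sentiment label)


def augment(
    english_examples: Iterable[Example],
    translate_en_to_hi: Callable[[str], str],
    transliterate_hi_to_roman: Callable[[str], str],
) -> List[Example]:
    """Turn labelled English tweets into labelled Romanized-Hindi tweets."""
    augmented = []
    for text, label in english_examples:
        hindi = translate_en_to_hi(text)          # Step 2: English -> Hindi (Devanagari)
        roman = transliterate_hi_to_roman(hindi)  # Step 3: Devanagari -> Roman script
        augmented.append((roman, label))          # the original label is reused as-is
    return augmented


# Toy stand-ins so the sketch runs without external services.
# They are assumptions for illustration, not real translation or transliteration.
_TOY_EN_HI = {"good": "अच्छा", "day": "दिन"}
_TOY_HI_ROMAN = {"अच्छा": "accha", "दिन": "din"}


def toy_translate(text: str) -> str:
    return " ".join(_TOY_EN_HI.get(word, word) for word in text.lower().split())


def toy_transliterate(text: str) -> str:
    return " ".join(_TOY_HI_ROMAN.get(word, word) for word in text.split())


samples = [("good day", "positive")]
augmented_samples = augment(samples, toy_translate, toy_transliterate)
print(augmented_samples)
```

Passing the translator and transliterator in as callables keeps the pipeline independent of whichever libraries end up being used for Steps 2 and 3.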

Annotation process on Hinglish dataset

Hello guys,
I am trying to understand how to annotate Hinglish data so that it is useful for fine-tuning or building a language model. In your train.json file, I saw that text is the tweet and clean_text is the preprocessed text. What about the Hindi words in the tweet? Are they tagged? In one article I read, Hinglish tweets were annotated with their corresponding English sentences. In another article, each word in a Hinglish sentence was tagged with its language, and the whole sentence was tagged with a sentiment label or a label for some other classification problem.

I just want to know what process was used to annotate the training data. Can you please tell me?

Make submissions to Codalab with NB-SVM and LR-TF-IDF

Without confidence learning
With confidence learning

Experiment with W&B (Weights & Biases) for Experimentation Tooling

Integrate with this repo if needed
Make sure we both have access to the W&B dashboard via GitHub accounts

https://www.wandb.com/

Find Top 200 samples to be most likely wrong using Cleanlab

We will use these results to manually verify how far this dataset's labels can be trusted.

E.g. if the error rate is, say, more than 5%, we will pass this value in during test-time prediction, since the test set will hopefully have the same label error rate.
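
To make the "top 200 most likely wrong" idea concrete, here is a simplified sketch of the ranking step. It is not Cleanlab's confident learning procedure, which additionally estimates per-class thresholds; it only ranks samples by self-confidence, i.e. the predicted probability of the label the dataset assigns. `rank_suspect_labels` is a hypothetical helper, the toy classes are made up, and `pred_probs` is assumed to come from out-of-fold predictions so that the model has not seen the label it is judging.

```python
from typing import List, Sequence


def rank_suspect_labels(
    labels: Sequence[int],
    pred_probs: Sequence[Sequence[float]],
    top_k: int = 200,
) -> List[int]:
    """Return indices of the top_k samples whose given label looks least likely.

    labels[i] is the dataset label of sample i; pred_probs[i][c] is the
    predicted probability of class c for sample i.
    """
    # Self-confidence: probability the model assigns to the given label.
    self_confidence = [probs[label] for label, probs in zip(labels, pred_probs)]
    # Lowest self-confidence first: these are the most suspicious labels.
    order = sorted(range(len(labels)), key=lambda i: self_confidence[i])
    return order[:top_k]


# Toy data with three classes (0, 1, 2).
labels = [0, 2, 1, 2]
pred_probs = [
    [0.80, 0.15, 0.05],
    [0.70, 0.20, 0.10],  # labelled 2, but the model strongly prefers 0
    [0.30, 0.40, 0.30],
    [0.05, 0.15, 0.80],
]
suspects = rank_suspect_labels(labels, pred_probs, top_k=2)
print(suspects)
```

The returned indices are the samples to check by hand first.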

Create verloop Org on huggingface models

Remove URL artifacts

The people who preprocessed this data don't know how to use regex to remove URLs completely.

They only stripped the special characters without removing the URL itself, which leaves residue such as: https t co tsrsbu
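
A possible cleanup, sketched with Python's `re` module. It handles both cases: intact URLs (which should be removed before any punctuation stripping) and the already-mangled `https t co ...` residue seen in this dataset. The function name `remove_urls` and both patterns are assumptions for illustration; the second pattern only targets mangled t.co short links, so other mangled domains would need their own pattern.

```python
import re

# Matches full URLs while the punctuation is still intact.
URL_PATTERN = re.compile(r"(?:https?://|www\.)\S+", flags=re.IGNORECASE)

# Matches the residue left when punctuation was stripped first,
# e.g. "https t co tsrsbu" (a mangled https://t.co/... short link).
MANGLED_TCO_PATTERN = re.compile(r"\bhttps? t co \w+", flags=re.IGNORECASE)


def remove_urls(text: str) -> str:
    """Remove intact URLs and mangled t.co residue, then collapse whitespace."""
    text = URL_PATTERN.sub(" ", text)
    text = MANGLED_TCO_PATTERN.sub(" ", text)
    return " ".join(text.split())


print(remove_urls("thank you are https t co jyeeskiyf"))
print(remove_urls("nice pic https://t.co/tsrsbu lol"))
```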

Use Macro F1 score with 5-fold CV instead of accuracy for all measurements

Use a randomized 5-fold CV, with a fixed seed state instead of a test split
Measure F1 and accuracy - accuracy alone has no meaning
Always print the confusion matrix via sklearn, or go all the way and seaborn it 🥇

Given the above, would it be more useful to do this within the notebook instead of a separate wonky CleanTwitter class?
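
For reference, the evaluation protocol proposed in this issue (seeded, shuffled 5-fold CV scored with macro F1) can be sketched in plain Python as below. This is a dependency-free illustration, not the repo's code: `macro_f1`, `kfold_indices` and `cross_validate` are hypothetical helpers, the seed value is arbitrary, and `fit_predict` is an assumed callable that trains on the training fold and returns predictions for the validation fold.

```python
import random
from typing import Callable, List, Sequence, Tuple


def macro_f1(y_true: Sequence[str], y_pred: Sequence[str]) -> float:
    """Unweighted mean of the per-class F1 scores."""
    classes = sorted(set(y_true) | set(y_pred))
    scores = []
    for cls in classes:
        tp = sum(t == cls and p == cls for t, p in zip(y_true, y_pred))
        fp = sum(t != cls and p == cls for t, p in zip(y_true, y_pred))
        fn = sum(t == cls and p != cls for t, p in zip(y_true, y_pred))
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


def kfold_indices(n: int, k: int = 5, seed: int = 42) -> List[Tuple[List[int], List[int]]]:
    """Shuffle once with a fixed seed, then cut into k (train, valid) splits."""
    indices = list(range(n))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]
    return [
        ([i for j, fold in enumerate(folds) if j != held_out for i in fold], folds[held_out])
        for held_out in range(k)
    ]


def cross_validate(
    texts: Sequence[str],
    labels: Sequence[str],
    fit_predict: Callable[[List[str], List[str], List[str]], List[str]],
    k: int = 5,
    seed: int = 42,
) -> float:
    """Average macro F1 over k folds."""
    fold_scores = []
    for train_idx, valid_idx in kfold_indices(len(texts), k, seed):
        preds = fit_predict(
            [texts[i] for i in train_idx],
            [labels[i] for i in train_idx],
            [texts[i] for i in valid_idx],
        )
        fold_scores.append(macro_f1([labels[i] for i in valid_idx], preds))
    return sum(fold_scores) / len(fold_scores)
```

With scikit-learn available, the same protocol maps onto `KFold(n_splits=5, shuffle=True, random_state=...)`, `f1_score(..., average="macro")` and `confusion_matrix` for the per-fold confusion matrix.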

Merge trial data to train data, to create a larger training set

Remove Filipino and non-Hinglish data

They have included Filipino tweets in a dataset of Hinglish tweets 🤦‍♂

I wonder how they assigned it a sentiment - do they speak Filipino too?

The truncated text also indicates that this was a longer 280-character tweet, which they truncated to 140 characters.

Example:

happy birthday seatmate thank you kasi masginanahan ako magaral kasi katabi ko kayo ni dom hahahhah thank you are https t co jyeeskiyf

which translates to:

happy birthday seatmate thank you for helping me study because i was with you hahahhah thank you are https t co jyeeskiyf

Should we tokenize as a batch instead of one sentence at a time?

Reference Code: https://github.com/NirantK/Hinglish/blob/2798fda87b9b28fa1d7921203ed466c9fd23a28d/hinglishutils.py#L401-#L408

Recommended approach from the Hugging Face documentation:

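from transformers import AutoTokenizer

# Assumption: any pretrained checkpoint works here; this one is listed in the results above.
tokenizer = AutoTokenizer.from_pretrained("bert-base-multilingual-cased")
batch_sentences = ["Hello I'm a single sentence",
                   "And another sentence",
                   "And the very very last one"]
# Passing a list tokenizes the whole batch in one call.
encoded_inputs = tokenizer(batch_sentences)
print(encoded_inputs)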