This repository contains the code for our team's submission to the Nitro Natural Language Processing (NLP) competition hosted on Kaggle. The competition challenged participants to develop a pipeline for identifying sexist text in the Romanian language.
The task in this competition was to classify each text into one of five possible categories: (0) Sexist Direct, (1) Sexist Descriptive, (2) Sexist Reporting, (3) Non-sexist Offensive, and (4) Non-sexist Non-offensive.
- Sexist:
  - Direct: The post contains sexist elements and is directly addressed to a specific gender, usually women.
  - Descriptive: The post describes one or more individuals, usually a woman or a group of women, in a sexist manner without directly addressing them.
  - Reporting: The post reports a witnessed or heard sexist act from other sources.
- Non-sexist:
  - Offensive: The post does not contain sexist connotations but includes offensive language.
  - Non-offensive: There are no sexist or offensive elements in the post.
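For reference, the five categories can be encoded as a simple id-to-label mapping (the label strings below are illustrative; the exact names used in the code may differ):

```python
# Illustrative mapping of competition label ids to category names.
ID2LABEL = {
    0: "sexist_direct",
    1: "sexist_descriptive",
    2: "sexist_reporting",
    3: "non_sexist_offensive",
    4: "non_sexist_non_offensive",
}
# Reverse mapping, useful when preparing training labels.
LABEL2ID = {name: idx for idx, name in ID2LABEL.items()}
```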
The data for this competition has been collected from a variety of sources, including social media networks such as Facebook, Twitter, and Reddit, web articles, and books.
The data used in this project includes instances of sexism and hate speech; therefore, reader discretion is strongly advised. The contributors to this project strongly oppose discrimination based on gender, religion, race, or any other characteristic. One of the goals of this project is to raise awareness about gender bias online.
The training dataset provided for this competition consists of 40,000 text files from CoRoSeOf: An Annotated Corpus of Romanian Sexist and Offensive Language, while the test set comprises 3,130 text files.
Participants were expected to use the training data to build a pipeline that can accurately classify the text documents in the test set into the appropriate category.
The submission was evaluated based on weighted accuracy, with ties broken by the count of false negatives in identifying offensive language.
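Balanced (class-weighted) accuracy is simply the mean of the per-class recalls, so each category counts equally regardless of how many samples it has. A minimal reference implementation on toy data:

```python
from collections import defaultdict

def balanced_accuracy(y_true, y_pred):
    """Mean of per-class recalls: every class contributes equally,
    no matter how many samples it has."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for t, p in zip(y_true, y_pred):
        total[t] += 1
        correct[t] += int(t == p)
    recalls = [correct[c] / total[c] for c in total]
    return sum(recalls) / len(recalls)

# Toy example: class 4 ("non-offensive") dominates the sample.
y_true = [0, 0, 4, 4, 4, 4, 3, 2]
y_pred = [0, 4, 4, 4, 4, 4, 3, 2]
score = balanced_accuracy(y_true, y_pred)  # (0.5 + 1.0 + 1.0 + 1.0) / 4
```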
Our team's approach consisted of the following steps:

- **Data sanitization**: We removed irrelevant information from the dataset, ensuring that it only contained data relevant for text classification.
- **Fine-tuning Romanian BERT**: We fine-tuned the Romanian BERT model on the training data to improve its performance on the downstream task.
- **Class-weighted cross-entropy loss**: We tackled the imbalanced dataset by adjusting the cross-entropy loss function based on the weights of each category. This allowed us to give more weight to underrepresented categories and improve the overall performance of our model.
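The class-weighting step can be sketched with PyTorch's `CrossEntropyLoss`, which accepts a per-class `weight` tensor. The class counts below are made up for illustration; the real counts would come from the CoRoSeOf training data:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Hypothetical per-class sample counts from an imbalanced training set.
class_counts = torch.tensor([4000.0, 9000.0, 1000.0, 6000.0, 20000.0])

# Inverse-frequency weights: rare classes receive larger weights.
weights = class_counts.sum() / (len(class_counts) * class_counts)

loss_fn = nn.CrossEntropyLoss(weight=weights)

logits = torch.randn(8, 5)            # a batch of 8 examples, 5 classes
labels = torch.randint(0, 5, (8,))    # gold labels for the batch
loss = loss_fn(logits, labels)
```

With inverse-frequency weighting, a mistake on the rarest class (index 2) costs the model far more than one on the dominant class (index 4).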
Our team achieved 4th place out of 46 teams in the competition, obtaining 56.84% balanced accuracy (the competition's chosen metric) on the private test set. Our approach proved effective in achieving a high level of accuracy on this challenging task.
Ground Truth | Predicted | Text |
---|---|---|
non-offensive | descriptive | Bărbații sunt de obicei mai înalți decât femeile, în medie |
non-offensive | non-offensive | Toate genurile merită să se simtă în siguranță în comunitățile lor |
non-offensive | non-offensive | Am o nevastă și trei copii. |
offensive | offensive | Sa mi bag p**a in toate numerel de inmatriculare de pe planeta asta |
offensive | offensive | Învață bă unde sunt tastele a**l-f***tule |
offensive | direct | Sa ma sugi si de p**a si de p***s |
descriptive | descriptive | Care e diferența dintre o femeie și o vacă? Vacii nu se enervează când le spui că au nevoie de mai mult lapte. |
descriptive | descriptive | Bărbații sunt mai potriviți pentru poziții de putere. |
descriptive | descriptive | Femeile au locul în bucătărie |
direct | direct | Sa iti dau cu p**a peste c*r pana iti raman urme rosii |
direct | direct | Nu vezi fă cât ești de grasă că te scoate cu elicopterul dacă ai căzut în gaura de canalizare |
direct | direct | Sunt 20 de grade afară, dar p***a ta are mai multe grade de alcoolemie după ce am stropit-o pe față cu șampanie |
reporting | reporting | Normalizarea hărțuirii și a agresiunii sexuale, adesea prin prezentarea personajelor feminine ca fiind dispuse sau meritând un astfel de tratament |
reporting | reporting | O tanara a fost v**lata de catre un fost iubit. |
reporting | descriptive | Femeilor li se refuză dreptul de a deține proprietate sau de a avea controlul asupra propriilor finanțe în multe societăți |
As we can see in the table above, the model works most of the time, but because of the challenge's metric choice it tends to produce false positives (predicting sexist or offensive instead of the most common label, non-offensive). In practice, more data would be needed and a higher decision threshold would be set before flagging a comment as sexist or offensive.
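Such a threshold could be applied to the model's softmax output: instead of flagging whenever a flaggable class is the argmax, flag only when the combined probability mass of the sexist/offensive classes clears a high cutoff. The logits and cutoff below are illustrative:

```python
import math

def softmax(logits):
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical logits for one comment; index 4 is "non-sexist non-offensive".
logits = [1.2, 0.2, 0.1, 0.3, 1.1]
probs = softmax(logits)

# Flag only when the combined probability of the four flaggable
# classes clears a high cutoff, instead of trusting the raw argmax.
THRESHOLD = 0.75
flag_prob = 1.0 - probs[4]
flagged = flag_prob >= THRESHOLD
```

Here the argmax is a flaggable class, yet the comment is not flagged because the combined flaggable probability stays below the cutoff, which is exactly how a higher threshold trades false positives for false negatives.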
We also attempted to improve our results through ensemble techniques and backtranslation.

- However, we found that our approach using fine-tuned Romanian BERT with class-weighted cross-entropy loss provided the best results.
- Regarding backtranslation, although we tried to augment the dataset, we encountered difficulties due to the sexist and offensive nature of the language: the back-translated phrases did not contain the original profanity, which resulted in limited improvement.
- We believe that further improvements could be made by better sanitizing the dataset.
- Using a model that has been specifically pretrained on similar types of text could also be beneficial.
- Additionally, one could explore data augmentation; an approach similar to Easy Data Augmentation (EDA) could be implemented to evaluate its effectiveness.
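Two of the EDA operations (random swap and random deletion) need no language resources and could be sketched as follows; the sample sentence is only an example:

```python
import random

def random_swap(tokens, n_swaps=1, seed=None):
    """EDA-style random swap: exchange two token positions n_swaps times."""
    rng = random.Random(seed)
    tokens = list(tokens)
    for _ in range(n_swaps):
        i, j = rng.sample(range(len(tokens)), 2)
        tokens[i], tokens[j] = tokens[j], tokens[i]
    return tokens

def random_deletion(tokens, p=0.1, seed=None):
    """EDA-style random deletion: drop each token with probability p."""
    rng = random.Random(seed)
    kept = [t for t in tokens if rng.random() > p]
    # Never return an empty sentence; keep at least one token.
    return kept if kept else [rng.choice(tokens)]

sentence = "femeile merita respect egal la locul de munca".split()
augmented = random_swap(sentence, n_swaps=2, seed=0)
```

The remaining EDA operation, synonym replacement, would additionally require a Romanian synonym resource, which is a further assumption.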
.
├── .devcontainer <- Dev Container Setup
├── .github <- Github Workflows
├── .project-root <- Used to identify the project root
├── .vscode <- Visual Studio Code settings
├── configs <- Hydra configs
│ ├── data <- Dataset configs
│ ├── hparams_search <- Hyperparameter Search configs
│ ├── hydra <- Hydra runtime configs
│ ├── model <- Model configs
│ ├── paths <- Commonly used project paths
│ ├── predict.yaml <- predict.py configs
│ ├── test.yaml <- test.py configs
│ ├── train.yaml <- train.py configs
│ └── trainer <- Transformers trainer configs
├── data <- Datasets
│ └── ro <- Romanian language
│ ├── predict_example.txt <- Small sample to be used with predict.py
│ ├── test_data.csv <- CoRoSeOf Test Data
│ └── train_data.csv <- CoRoSeOf Training Data
├── experiments <- Experiments directory
│ └── train
│ ├── multiruns <- Hyperparameter search
│ └── runs <- Single experiments
├── notebooks
│ └── hackathon_notebook.ipynb <- The original hackathon notebook
├── predictions <- Results from predict.py
├── src <- Source code
│ ├── __init__.py
│ ├── data <- Dataset related
│ │ └── coroseof_datamodule.py
│ ├── predict.py
│ ├── test.py
│ ├── train.log
│ ├── train.py
│ ├── trainers
│ │ └── imbalanced_dataset_trainer.py <- Custom trainer with class weights
│ └── utils
│ └── config.py <- Custom Omegaconf resolvers
├── submissions <- Results from test.py
└── tests <- Tests directory
Thanks to our devcontainer setup you can run our model right here on GitHub. Just create a new codespace and follow the steps below! Keep in mind that training requires a GPU, which is not available in GitHub Codespaces, so you might want to use Visual Studio Code locally for that.
Warning: if running in GitHub Codespaces (or without an available GPU), you need to comment out the `"runArgs": ["--gpus", "all"]` line in the `.devcontainer/devcontainer.json` file. Otherwise Docker will raise an error and your container will not start.
Keep in mind that a prediction will be made for each line of text.
# predict using text from a file
cat data/ro/predict_example.txt | python src/predict.py --models cosminc98/sexism-identification-coroseof
# see the results
cat predictions/prediction.tsv
# predict from stdin; after entering the command, write as many sentences
# as you want and end with the [EOF] marker:
# Femeile au locul în bucătărie
# [EOF]
python src/predict.py --models cosminc98/sexism-identification-coroseof
# run a single training run with CoRoSeOf dataset and default hyperparameters
# the model will be available in experiments/train/runs/
python src/train.py
# run hyperparameter search with the Optuna plugin from Hydra
# the model will be available in experiments/train/multiruns/
python src/train.py -m hparams_search=optuna
The model can also be run on the Kaggle test set to produce a submission file:
# predict on the test set
python src/test.py --models cosminc98/sexism-identification-coroseof
Now all you need to do is upload "submissions/submission.csv" to Kaggle.
A pretrained model for the Romanian language is already available on the Hugging Face Hub.
pip install huggingface_hub
# add your write token using
huggingface-cli login
echo "
push_to_hub: True
hub_model_id: \"<model-name>\"
" >> configs/trainer/default.yaml
If you have any questions about our approach or our code, please feel free to contact us at:
Ștefan-Cosmin Ciocan 💻 📖🔬 |
Iulian Taiatu 💻 📖 🔬 |
AndreiDumitrescu99 💻 🔬 |
This project follows the all-contributors specification. Contributions of any kind welcome!