This repository contains the code for our team's submission to the Nitro Natural Language Processing (NLP) competition hosted on Kaggle. The competition challenged participants to develop a pipeline for identifying sexist text in the Romanian language.
The task in this competition was to classify each text into one of five possible categories: (0) Sexist Direct, (1) Sexist Descriptive, (2) Sexist Reporting, (3) Non-sexist Offensive, and (4) Non-sexist Non-offensive.
- Sexist:
  - Direct: The post contains sexist elements and is directly addressed to a specific gender, usually women.
  - Descriptive: The post describes one or more individuals, usually a woman or a group of women, in a sexist manner without directly addressing them.
  - Reporting: The post reports a witnessed or heard sexist act from other sources.
- Non-sexist:
  - Offensive: The post does not contain sexist connotations but includes offensive language.
  - Non-offensive: There are no sexist or offensive elements in the post.
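For reference, the five categories can be encoded as a simple id-to-label mapping (the label strings below are illustrative; the exact names used in the code may differ):

```python
# Illustrative mapping of competition label ids to category names.
ID2LABEL = {
    0: "sexist_direct",
    1: "sexist_descriptive",
    2: "sexist_reporting",
    3: "non_sexist_offensive",
    4: "non_sexist_non_offensive",
}
# Reverse mapping, useful when preparing training labels.
LABEL2ID = {name: idx for idx, name in ID2LABEL.items()}
```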
The data for this competition has been collected from a variety of sources, including social media networks such as Facebook, Twitter, and Reddit, web articles, and books.
The data used in this project includes instances of sexism and hate speech; therefore, reader discretion is strongly advised. The contributors to this project strongly oppose discrimination based on gender, religion, race, or any other characteristic. One of the goals of this project is to raise awareness about gender bias online.
The training dataset provided for this competition consists of 40,000 text files from CoRoSeOf: An Annotated Corpus of Romanian Sexist and Offensive Language, while the test set comprises 3,130 text files.
Participants were expected to use the training data to build a pipeline that can accurately classify the text documents in the test set into the appropriate category.
The submission was evaluated based on weighted accuracy, with ties broken by the count of false negatives in identifying offensive language.
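Balanced (class-weighted) accuracy is simply the mean of the per-class recalls, so each category counts equally regardless of how many samples it has. A minimal reference implementation on toy data:

```python
from collections import defaultdict

def balanced_accuracy(y_true, y_pred):
    """Mean of per-class recalls: every class contributes equally,
    no matter how many samples it has."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for t, p in zip(y_true, y_pred):
        total[t] += 1
        correct[t] += int(t == p)
    recalls = [correct[c] / total[c] for c in total]
    return sum(recalls) / len(recalls)

# Toy example: class 4 ("non-offensive") dominates the sample.
y_true = [0, 0, 4, 4, 4, 4, 3, 2]
y_pred = [0, 4, 4, 4, 4, 4, 3, 2]
score = balanced_accuracy(y_true, y_pred)  # (0.5 + 1.0 + 1.0 + 1.0) / 4
```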
Our team's approach consisted of the following steps:

- **Data sanitization**: We removed irrelevant information from the dataset, ensuring that it only contained data relevant for text classification.
- **Fine-tuning Romanian BERT**: We fine-tuned the Romanian BERT model on the training data to improve its performance on the downstream task.
- **Class-weighted cross-entropy loss**: We tackled the imbalanced dataset by adjusting the cross-entropy loss function based on the weights of each category. This allowed us to give more weight to underrepresented categories and improve the overall performance of our model.
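The class-weighting step can be sketched with PyTorch's `CrossEntropyLoss`, which accepts a per-class `weight` tensor. The class counts below are made up for illustration; the real counts would come from the CoRoSeOf training data:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Hypothetical per-class sample counts from an imbalanced training set.
class_counts = torch.tensor([4000.0, 9000.0, 1000.0, 6000.0, 20000.0])

# Inverse-frequency weights: rare classes receive larger weights.
weights = class_counts.sum() / (len(class_counts) * class_counts)

loss_fn = nn.CrossEntropyLoss(weight=weights)

logits = torch.randn(8, 5)            # a batch of 8 examples, 5 classes
labels = torch.randint(0, 5, (8,))    # gold labels for the batch
loss = loss_fn(logits, labels)
```

With inverse-frequency weighting, a mistake on the rarest class (index 2) costs the model far more than one on the dominant class (index 4).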
Our team achieved 4th place out of 46 teams in the competition, obtaining 56.84% balanced accuracy (the competition's chosen metric) on the private test set. Our approach proved effective in achieving a high level of accuracy on this challenging task.
Ground Truth | Predicted | Text |
---|---|---|
non-offensive | descriptive | Bărbații sunt de obicei mai înalți decât femeile, în medie |
non-offensive | non-offensive | Toate genurile merită să se simtă în siguranță în comunitățile lor |
non-offensive | non-offensive | Am o nevastă și trei copii. |
offensive | offensive | Sa mi bag p**a in toate numerel de inmatriculare de pe planeta asta |
offensive | offensive | Învață bă unde sunt tastele a**l-f***tule |
offensive | direct | Sa ma sugi si de p**a si de p***s |
descriptive | descriptive | Care e diferența dintre o femeie și o vacă? Vacii nu se enervează când le spui că au nevoie de mai mult lapte. |
descriptive | descriptive | Bărbații sunt mai potriviți pentru poziții de putere. |
descriptive | descriptive | Femeile au locul în bucătărie |
direct | direct | Sa iti dau cu p**a peste c*r pana iti raman urme rosii |
direct | direct | Nu vezi fă cât ești de grasă că te scoate cu elicopterul dacă ai căzut în gaura de canalizare |
direct | direct | Sunt 20 de grade afară, dar p***a ta are mai multe grade de alcoolemie după ce am stropit-o pe față cu șampanie |
reporting | reporting | Normalizarea hărțuirii și a agresiunii sexuale, adesea prin prezentarea personajelor feminine ca fiind dispuse sau meritând un astfel de tratament |
reporting | reporting | O tanara a fost v**lata de catre un fost iubit. |
reporting | descriptive | Femeilor li se refuză dreptul de a deține proprietate sau de a avea controlul asupra propriilor finanțe în multe societăți |
As we can see in the table above, the model works most of the time, but because of the challenge's metric choice it tends to produce false positives (predicting sexist or offensive instead of the most common label, non-offensive). In practice, more data would be needed and a higher decision threshold would be set before flagging a comment as sexist or offensive.
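Such a threshold could be applied to the model's softmax output: instead of flagging whenever a flaggable class is the argmax, flag only when the combined probability mass of the sexist/offensive classes clears a high cutoff. The logits and cutoff below are illustrative:

```python
import math

def softmax(logits):
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical logits for one comment; index 4 is "non-sexist non-offensive".
logits = [1.2, 0.2, 0.1, 0.3, 1.1]
probs = softmax(logits)

# Flag only when the combined probability of the four flaggable
# classes clears a high cutoff, instead of trusting the raw argmax.
THRESHOLD = 0.75
flag_prob = 1.0 - probs[4]
flagged = flag_prob >= THRESHOLD
```

Here the argmax is a flaggable class, yet the comment is not flagged because the combined flaggable probability stays below the cutoff, which is exactly how a higher threshold trades false positives for false negatives.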
We also attempted to improve our results through ensemble techniques and backtranslation.

- However, we found that our approach using fine-tuned Romanian BERT with class-weighted cross-entropy loss provided the best results.
- Regarding backtranslation, although we tried to augment the dataset, we encountered difficulties due to the sexist and offensive nature of the language: the back-translated phrases did not contain the original profanity, which resulted in limited improvement.
- We believe that further improvements could be made by better sanitizing the dataset.
- Using a model that has been specifically pretrained on similar types of text could also be beneficial.
- Additionally, one could explore data augmentation; an approach similar to Easy Data Augmentation (EDA) could be implemented to evaluate its effectiveness.
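Two of the EDA operations (random swap and random deletion) need no language resources and could be sketched as follows; the sample sentence is only an example:

```python
import random

def random_swap(tokens, n_swaps=1, seed=None):
    """EDA-style random swap: exchange two token positions n_swaps times."""
    rng = random.Random(seed)
    tokens = list(tokens)
    for _ in range(n_swaps):
        i, j = rng.sample(range(len(tokens)), 2)
        tokens[i], tokens[j] = tokens[j], tokens[i]
    return tokens

def random_deletion(tokens, p=0.1, seed=None):
    """EDA-style random deletion: drop each token with probability p."""
    rng = random.Random(seed)
    kept = [t for t in tokens if rng.random() > p]
    # Never return an empty sentence; keep at least one token.
    return kept if kept else [rng.choice(tokens)]

sentence = "femeile merita respect egal la locul de munca".split()
augmented = random_swap(sentence, n_swaps=2, seed=0)
```

The remaining EDA operation, synonym replacement, would additionally require a Romanian synonym resource, which is a further assumption.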
.
├── .devcontainer <- Dev Container Setup
├── .github <- Github Workflows
├── .project-root <- Used to identify the project root
├── .vscode <- Visual Studio Code settings
├── configs <- Hydra configs
│ ├── data <- Dataset configs
│ ├── hparams_search <- Hyperparameter Search configs
│ ├── hydra <- Hydra runtime configs
│ ├── model <- Model configs
│ ├── paths <- Commonly used project paths
│ ├── predict.yaml <- predict.py configs
│ ├── test.yaml <- test.py configs
│ ├── train.yaml <- train.py configs
│ └── trainer <- Transformers trainer configs
├── data <- Datasets
│ └── ro <- Romanian language
│ ├── predict_example.txt <- Small sample to be used with predict.py
│ ├── test_data.csv <- CoRoSeOf Test Data
│ └── train_data.csv <- CoRoSeOf Training Data
├── experiments <- Experiments directory
│ └── train
│ ├── multiruns <- Hyperparameter search
│ └── runs <- Single experiments
├── notebooks
│ └── hackathon_notebook.ipynb <- The original hackathon notebook
├── predictions <- Results from predict.py
├── src <- Source code
│ ├── __init__.py
│ ├── data <- Dataset related
│ │ └── coroseof_datamodule.py
│ ├── predict.py
│ ├── test.py
│ ├── train.log
│ ├── train.py
│ ├── trainers
│ │ └── imbalanced_dataset_trainer.py <- Custom trainer with class weights
│ └── utils
│ └── config.py <- Custom Omegaconf resolvers
├── submissions <- Results from test.py
└── tests <- Tests directory
Thanks to our devcontainer setup you can run our model right here on GitHub. Just create a new codespace and follow the steps below! Keep in mind that training requires a GPU, which is not available in GitHub Codespaces, so you might want to use Visual Studio Code locally for that.
Warning: if running in GitHub Codespaces (or without an available GPU), you need to comment out the `"runArgs": ["--gpus", "all"]` line in the `.devcontainer/devcontainer.json` file. Otherwise Docker will raise an error and your container will not start.
Keep in mind that a prediction will be made for each line of text.
# predict using text from a file
cat data/ro/predict_example.txt | python src/predict.py --models cosminc98/sexism-identification-coroseof
# see the results
cat predictions/prediction.tsv
# predict from stdin; after entering the command, write as many sentences
# as you want and end with the [EOF] marker:
# Femeile au locul în bucătărie
# [EOF]
python src/predict.py --models cosminc98/sexism-identification-coroseof
# run a single training run with CoRoSeOf dataset and default hyperparameters
# the model will be available in experiments/train/runs/
python src/train.py
# run hyperparameter search with the Optuna plugin from Hydra
# the model will be available in experiments/train/multiruns/
python src/train.py -m hparams_search=optuna
The model can also be run on the Kaggle test set to produce a submission file:
# predict on the test set
python src/test.py --models cosminc98/sexism-identification-coroseof
Now all you need to do is upload "submissions/submission.csv" to Kaggle.
A pretrained model for the Romanian language is already available on the Hugging Face Hub.
pip install huggingface_hub
# add your write token using
huggingface-cli login
echo "
push_to_hub: True
hub_model_id: \"<model-name>\"
" >> configs/trainer/default.yaml
If you have any questions about our approach or our code, please feel free to contact us at:
Ștefan-Cosmin Ciocan 💻 📖🔬 |
Iulian Taiatu 💻 📖 🔬 |
AndreiDumitrescu99 💻 🔬 |
This project follows the all-contributors specification. Contributions of any kind welcome!