Text Classification in Romanian Language

This repository contains the code for our team's submission to Nitro, a Natural Language Processing (NLP) competition hosted on Kaggle. The competition challenged participants to develop a pipeline for identifying sexist text in the Romanian language.

Competition Details

The task in this competition was to classify each text into one of five possible categories: (0) Sexist Direct, (1) Sexist Descriptive, (2) Sexist Reporting, (3) Non-sexist Offensive, and (4) Non-sexist Non-offensive.

  • Sexist:
    • Direct: The post contains sexist elements and is directly addressed to a specific gender, usually women.
    • Descriptive: The post describes one or more individuals, usually a woman or a group of women, in a sexist manner without directly addressing them.
    • Reporting: The post reports a witnessed or heard sexist act from other sources.
  • Non-sexist:
    • Offensive: The post does not contain sexist connotations but includes offensive language.
    • Non-offensive: There are no sexist or offensive elements in the post.
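For reference, the numeric labels above can be written as a simple mapping (an illustrative sketch; the exact encoding used by the code lives in the data module):

# Label indices as defined by the competition (0-4); the names are illustrative.
ID2LABEL = {
    0: "sexist-direct",
    1: "sexist-descriptive",
    2: "sexist-reporting",
    3: "non-sexist-offensive",
    4: "non-sexist-non-offensive",
}
LABEL2ID = {name: idx for idx, name in ID2LABEL.items()}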

The data for this competition was collected from a variety of sources, including social media networks (Facebook, Twitter, and Reddit), web articles, and books.

Disclaimer

The data used in this project includes instances of sexism and hate speech; therefore, reader discretion is strongly advised. The contributors to this project strongly oppose discrimination of any kind, whether based on gender, religion, race, or anything else. One of the goals of this project is to raise awareness about gender bias online.

Training Data

The training dataset provided for this competition consists of 40,000 text samples from CoRoSeOf: An annotated Corpus of Romanian Sexist and Offensive Language, while the test set comprises 3,130 text samples.

Participants were expected to use the training data to build a pipeline that can accurately classify the text documents in the test set into the appropriate category.

Submissions were evaluated on weighted accuracy, with ties broken by the number of false negatives in identifying offensive language.
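The Results section below reports balanced accuracy, i.e. the unweighted mean of per-class recall. For reference, it can be computed with scikit-learn (a usage sketch, not part of this repository's code):

# balanced accuracy = average of recall over the five classes
from sklearn.metrics import balanced_accuracy_score

y_true = [4, 0, 1, 4, 2]   # ground-truth class indices (0-4)
y_pred = [4, 0, 4, 4, 2]   # model predictions
print(f"balanced accuracy: {balanced_accuracy_score(y_true, y_pred):.2f}")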

Our Approach

Our team's approach consisted of the following steps:

  • Data sanitization: we removed irrelevant information from the dataset so that only text useful for the classification task remained.

  • Fine-tuning Romanian BERT: we fine-tuned a pre-trained Romanian BERT model on the training data for the five-class downstream task.

  • Class-weighted cross-entropy loss: we tackled the dataset imbalance by weighting the cross-entropy loss per category, so that underrepresented categories contribute more to the loss; this improved the overall performance of our model (a sketch follows below).
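The sketch below follows the standard Hugging Face pattern of overriding Trainer.compute_loss with a class-weighted cross-entropy. The project's actual implementation lives in src/trainers/imbalanced_dataset_trainer.py and may differ in detail; how the weights are computed (e.g. inverse class frequency) is left to the caller here.

# Sketch only: class-weighted cross-entropy inside a Hugging Face Trainer.
import torch
from torch import nn
from transformers import Trainer

class ClassWeightedTrainer(Trainer):
    def __init__(self, class_weights, **kwargs):
        super().__init__(**kwargs)
        # class_weights: tensor of shape (num_labels,), e.g. inverse class frequencies
        self.class_weights = class_weights

    def compute_loss(self, model, inputs, return_outputs=False, **kwargs):
        labels = inputs.pop("labels")
        outputs = model(**inputs)
        logits = outputs.logits
        # Underrepresented classes receive a larger contribution to the loss
        loss_fct = nn.CrossEntropyLoss(weight=self.class_weights.to(logits.device))
        loss = loss_fct(logits.view(-1, logits.size(-1)), labels.view(-1))
        return (loss, outputs) if return_outputs else loss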

Results

Our team achieved 4th place out of 46 teams in the competition, obtaining 56.84% balanced accuracy (the competition's chosen metric) on the private test set. Our approach proved effective on this challenging task.

Ground Truth  | Predicted     | Text (Romanian)
non-offensive | descriptive   | Bărbații sunt de obicei mai înalți decât femeile, în medie
non-offensive | non-offensive | Toate genurile merită să se simtă în siguranță în comunitățile lor
non-offensive | non-offensive | Am o nevastă și trei copii.
offensive     | offensive     | Sa mi bag p**a in toate numerel de inmatriculare de pe planeta asta
offensive     | offensive     | Învață bă unde sunt tastele a**l-f***tule
offensive     | direct        | Sa ma sugi si de p**a si de p***s
descriptive   | descriptive   | Care e diferența dintre o femeie și o vacă? Vacii nu se enervează când le spui că au nevoie de mai mult lapte.
descriptive   | descriptive   | Bărbații sunt mai potriviți pentru poziții de putere.
descriptive   | descriptive   | Femeile au locul în bucătărie
direct        | direct        | Sa iti dau cu p**a peste c*r pana iti raman urme rosii
direct        | direct        | Nu vezi fă cât ești de grasă că te scoate cu elicopterul dacă ai căzut în gaura de canalizare
direct        | direct        | Sunt 20 de grade afară, dar p***a ta are mai multe grade de alcoolemie după ce am stropit-o pe față cu șampanie
reporting     | reporting     | Normalizarea hărțuirii și a agresiunii sexuale, adesea prin prezentarea personajelor feminine ca fiind dispuse sau meritând un astfel de tratament
reporting     | reporting     | O tanara a fost v**lata de catre un fost iubit.
reporting     | descriptive   | Femeilor li se refuză dreptul de a deține proprietate sau de a avea controlul asupra propriilor finanțe în multe societăți

As the table above shows, the model is correct most of the time, but, because of the competition's metric, it tends to produce false positives (predicting sexist or offensive instead of the most common label, non-offensive). In practice, more training data would be needed and a higher confidence threshold would be applied before flagging a comment as sexist or offensive.
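As an illustration, a minimal post-processing sketch that applies such a threshold is shown below; the label index follows the competition's 0-4 numbering and the 0.80 threshold is an arbitrary example, not a tuned value.

# Only flag a comment when the model is confident enough; otherwise fall back
# to the most common label (non-sexist non-offensive, index 4).
import torch

NON_OFFENSIVE = 4
FLAG_THRESHOLD = 0.80  # illustrative value

def predict_with_threshold(logits: torch.Tensor) -> torch.Tensor:
    # logits: (batch_size, num_classes) raw model outputs
    probs = logits.softmax(dim=-1)
    confidence, predicted = probs.max(dim=-1)
    low_confidence_flags = (predicted != NON_OFFENSIVE) & (confidence < FLAG_THRESHOLD)
    predicted[low_confidence_flags] = NON_OFFENSIVE
    return predicted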

Other Attempts

We also attempted to improve our results with ensemble techniques and back-translation.

  • However, we found that our single fine-tuned Romanian BERT with the class-weighted cross-entropy loss still provided the best results.

  • Regarding back-translation, although we tried to augment the dataset this way, the sexist and offensive vocabulary was largely lost in the round trip: the back-translated phrases no longer contained the profanity, so the augmentation brought only limited improvement (a hypothetical sketch follows below).
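The sketch below round-trips a Romanian sentence through English using the transformers translation pipeline; the checkpoint names are assumptions, and any Romanian/English translation pair could be substituted.

# Round-trip translation tends to paraphrase the input; in our case it also
# dropped the profanity, which limited the value of the augmented samples.
from transformers import pipeline

ro_to_en = pipeline("translation", model="Helsinki-NLP/opus-mt-ro-en")  # assumed checkpoint
en_to_ro = pipeline("translation", model="Helsinki-NLP/opus-mt-en-ro")  # assumed checkpoint

def back_translate(text: str) -> str:
    english = ro_to_en(text)[0]["translation_text"]
    return en_to_ro(english)[0]["translation_text"]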

Future Approaches

  • We believe that further improvements could be made by better sanitizing the dataset.

  • We believe that using a model that has been specifically trained on similar types of text could be beneficial.

  • Additionally, one could explore data augmentation to improve the results. An approach similar to Easy Data Augmentation (EDA) could be implemented and evaluated (a sketch follows below).
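Two of the four EDA operations (random swap and random deletion) need no Romanian synonym resource and are sketched below; synonym replacement and insertion would additionally require such a resource. The parameter values are illustrative.

# EDA-style augmentation sketch: random swap and random deletion.
import random

def random_swap(words, n_swaps=1):
    words = list(words)
    for _ in range(n_swaps):
        if len(words) < 2:
            break
        i, j = random.sample(range(len(words)), 2)
        words[i], words[j] = words[j], words[i]
    return words

def random_deletion(words, p=0.1):
    if len(words) <= 1:
        return list(words)
    kept = [w for w in words if random.random() > p]
    return kept if kept else [random.choice(words)]

def augment(sentence, n_copies=2):
    words = sentence.split()
    return [" ".join(random_deletion(random_swap(words))) for _ in range(n_copies)]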

Project Structure

.
├── .devcontainer                           <- Dev Container Setup
├── .github                                 <- Github Workflows
├── .project-root                           <- Used to identify the project root
├── .vscode                                 <- Visual Studio Code settings
├── configs                                 <- Hydra configs
│   ├── data                                    <- Dataset configs
│   ├── hparams_search                          <- Hyperparameter Search configs
│   ├── hydra                                   <- Hydra runtime configs
│   ├── model                                   <- Model configs
│   ├── paths                                   <- Commonly used project paths
│   ├── predict.yaml                            <- predict.py configs
│   ├── test.yaml                               <- test.py configs
│   ├── train.yaml                              <- train.py configs
│   └── trainer                                 <- Transformers trainer configs
├── data                                    <- Datasets
│   └── ro                                      <- Romanian language
│       ├── predict_example.txt                     <- Small sample to be used with predict.py
│       ├── test_data.csv                           <- CoRoSeOf Test Data
│       └── train_data.csv                          <- CoRoSeOf Training Data
├── experiments                             <- Experiments directory
│   └── train
│       ├── multiruns                           <- Hyperparameter search
│       └── runs                                <- Single experiments
├── notebooks
│   └── hackathon_notebook.ipynb            <- The original hackathon notebook
├── predictions                             <- Results from predict.py
├── src                                     <- Source code
│   ├── __init__.py
│   ├── data                                    <- Dataset related
│   │   └── coroseof_datamodule.py
│   ├── predict.py
│   ├── test.py
│   ├── train.log
│   ├── train.py
│   ├── trainers
│   │   └── imbalanced_dataset_trainer.py       <- Custom trainer with class weights
│   └── utils
│       └── config.py                           <- Custom Omegaconf resolvers
├── submissions                             <- Results from test.py
└── tests                                   <- Tests directory

Getting Started

Thanks to our devcontainer setup, you can run our model right here on GitHub: just create a new codespace and follow the steps below. Keep in mind that training requires a GPU, which is not available in GitHub Codespaces, so you might want to use Visual Studio Code locally for that.

Warning: if running in GitHub Codespaces (or without an available GPU), you need to comment out the '"runArgs": ["--gpus", "all"]' line in the .devcontainer/devcontainer.json file. Otherwise Docker will raise an error and the container will not start.

Predict with a pretrained model

Keep in mind that a prediction will be made for each line of text.

# predict using text from a file
cat data/ro/predict_example.txt | python src/predict.py --models cosminc98/sexism-identification-coroseof

# see the results
cat predictions/prediction.tsv

# predict from stdin; after entering the command, write as many sentences as
# you want and end with the [EOF] marker:
#   Femeile au locul în bucătărie
#   [EOF]
python src/predict.py --models cosminc98/sexism-identification-coroseof

Training a new model

# run a single training run with CoRoSeOf dataset and default hyperparameters
# the model will be available in experiments/train/runs/
python src/train.py

# run hyperparameter search with the Optuna plugin from Hydra
# the model will be available in experiments/train/multiruns/
python src/train.py -m hparams_search=optuna


Creating a new Kaggle Submission

# predict on the test set
python src/test.py --models cosminc98/sexism-identification-coroseof

Now all you need to do is upload "submissions/submission.csv" to Kaggle.

Uploading to Hugging Face

A pretrained Romanian-language model is already available on the Hugging Face Hub. To upload your own fine-tuned model:

pip install huggingface_hub

# add your write token using
huggingface-cli login

echo "
push_to_hub: True
hub_model_id: \"<model-name>\"
" >> configs/trainer/default.yaml

Contact

If you have any questions about our approach or our code, please feel free to reach out to any of the contributors listed below.

Contributors ✨

  • Ștefan-Cosmin Ciocan: 💻 📖 🔬
  • Iulian Taiatu: 💻 📖 🔬
  • AndreiDumitrescu99: 💻 🔬

This project follows the all-contributors specification. Contributions of any kind welcome!

