
Fix Text

Fix Text extends the idea behind FixMatch to text-based inputs. The proposed method also builds upon the work done in the "Multimodal Semi-supervised Learning for Disaster Tweet Classification" research paper by borrowing the main ideas behind its text augmentations. The Fix Text algorithm is very similar to the FixMatch one; the only major difference is the set of augmentations applied to the input text. This code was used to produce the results presented in this paper.
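The training loop follows FixMatch closely. Below is a minimal sketch of one FixMatch-style training step adapted to text, assuming batches are already tokenized into tensors; the confidence threshold and unlabeled-loss weight are the standard FixMatch hyperparameters, not values taken from this repository.

import torch
import torch.nn.functional as F

def fixtext_step(model, labeled_batch, unlabeled_weak, unlabeled_strong,
                 threshold=0.95, lambda_u=1.0):
    # Supervised loss on the labeled batch.
    inputs, targets = labeled_batch
    loss_x = F.cross_entropy(model(inputs), targets)

    # Pseudo-label the weakly augmented unlabeled batch, keeping only
    # predictions above the confidence threshold (FixMatch-style).
    with torch.no_grad():
        probs = torch.softmax(model(unlabeled_weak), dim=-1)
        max_probs, pseudo_labels = probs.max(dim=-1)
        mask = (max_probs >= threshold).float()

    # Consistency loss: the strongly augmented view must match the pseudo-label.
    loss_u = (F.cross_entropy(model(unlabeled_strong), pseudo_labels,
                              reduction="none") * mask).mean()

    return loss_x + lambda_u * loss_u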

Our Approach - Cyberbullying Detection Task

Our goal is to compare how models trained in a semi-supervised manner using the Fix Text algorithm perform against models trained in a classic supervised way on a concrete task.
Fix Text was tested on the Cyberbullying Detection task on 4 different datasets, in 2 separate scenarios. Each dataset has a different data distribution from the others, and together they cover distinct social media contexts: Tweets, YouTube comments and Kaggle comments. When training in a semi-supervised way, one of the 4 datasets acts as the labeled dataset, while the remaining 3 act as the unlabeled dataset. The datasets used are: the Cyberbullying Tweets dataset, the Kaggle Comments dataset, the YouTube Comments dataset and another Tweets dataset.
The experiments were run in two different setups: one where the model learns from all the samples of the given labeled dataset, and another where the model is restricted to only 10 samples per category from the given labeled dataset. The first scenario is the classic approach where we have access to a fully labeled dataset and to a lot more unlabeled data. The second scenario is a less common, low-resource one where we have access to only a few labeled samples, but again to a lot more unlabeled data; a sketch of building such a subset follows below.
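For the low-resource setup, here is a hedged sketch of how a 10-samples-per-category subset could be drawn from a JSONL file; the repo's get_datasets.py may select the samples differently.

import json
import random
from collections import defaultdict

def sample_per_class(jsonl_path, k=10):
    # Group samples by label, then draw up to k of each.
    by_label = defaultdict(list)
    with open(jsonl_path) as f:
        for line in f:
            sample = json.loads(line)
            by_label[sample["label"]].append(sample)
    return [s for samples in by_label.values()
            for s in random.sample(samples, min(k, len(samples)))]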
The architecture is a simple one, based on a pretrained BERT model with a Classification Head built on top of it. The BERT weights are left unfrozen, and BERT acts as a feature extractor. The Classification Head is made of 2 sequential Linear Layers.
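A minimal sketch of this architecture follows, assuming the HuggingFace transformers library; the hidden size of the head and the mean-pooling of token embeddings are assumptions for illustration, not details taken from models/bert.py.

import torch.nn as nn
from transformers import BertModel

class ClassificationBert(nn.Module):
    def __init__(self, num_labels=2, hidden_dim=128):  # hidden_dim is an assumption
        super().__init__()
        # BERT weights stay unfrozen, acting as a trainable feature extractor.
        self.bert = BertModel.from_pretrained("bert-base-uncased")
        # Classification Head: 2 sequential Linear Layers.
        self.head = nn.Sequential(
            nn.Linear(self.bert.config.hidden_size, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, num_labels),
        )

    def forward(self, input_ids, attention_mask=None):
        outputs = self.bert(input_ids=input_ids, attention_mask=attention_mask)
        pooled = outputs.last_hidden_state.mean(dim=1)  # mean-pool token embeddings
        return self.head(pooled)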

Augmentations

For the moment, the supported augmentations are (a simplified sketch of the synonym replacement operation follows the list):

  • Backtranslation English-to-German,
  • Backtranslation English-to-Russian,
  • EDA_01 -> only 10% of the words from a sample are affected by the EDA operations,
  • EDA_02 -> only 20% of the words from a sample are affected by the EDA operations,
  • EDA_SR -> applies only the synonym replacement operation on 10% of the words from a sample.
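As an illustration, here is a simplified sketch of the synonym replacement operation used by EDA_SR, based on WordNet via NLTK; the real eda.py also filters stop words and implements the other EDA operations.

import random
from nltk.corpus import wordnet  # requires nltk.download("wordnet")

def synonym_replacement(text, alpha=0.1):
    # Replace roughly `alpha` of the words with a random WordNet synonym.
    words = text.split()
    n = max(1, int(alpha * len(words)))
    for i in random.sample(range(len(words)), min(n, len(words))):
        lemmas = {l.name().replace("_", " ")
                  for s in wordnet.synsets(words[i]) for l in s.lemmas()}
        lemmas.discard(words[i])
        if lemmas:
            words[i] = random.choice(sorted(lemmas))
    return " ".join(words)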

Disclaimer

The data used in this project includes instances of sexism and hate speech; therefore, reader discretion is strongly advised. The contributors to this project strongly oppose discrimination based on gender, religion, race or any other characteristic. One of the goals of this project is to raise awareness of the negative effects of cyberbullying.

Project Structure

.
│   .gitignore
│   README.md
│   setup.py
│
├───.vscode
│       settings.json
│
└───fixtext                      <- All necessary code for Fix Text. 
    │   test.py                      <- Test function for evaluating the models.
    │   train.py                     <- Train function for Fix Text.
    │   __init__.py
    │
    ├───augmentations                <- Augmentations. 
    │       backtranslate.py             <- Computes backtranslation.
    │       eda.py                       <- Defines operations for EDA Augmentation.
    │       precompute_eda.py            <- Precomputes EDA Augmentation for a given dataset.    
    │       stop_words.py                <- Contains a list of stop words.
    │       utils.py                     <- Utils function for computing EDA.
    │       __init__.py
    │
    ├───data                         <- Dataset.
    │       get_datasets.py              <- Functions for building JSONL Datasets and Data Loaders.
    │       text_dataset.py              <- Defines JsonlDataset Class.
    │       vocab.py                     <- Defines Vocabulary Class.
    │       __init__.py
    │
    ├───models                       <- Models.
    │       bert.py                      <- Defines ClassificationBert Class.
    │       trainer_helpers.py           <- Functions for extracting the optimizer and scheduler.
    │       __init__.py
    │
    └───utils                        <- Utils Functions.
            argument_parser.py           <- Argument Parser for the training script.
            average_meter.py             <- AverageMeter Class to save statistics regarding metrics.
            metrics.py                   <- Defines metrics of interest.
            utils.py                     <- Other utils functions.
            __init__.py

Getting Started

Setup

To set up the workspace:

  • git clone this repository:
git clone https://github.com/AndreiDumitrescu99/FixText.git
  • create a virtual environment where you will install all the needed packages, for example:
python3 -m venv dev_venv
source dev_venv/bin/activate
  • install the FixText module:
pip install -e .

Dataset Preparation

You will need 4 different jsonl files: one for the labeled training set, one for the unlabeled training set, one for the validation set and one for the testing set. A sample from these files should have the following JSON format:

{
    "text": "Some text",
    "textDE": "Text backtranslated through German.",
    "textRU": "Text backtranslated through Russian.",
    "label": "Label of the sample.",
    "dataset": "Dataset the sample belongs to.",
    "split": "Either: training, testing or validation.",
    "eda_01": ["...", "...", ...],
    "eda_02": ["...", "...", ...],
    "eda_sr": ["...", "...", ...]
}

To obtain this format, you should start with a JSONL file where the samples have the following format:

{
    "text": "Some text",
    "label": "Label of the sample.",
    "dataset": "Dataset the sample belongs to.",
    "split": "Either: training, testing or validation."
}

Using the backtranslate.py script, you can generate the textDE and textRU properties. Using the precompute_eda.py script, you can generate the eda_01, eda_02 and eda_sr properties.
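Conceptually, precomputing the EDA fields amounts to reading each JSONL sample, generating a few augmented variants and writing the file back out. The sketch below illustrates this for the eda_sr field only, reusing the synonym_replacement sketch from the Augmentations section; the actual precompute_eda.py script, its CLI and the number of variants may differ.

import json

def add_eda_sr_field(in_path, out_path, n_variants=4):  # n_variants is illustrative
    # Read samples, attach precomputed EDA variants, write back as JSONL.
    with open(in_path) as fin, open(out_path, "w") as fout:
        for line in fin:
            sample = json.loads(line)
            sample["eda_sr"] = [synonym_replacement(sample["text"], 0.1)
                                for _ in range(n_variants)]
            fout.write(json.dumps(sample) + "\n")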

Training

Suppose you have 2 datasets, X and Y, on which you want to run FixText, and that you have already run the dataset-preparation steps presented above. The final folder structure where the datasets are stored should look like this:

.
│
├───dev
│       dataset_X.jsonl
│       dataset_Y.jsonl
│
├───test
│       dataset_X.jsonl
│       dataset_Y.jsonl
│
├───train
│       dataset_X.jsonl
│       dataset_Y.jsonl
│
└───unlabeled
        unlabeled_dataset_X.jsonl
        unlabeled_dataset_Y.jsonl

Take a closer look at the arguments in argument_parser.py and decide which hyperparameters you want to run with.
To run the FixText algorithm on dataset X, run the train.py file with the desired arguments, or simply run with the default parameters like this:

python train.py --data_path path_to_datasets_folders --unlabeled_dataset unlabeled_dataset_X.jsonl --task dataset_X.jsonl --out path_to_output_folder

Further Work

As further work, I would like:

  • to add support for Hydra,
  • to better generalize the use case of Fix Text,
  • to add unit tests for some functions,
  • to add more augmentations.
