This repository contains the final code and documentation for the SSL Project.
The topic of the project is Abusive Language Detection in Social Media.
The project offers multiple baselines and state-of-the-art models.
To run the experiments, you need the following libraries: TensorFlow, NumPy, PyTorch, HuggingFace Transformers, scikit-learn, pandas, and NLTK.
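As a quick sanity check before running anything, a small script along these lines could report which of the libraries above are missing. This is only a sketch: the import names in `REQUIRED_MODULES` are the usual ones for these packages, and the helper `find_missing` is a hypothetical name, not part of this repository.

```python
import importlib.util

# Usual import names for the libraries listed above (an assumption; the
# import name can differ from the package name, e.g. PyTorch -> "torch").
REQUIRED_MODULES = {
    "TensorFlow": "tensorflow",
    "NumPy": "numpy",
    "PyTorch": "torch",
    "HuggingFace Transformers": "transformers",
    "scikit-learn": "sklearn",
    "pandas": "pandas",
    "NLTK": "nltk",
}


def find_missing(modules=REQUIRED_MODULES):
    """Return the display names of libraries that cannot be located."""
    missing = []
    for name, module in modules.items():
        # find_spec returns None when a top-level module is not installed.
        if importlib.util.find_spec(module) is None:
            missing.append(name)
    return missing


if __name__ == "__main__":
    missing = find_missing()
    if missing:
        print("Missing libraries:", ", ".join(missing))
    else:
        print("All required libraries are installed.")
```

The check only looks the modules up without importing them, so it stays fast even for heavy packages such as TensorFlow.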
The structure of the repository is the following:
* docs folder: contains all the documentation presented during this semester
* datasets folder: contains all the datasets used for this project
* plots folder: contains some graphs plotted during the first iteration experiments
* baselines.py, first_iteration.py, second_iteration.py files: contain the code for training and evaluating the baseline / first iteration / second iteration
* torch_dataset.py file: contains a custom dataset implementation
* data_reader.py, preprocess.py files: contain auxiliary functions used for data reading and processing
* utils.py file: contains auxiliary functions used for plotting graphs
* prediction.py file: contains code for predicting on different datasets
Each of the files baselines.py, first_iteration.py, and second_iteration.py contains functions that follow the template: run_X_experiments(...).
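To see which functions in a given file follow this naming template, one option is to filter a module's functions by name. The sketch below is an illustration only: `find_experiment_runners` is a hypothetical helper, and the stand-in module with its function names is made up for the demo, since the actual function names and arguments are not listed here.

```python
import inspect
import re
import types

# Matches names of the form run_X_experiments.
RUNNER_PATTERN = re.compile(r"^run_\w+_experiments$")


def find_experiment_runners(module):
    """Return the sorted names of functions in `module` matching the template."""
    return sorted(
        name
        for name, obj in inspect.getmembers(module, inspect.isfunction)
        if RUNNER_PATTERN.match(name)
    )


# Stand-in module with made-up function names, used only for demonstration.
demo = types.ModuleType("demo")
demo.run_alpha_experiments = lambda: None
demo.run_beta_experiments = lambda: None
demo.helper = lambda: None

print(find_experiment_runners(demo))
```

Passing one of the real modules (for example after `import baselines`) should work the same way, assuming the module can be imported without side effects and its dependencies are installed.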
To train and evaluate the second iteration models simply run:
python second_iteration.py
The models from the baselines and first iteration will be saved in a folder called models.
The models from the second iteration will be saved in a folder called models_torch.
Make sure that these folders exist!
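The two folders named above could be created up front with a few lines of Python. This is a minimal sketch; `ensure_model_dirs` is a hypothetical helper and assumes the folders are expected relative to the directory you run the scripts from.

```python
import os

# Folder names taken from the text above.
MODEL_DIRS = ("models", "models_torch")


def ensure_model_dirs(base_dir=".", names=MODEL_DIRS):
    """Create the output folders if they do not exist; return their paths."""
    paths = []
    for name in names:
        path = os.path.join(base_dir, name)
        os.makedirs(path, exist_ok=True)  # no error if the folder already exists
        paths.append(path)
    return paths
```

Calling `ensure_model_dirs()` from the repository root before starting an experiment should be enough, and calling it again later is harmless.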
To run the prediction demo simply run:
python prediction.py
You can change the global parameters in prediction.py to run on any dataset you want.