
tiagomantunes / karen

KAREN: Unifying Hatespeech Detection and Benchmarking

Python 100.00%

abuse benchmark bert deep-learning detection framework hate hatespeech huggingface natural-language-processing nlp offensive offensive-language pytorch sentence-classification speech tfidf twitter

karen's Introduction

Hi there 👋

I'm Tiago, an engineer with a focus on performance analysis and parallelisation. My specialty is developing high-performing applications with CUDA acceleration.

I took my master's degree at Tsinghua University and I have worked on accelerating both Deep Learning and Big Data workloads.

I'm currently looking to keep expanding my knowledge of C++ and to start diving deeper into Rust. I'm proficient in many other areas.

Check the pinned projects for some of my most interesting works :) Thanks!


karen's Issues

Fix CharCNN evaluation

This model is designed to run on characters but it's currently being evaluated on word tokens.

Solution:

Add character embeddings to the datasets and run CharCNN on them.
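
As a rough illustration of the proposed solution, the sketch below encodes text as character indices instead of word tokens. It is not the framework's code: the names `build_char_vocab` and `encode_chars`, the reserved `PAD` and `UNK` indices, and the fixed `max_len` are all assumptions made for the example.

```python
PAD, UNK = 0, 1  # hypothetical reserved indices for padding and unknown characters


def build_char_vocab(texts):
    """Assign an index (starting at 2) to every distinct character, in sorted order."""
    chars = sorted({ch for text in texts for ch in text})
    return {ch: index + 2 for index, ch in enumerate(chars)}


def encode_chars(text, vocab, max_len):
    """Truncate or pad a string to max_len character indices."""
    ids = [vocab.get(ch, UNK) for ch in text[:max_len]]
    return ids + [PAD] * (max_len - len(ids))


vocab = build_char_vocab(["ab", "bc"])
encoded = encode_chars("abz", vocab, 5)  # 'z' is unseen, so it maps to UNK
```

The resulting lists of integers could then be stacked into a tensor and fed to a character embedding layer, so that CharCNN is evaluated on the inputs it was designed for.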

Deterministic computation

This is how we currently seed the computation, following the recipe given in pytorch/pytorch#7068:

import os
import random

import numpy as np
import torch

# Setting this at runtime does not change string hashing in the
# already-running interpreter; export it before launching Python.
os.environ['PYTHONHASHSEED'] = str(seed)
random.seed(seed)  # Python random module.
np.random.seed(seed)  # Numpy module.
torch.manual_seed(seed)
torch.cuda.manual_seed(seed)
torch.cuda.manual_seed_all(seed)  # if you are using multi-GPU.
torch.backends.cudnn.benchmark = False
torch.backends.cudnn.deterministic = True

However, running a simple SoftmaxRegression model twice with the same seed produces different outputs:

tiagoantunes:hatespeech/ (master) $ python3 run.py --model softmaxregression --dataset HATexplAin --max-epochs 1 --batch-size 64  [20:07:09]
******************************  CONFIGURATION  ******************************
batch_size                              64
cpu                                     False
dataset                                 ['hatexplain']
dropout                                 0.1
embedding_dim                           200
embeddings                              None
lr                                      0.001
max_epochs                              1
model                                   ['softmaxregression']
savename_hatexplain                     HateXPlain.dataset
seed                                    12345
url_hatexplain                          https://raw.githubusercontent.com/hate-alert/HateXplain/master/Data/dataset.json
***************************************************************************** 

Preprocessing HateXPlain

Starting training of (Model=softmaxregression Dataset=hatexplain)
100%|██████████| 255/255 [00:00<00:00, 309.05it/s, loss=972]
Epoch #1 validation accuracy = 0.294789
Accuracy increased from 0 to 0.29478907585144043, saving model.

Test accuracy: 0.2730281352996826
tiagoantunes:hatespeech/ (master) $ python3 run.py --model softmaxregression --dataset HATexplAin --max-epochs 1 --batch-size 64  [20:07:19]
******************************  CONFIGURATION  ******************************
batch_size                              64
cpu                                     False
dataset                                 ['hatexplain']
dropout                                 0.1
embedding_dim                           200
embeddings                              None
lr                                      0.001
max_epochs                              1
model                                   ['softmaxregression']
savename_hatexplain                     HateXPlain.dataset
seed                                    12345
url_hatexplain                          https://raw.githubusercontent.com/hate-alert/HateXplain/master/Data/dataset.json
***************************************************************************** 

Preprocessing HateXPlain

Starting training of (Model=softmaxregression Dataset=hatexplain)
100%|██████████| 255/255 [00:00<00:00, 287.52it/s, loss=979]
Epoch #1 validation accuracy = 0.300744
Accuracy increased from 0 to 0.3007444143295288, saving model.

Test accuracy: 0.2961941659450531

I have also tried torch.use_deterministic_algorithms(True), without success.

I haven't been able to find a fix. Solutions/Suggestions are appreciated.
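
One thing that may be worth ruling out, offered as an assumption rather than a diagnosis: `PYTHONHASHSEED` is read when the interpreter starts, so assigning it through `os.environ` inside the script does not fix string hashing for that run. If any preprocessing step iterates over a `set` of tokens to assign indices, the vocabulary order could then differ between runs. The sketch below uses only the standard library; `build_vocab` and `string_hash_in_subprocess` are hypothetical helpers, not part of the framework.

```python
import os
import subprocess
import sys


def build_vocab(tokens):
    """Map each distinct token to an index in sorted order.

    Sorting makes the mapping independent of set iteration order, which
    can change between interpreter runs when string hashing is randomized.
    """
    return {token: index for index, token in enumerate(sorted(set(tokens)))}


def string_hash_in_subprocess(text, hash_seed):
    """Return hash(text) from a fresh interpreter started with PYTHONHASHSEED set."""
    env = dict(os.environ, PYTHONHASHSEED=str(hash_seed))
    result = subprocess.run(
        [sys.executable, "-c", f"print(hash({text!r}))"],
        env=env, capture_output=True, text=True, check=True,
    )
    return int(result.stdout)


vocab = build_vocab(["b", "a", "b", "c"])
```

Calling `string_hash_in_subprocess` twice with the same seed should give the same value, which illustrates that the variable takes effect only when it is set before the interpreter launches. Running the script as `PYTHONHASHSEED=12345 python3 run.py ...` would be one way to test whether hashing is involved here.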

Add missing references

A few models are missing references in their model headers. Each reference should follow the pattern used in AngryBERT.

Models missing reference:

CharCNN
NetLSTM

Next steps

The framework now provides an easy-to-use interface that speeds up model and dataset evaluation. Some things can still be improved.

Overall, we need more models. If you have a model that you would like to see implemented, feel free to submit a pull request.

For the future, a few functionalities are desired:

Support more embeddings
Compatibility is currently very easy to break. It would be preferable to have a configuration file listing the available classifications
Add support for custom training. For example, some models would prefer a sparse Adam optimizer or need support for secondary tasks
More things can be added to the toolkit. Any suggestions?

If you have any suggestions, comment them below.
Thank you!
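
To make the configuration-file idea above more concrete, here is a minimal sketch of one possible reading of it: a JSON file declares what each dataset provides and what each model expects, and a small check reports mismatches before training starts. The file layout, the `inputs`/`input` keys, and the `check_compatibility` function are all hypothetical and only meant to start the discussion.

```python
import json

# Hypothetical configuration; every name and value below is illustrative.
CONFIG_TEXT = """
{
  "datasets": {"hatexplain": {"inputs": ["word"]}},
  "models": {
    "softmaxregression": {"input": "word"},
    "charcnn": {"input": "char"}
  }
}
"""


def check_compatibility(config, model, dataset):
    """Return a list of human-readable problems; an empty list means compatible."""
    problems = []
    if model not in config["models"]:
        problems.append(f"unknown model: {model}")
    if dataset not in config["datasets"]:
        problems.append(f"unknown dataset: {dataset}")
    if problems:
        return problems
    required = config["models"][model]["input"]
    provided = config["datasets"][dataset]["inputs"]
    if required not in provided:
        problems.append(
            f"{model} needs {required} inputs, but {dataset} provides {provided}"
        )
    return problems


config = json.loads(CONFIG_TEXT)
```

With a check like this, a combination such as a character-level model on a dataset that only provides word tokens could be rejected early instead of silently producing misleading results.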
