Toxic Comment Classification

Multilabel Text Classification with:

  • TF-IDF + LogisticRegression (baseline)
  • Pretrained BertModel

Dataset

The dataset is available on Kaggle. It contains more than 310k Wikipedia comments that human raters have labeled for toxic behavior. The types of toxicity are:

  • toxic
  • severe_toxic
  • obscene
  • threat
  • insult
  • identity_hate
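
Because the task is multilabel, each comment carries six independent 0/1 flags rather than a single class. The following is a minimal sketch of reading such targets, assuming the CSV holds one column per label plus the `comment_text` column named in the config; the `id` column, the sample rows, and the `read_targets` helper are made up for illustration and are not taken from the project.

```python
import csv
import io

LABELS = ["toxic", "severe_toxic", "obscene", "threat", "insult", "identity_hate"]

# Tiny invented sample in the assumed column layout (not real data).
SAMPLE_CSV = """id,comment_text,toxic,severe_toxic,obscene,threat,insult,identity_hate
a1,"Thanks for the edit, looks good.",0,0,0,0,0,0
a2,"Hypothetical rude and insulting comment.",1,0,1,0,1,0
"""


def read_targets(handle):
    """Return (texts, targets); each target is a list of six 0/1 flags."""
    texts, targets = [], []
    for row in csv.DictReader(handle):
        texts.append(row["comment_text"])
        targets.append([int(row[label]) for label in LABELS])
    return texts, targets


texts, targets = read_targets(io.StringIO(SAMPLE_CSV))
for text, flags in zip(texts, targets):
    # Several labels can be active at once, or none at all.
    active = [label for label, flag in zip(LABELS, flags) if flag]
    print(text, "->", active or ["(none)"])
```

A model for this task therefore predicts six probabilities per comment, one per label, instead of choosing one label out of six.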

Config

The user interface consists of a single file:

  • config.yaml - general configuration with data and model parameters

Default config.yaml:

seed: 42

data:
  train_data_path: ../data/train.csv
  test_data_path: ../data/test.csv
  test_size: 0.05
  sep: ','
  text_column: comment_text
  target_columns: ['toxic', 'severe_toxic', 'obscene', 'threat', 'insult', 'identity_hate']
  output_path: output
  log_file_path: logs.txt

model_type: baseline  # baseline (tf-idf + logreg) or bert
save_model: true
test_run: true


tf-idf:
  word:
    sublinear_tf: true
    strip_accents: unicode
    analyzer: word
    ngram_range: (1, 1)
    max_features: 10000
  char:
    sublinear_tf: true
    strip_accents: unicode
    analyzer: char
    ngram_range: (2, 6)
    max_features: 50000

logreg:
  penalty: l2
  C: 1.0
  class_weight: balanced
  solver: lbfgs
  n_jobs: -1

bert:
  bert_model_name: bert-base-cased
  max_token_len: 128
  batch_size: 32
  epochs: 2
  learning_rate: 0.00002
  warmup_frac: 0.2

Usage

Run in terminal: python -m Toxic_comment_classification

Output

After training, the pipeline produces the following files:

  • logreg_model.joblib / bert_best_model.pt -- saved models
  • logs.txt -- file with logs
