ml_final_project's Introduction

Machine Learning Final Project

Kaggle Competition: Jigsaw Unintended Bias in Toxicity Classification

TODO

Some ideas for optimization directions, for reference only :D

  • EDA

  • Preprocess

    Borrow ideas from the previous competition to address the out-of-vocabulary (OOV) problem

    • BPE
    • TTA
  • Model

    • Sequence models: BiLSTM, HAN, ...
    • BERT fine-tuning
  • Loss

    • Focal loss (with alpha chosen according to identity)
  • Metrics

  • Augmentation

    • Adversarial Training
    • Train an identity classifier (400K examples), use it to classify the data from the previous Jigsaw competition, and extract identity negative samples
    • VFAE
  • Tricks
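The focal-loss idea from the Loss item above, with alpha chosen by identity, could look roughly like the following. This is a minimal pure-Python sketch and not the project's implementation: `alpha_for`, the `mentions_identity` flag, and the values 0.25 and 0.1 are hypothetical, chosen only to show how identity comments can put more weight on their negative (non-toxic) class.

```python
import math


def focal_loss(y_true, p_pred, alpha, gamma=2.0, eps=1e-7):
    """Binary focal loss for a single example.

    alpha weights the positive class and (1 - alpha) the negative class;
    gamma down-weights examples the model already classifies well.
    """
    p = min(max(p_pred, eps), 1.0 - eps)  # clip to avoid log(0)
    if y_true == 1:
        return -alpha * (1.0 - p) ** gamma * math.log(p)
    return -(1.0 - alpha) * p ** gamma * math.log(1.0 - p)


def alpha_for(mentions_identity, base_alpha=0.25, identity_alpha=0.1):
    """Hypothetical rule: a lower alpha for identity comments means their
    negatives receive a larger weight (1 - alpha)."""
    return identity_alpha if mentions_identity else base_alpha


# A non-toxic comment scored 0.6, with and without an identity mention.
loss_identity = focal_loss(0, 0.6, alpha_for(True))
loss_background = focal_loss(0, 0.6, alpha_for(False))
```

With these assumed values, a misclassified non-toxic identity comment contributes more to the loss than the same mistake on a background comment.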

Preprocessed Data

  • Managed with git-lfs (X): cannot upload new objects to a public fork
  • Deals with the OOV and data-imbalance problems
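To illustrate how BPE helps with OOV words, here is a minimal sketch of the classic byte-pair-encoding procedure: learn the most frequent symbol merges from a corpus, then segment any word (including unseen ones) into known subword units. The function names and the toy corpus are illustrative assumptions, not code from this repository.

```python
from collections import Counter


def merge_pair(symbols, pair):
    """Replace every adjacent occurrence of `pair` with one merged symbol."""
    out, i = [], 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            out.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            out.append(symbols[i])
            i += 1
    return tuple(out)


def learn_bpe(words, num_merges):
    """Learn up to `num_merges` merges from an iterable of words."""
    # Each word starts as a sequence of characters plus an end-of-word marker.
    vocab = Counter(tuple(w) + ("</w>",) for w in words)
    merges = []
    for _ in range(num_merges):
        pairs = Counter()
        for symbols, freq in vocab.items():
            for a, b in zip(symbols, symbols[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break
        best = max(pairs, key=pairs.get)  # most frequent adjacent pair
        merges.append(best)
        new_vocab = Counter()
        for symbols, freq in vocab.items():
            new_vocab[merge_pair(symbols, best)] += freq
        vocab = new_vocab
    return merges


def segment(word, merges):
    """Split a word into subword units by replaying the learned merges."""
    symbols = tuple(word) + ("</w>",)
    for pair in merges:
        symbols = merge_pair(symbols, pair)
    return list(symbols)


merges = learn_bpe(["abab", "abab", "cd"], num_merges=1)
pieces = segment("abcd", merges)  # "abcd" never appeared in the corpus
```

Because an unseen word falls back to smaller units (down to single characters), every piece can still be mapped to an embedding instead of a single unknown token.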

Project Requirements

Detect toxic comments and minimize unintended model bias

Given the text of a comment, predict whether it is toxic (toxicity is defined as anything rude, disrespectful or otherwise likely to make someone leave a discussion), and reduce the model's unintended bias so that its outputs are fairer.

Model Building

Model definitions live in the models folder; the BERT model is in keras_layers/keras_bert

  • Word2vec + BPE + 2x BiLSTM
  • ELMo + 2x BiLSTM
  • Word2vec + DGCNN
  • BERT fine-tuning

Bias Optimization

  • Sample Weight

    • Weight the loss for subgroup samples and for subgroup negative samples
  • Custom objective function

    • Rank Loss, Focal Loss
  • Data Augmentation

    • Expand the subgroup samples
    • Balance the ratio of positive to negative subgroup samples
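The sample-weight idea above (extra loss weight for subgroup samples, and more again for subgroup negatives) can be sketched as follows. The function name, the additive scheme, and the default weights are assumptions for illustration; the weights actually used in this project are not stated here.

```python
def sample_weights(labels, in_subgroup, subgroup_w=1.0, subgroup_neg_w=1.0):
    """Return one loss weight per example.

    labels:      0/1 toxicity labels
    in_subgroup: True if the comment mentions any identity subgroup
    Every example starts at 1.0; subgroup examples gain `subgroup_w`, and
    non-toxic subgroup examples gain a further `subgroup_neg_w`.
    """
    weights = []
    for y, g in zip(labels, in_subgroup):
        w = 1.0
        if g:
            w += subgroup_w
            if y == 0:
                w += subgroup_neg_w
        weights.append(w)
    # Normalise to mean 1 so the overall loss scale stays comparable.
    mean = sum(weights) / len(weights)
    return [w / mean for w in weights]


weights = sample_weights(
    labels=[1, 0, 1, 0],
    in_subgroup=[False, False, True, True],
)
```

Weights of this kind could be passed to Keras through the `sample_weight` argument of `fit`.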

Experiment Results

| model | final CV |
| --- | --- |
| bert | 0.939 |
| bert+sample weight | 0.941 |
| bert+sample weight+label identity | 0.943 |
| bilstm+sample weight+5 fold | 0.938 |
| bilstm+sample weight+aug | 0.940 |
| dgcnn+sample weight+5 fold | 0.937 |
| elmo+bilstm+sample weight | 0.938 |
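Assuming the "final CV" column reports the competition's bias-aware score (this is not stated explicitly here), the sketch below shows how that score is put together: the overall AUC plus the generalized power mean (p = -5) of three per-subgroup AUCs, each weighted 0.25. The function names and the toy data are illustrative.

```python
def auc(labels, scores):
    """ROC AUC: chance that a random positive outranks a random negative."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        return float("nan")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def bias_auc(labels, scores, in_group, kind):
    """AUC restricted to one of the three bias slices for a subgroup."""
    def keep(y, g):
        if kind == "subgroup":  # only comments mentioning the subgroup
            return g
        if kind == "bpsn":      # background positives + subgroup negatives
            return (g and y == 0) or (not g and y == 1)
        if kind == "bnsp":      # background negatives + subgroup positives
            return (g and y == 1) or (not g and y == 0)
        raise ValueError(kind)
    rows = [(y, s) for y, s, g in zip(labels, scores, in_group) if keep(y, g)]
    return auc([y for y, _ in rows], [s for _, s in rows])


def power_mean(values, p=-5):
    """Generalized mean; a negative p emphasises the worst subgroup.
    Assumes every value is > 0."""
    return (sum(v ** p for v in values) / len(values)) ** (1.0 / p)


def final_metric(labels, scores, subgroups, p=-5, w=0.25):
    """subgroups maps a subgroup name to a list of membership booleans."""
    total = w * auc(labels, scores)
    for kind in ("subgroup", "bpsn", "bnsp"):
        per_group = [bias_auc(labels, scores, g, kind) for g in subgroups.values()]
        total += w * power_mean(per_group, p)
    return total


score = final_metric(
    labels=[1, 0, 1, 0],
    scores=[0.9, 0.1, 0.8, 0.2],
    subgroups={"example_group": [True, True, False, False]},
)
```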

Training

Select the model to train inside trainer (TODO: add command-line arguments)

nohup python -u trainer.py > train.log 2>&1 &
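One way the command-line TODO could be addressed is with `argparse`. The sketch below is only a suggestion: the flag names (`--model`, `--folds`, `--sample-weight`) and the model identifiers are hypothetical and do not exist in trainer.py today.

```python
import argparse

# Hypothetical identifiers for the four model families listed above.
MODEL_CHOICES = ["bilstm", "elmo_bilstm", "dgcnn", "bert"]


def parse_args(argv=None):
    """Parse training options; argv=None falls back to sys.argv."""
    parser = argparse.ArgumentParser(description="Train a toxicity classifier")
    parser.add_argument("--model", choices=MODEL_CHOICES, default="bert",
                        help="which model to train")
    parser.add_argument("--folds", type=int, default=1,
                        help="number of cross-validation folds")
    parser.add_argument("--sample-weight", action="store_true",
                        help="enable subgroup sample weighting")
    return parser.parse_args(argv)


# Equivalent to: python trainer.py --model bilstm --folds 5
args = parse_args(["--model", "bilstm", "--folds", "5"])
```

With something like this in place, the model would no longer have to be selected by editing trainer.py before each run.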

ml_final_project's People

Contributors

wangshengguang, wyazx, daviddwlee84
