Kaggle Competition: Jigsaw Unintended Bias in Toxicity Classification
Some ideas for optimization directions, for reference only :D
- EDA
- Preprocess
  - Reuse ideas from previous competitions to handle the OOV problem (see the BPE sketch after this list)
  - BPE
  - TTA
- Model
  - Sequence models: biLSTM, HAN, ...
  - BERT fine-tuning
- Loss
  - Focal loss (with alpha determined by identity; a focal loss sketch appears in the custom objective section below)
- Metrics (the competition's bias-weighted AUC; a sketch appears after the results table below)
- Augmentation
  - Adversarial training
    - Train an identity classifier (~400k examples), classify the data from the previous Jigsaw competition with it, and extract identity negative samples
  - VFAE (Variational Fair Autoencoder)
- Tricks
  - multi-task
  - sample_weights
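As a concrete example of the BPE idea above, here is a minimal sketch using the third-party `bpemb` package (an assumption; the repo may handle subwords differently). Any OOV word is split into subword pieces that all have pretrained embeddings, so no token is truly out-of-vocabulary:

```python
# Sketch: BPE subword embeddings as an OOV fallback, via the `bpemb`
# package (pip install bpemb). Illustrative only; not necessarily the
# pipeline used in this repo.
from bpemb import BPEmb

# Pretrained English BPE model: 50k-merge vocabulary, 300-dim vectors.
bpe = BPEmb(lang="en", vs=50000, dim=300)

def embed_text(text):
    """Split text into BPE pieces and return their embeddings."""
    pieces = bpe.encode(text)   # rare words break into known subword pieces
    vectors = bpe.embed(text)   # shape: (len(pieces), 300)
    return pieces, vectors

pieces, vectors = embed_text("unbelievably toxicomments")
print(pieces, vectors.shape)
```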
- Managed with git-lfs (X: cannot upload new objects to a public fork)
- Dealing with OOV and the data-imbalance problem
The task: detect toxic comments, and minimize unintended model bias. Given a comment's text, predict whether it is toxic (toxicity is defined as anything rude, disrespectful or otherwise likely to make someone leave a discussion), while reducing the model's unintended bias so that its predictions are fairer.
The model definitions live under the `models` folder; the BERT model lives under `keras_layers/keras_bert`:
- Word2vec + BPE + 2x biLSTM
- ELMo + 2x biLSTM
- Word2vec + DGCNN
- BERT fine-tune
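For reference, a minimal sketch of the 2x biLSTM variant in Keras; all hyperparameters here are placeholders, and the real definitions live under `models`:

```python
# Sketch of a two-layer biLSTM classifier in Keras; vocab size, dims
# and units are placeholders, not this repo's settings.
from tensorflow.keras import layers, models

MAX_LEN, VOCAB_SIZE, EMBED_DIM, LSTM_UNITS = 220, 100_000, 300, 128

inp = layers.Input(shape=(MAX_LEN,), dtype="int32")
# In the real models the embedding matrix would be initialized from
# word2vec/BPE/ELMo vectors rather than learned from scratch.
x = layers.Embedding(VOCAB_SIZE, EMBED_DIM)(inp)
x = layers.Bidirectional(layers.LSTM(LSTM_UNITS, return_sequences=True))(x)
x = layers.Bidirectional(layers.LSTM(LSTM_UNITS, return_sequences=True))(x)
# Pool over time, then predict the toxicity probability.
x = layers.concatenate([layers.GlobalMaxPooling1D()(x),
                        layers.GlobalAveragePooling1D()(x)])
out = layers.Dense(1, activation="sigmoid")(x)

model = models.Model(inp, out)
model.compile(optimizer="adam", loss="binary_crossentropy")
```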
- Sample Weight
  - Weight the loss of subgroup samples and subgroup negative samples (see the weighting sketch after this list)
- Custom objective function
  - Rank Loss, Focal Loss (see the focal loss sketch after this list)
- Data Augmentation
  - Expand the subgroup samples
  - Balance the ratio of positive to negative subgroup samples
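A minimal sketch of the subgroup loss weighting, in the style of the public kernels for this competition; the identity columns match the competition data, but the weight values are illustrative assumptions rather than this repo's exact choices:

```python
# Sketch of per-sample loss weights that emphasize identity subgroups,
# especially non-toxic comments that mention an identity.
import numpy as np
import pandas as pd

IDENTITY_COLS = [
    "male", "female", "homosexual_gay_or_lesbian", "christian",
    "jewish", "muslim", "black", "white", "psychiatric_or_mental_illness",
]

def make_sample_weights(train: pd.DataFrame) -> np.ndarray:
    toxic = train["target"].values >= 0.5
    in_subgroup = (train[IDENTITY_COLS].fillna(0).values >= 0.5).any(axis=1)

    weights = np.ones(len(train))
    weights += in_subgroup                 # mentions any identity
    weights += toxic & ~in_subgroup        # toxic without identity mention
    weights += (~toxic & in_subgroup) * 2  # subgroup negative: weighted hardest
    return weights / weights.mean()        # keep the average loss scale stable
```

These weights can then be passed as `sample_weight` to Keras `model.fit`.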
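And a Keras sketch of focal loss (Lin et al., 2017); gamma and alpha below are the paper's defaults, not necessarily this repo's values. Making alpha depend on identity, as in the ideas list above, would replace the scalar `alpha` with a per-sample tensor:

```python
from tensorflow.keras import backend as K

def focal_loss(gamma=2.0, alpha=0.25):
    """Binary focal loss: down-weights well-classified examples."""
    def loss(y_true, y_pred):
        y_pred = K.clip(y_pred, K.epsilon(), 1.0 - K.epsilon())
        # p_t: the predicted probability assigned to the true class.
        p_t = y_true * y_pred + (1.0 - y_true) * (1.0 - y_pred)
        alpha_t = y_true * alpha + (1.0 - y_true) * (1.0 - alpha)
        return -K.mean(alpha_t * K.pow(1.0 - p_t, gamma) * K.log(p_t))
    return loss
```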
| model | final CV |
| --- | --- |
| bert | 0.939 |
| bert + sample weight | 0.941 |
| bert + sample weight + label identity | 0.943 |
| bilstm + sample weight + 5-fold | 0.938 |
| bilstm + sample weight + aug | 0.940 |
| dgcnn + sample weight + 5-fold | 0.937 |
| elmo + bilstm + sample weight | 0.938 |
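"final CV" is presumably the competition's bias-weighted AUC: the overall AUC plus generalized power means (p = -5) of the per-identity Subgroup, BPSN, and BNSP AUCs, each term weighted 0.25. A sketch with sklearn, assuming numpy arrays for labels and predictions and one boolean mask per identity:

```python
# Sketch of the competition's final metric: a weighted combination of
# the overall AUC and power means of three per-identity bias AUCs.
import numpy as np
from sklearn.metrics import roc_auc_score

POWER = -5.0

def power_mean(aucs, p=POWER):
    return np.mean(np.array(aucs) ** p) ** (1.0 / p)

def bias_aucs(y_true, y_pred, subgroup_mask):
    toxic = y_true >= 0.5
    # Subgroup AUC: only comments that mention the identity.
    sub = roc_auc_score(toxic[subgroup_mask], y_pred[subgroup_mask])
    # BPSN: background positives + subgroup negatives.
    bpsn_mask = (toxic & ~subgroup_mask) | (~toxic & subgroup_mask)
    bpsn = roc_auc_score(toxic[bpsn_mask], y_pred[bpsn_mask])
    # BNSP: background negatives + subgroup positives.
    bnsp_mask = (~toxic & ~subgroup_mask) | (toxic & subgroup_mask)
    bnsp = roc_auc_score(toxic[bnsp_mask], y_pred[bnsp_mask])
    return sub, bpsn, bnsp

def final_metric(y_true, y_pred, subgroup_masks):
    # Group the three AUC types across identities, then power-mean each.
    per_type = list(zip(*[bias_aucs(y_true, y_pred, m) for m in subgroup_masks]))
    overall = roc_auc_score(y_true >= 0.5, y_pred)
    return 0.25 * overall + sum(0.25 * power_mean(a) for a in per_type)
```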
Select which model to train in `trainer.py` (TODO: add command-line arguments):

```bash
nohup python -u trainer.py > train.log 2>&1 &
```
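For the TODO above, one possible shape for the command-line interface; the `--model` flag and its choices are hypothetical, not options that exist in `trainer.py` today:

```python
# Hypothetical CLI for trainer.py; flag names and choices are
# illustrative, not currently implemented options.
import argparse

def parse_args():
    parser = argparse.ArgumentParser(description="Train a toxicity model.")
    parser.add_argument("--model", default="bert",
                        choices=["bert", "bilstm", "dgcnn", "elmo"],
                        help="which model under models/ to train")
    parser.add_argument("--sample-weight", action="store_true",
                        help="enable subgroup sample weighting")
    parser.add_argument("--folds", type=int, default=1,
                        help="number of CV folds")
    return parser.parse_args()

if __name__ == "__main__":
    args = parse_args()
    print(args)
```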