
SentiLARE: Sentiment-Aware Language Representation Learning with Linguistic Knowledge

Introduction

SentiLARE is a sentiment-aware pre-trained language model enhanced with linguistic knowledge. You can read our paper for more details. This repository provides a PyTorch implementation of our work.

Dependencies

  • Python 3
  • NumPy
  • Scikit-learn
  • PyTorch >= 1.3.0
  • PyTorch-Transformers (Huggingface) 1.2.0
  • TensorboardX
  • Sentence Transformers 0.2.6 (Optional, used for linguistic knowledge acquisition during pre-training and fine-tuning)
  • NLTK (Optional, used for linguistic knowledge acquisition during pre-training and fine-tuning)

Quick Start for Fine-tuning

Datasets of Downstream Tasks

Our experiments cover sentence-level sentiment classification (e.g. SST / MR / IMDB / Yelp-2 / Yelp-5) and aspect-level sentiment analysis (e.g. Lap14 / Res14 / Res16). You can download the pre-processed datasets (Google Drive / Tsinghua Cloud) for the downstream tasks. A detailed description of the data formats is attached to the datasets.

Fine-tuning

To quickly run the fine-tuning experiments, you can directly download the checkpoint (Google Drive / Tsinghua Cloud) of our pre-trained model. An example of fine-tuning SentiLARE on SST follows:

cd finetune
CUDA_VISIBLE_DEVICES=0,1,2 python run_sent_sentilr_roberta.py \
          --data_dir data/sent/sst \
          --model_type roberta \
          --model_name_or_path pretrain_model/ \
          --task_name sst \
          --do_train \
          --do_eval \
          --max_seq_length 256 \
          --per_gpu_train_batch_size 4 \
          --learning_rate 2e-5 \
          --num_train_epochs 3 \
          --output_dir sent_finetune/sst \
          --logging_steps 100 \
          --save_steps 100 \
          --warmup_steps 100 \
          --eval_all_checkpoints \
          --overwrite_output_dir

Note that data_dir is set to the directory of the pre-processed SST dataset, and model_name_or_path is set to the directory of the pre-trained model checkpoint. output_dir is the directory where the fine-tuning checkpoints are saved. Refer to the fine-tuning code for descriptions of the other hyper-parameters.

More details about fine-tuning SentiLARE on other datasets can be found in finetune/README.MD.

POS Tagging and Polarity Acquisition for Downstream Tasks

During pre-processing, we tokenize the original datasets with NLTK, tag the sentences with the Stanford Log-Linear Part-of-Speech Tagger, and obtain the sentiment polarity with Sentence-BERT.
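
As a rough illustration, here is a minimal sketch of the shape of that pipeline. It is not the repository's actual pre-processing script (see the preprocess/ directory for that): NLTK's built-in tagger stands in for the Stanford tagger, and the positive/negative anchor sentences used to score polarity are illustrative assumptions.

# A minimal sketch of the pre-processing pipeline, not the repository's actual
# script (see preprocess/). NLTK's built-in tagger stands in for the Stanford
# Log-Linear POS Tagger, and the anchor sentences for polarity are illustrative.
import nltk
from sentence_transformers import SentenceTransformer
from sklearn.metrics.pairwise import cosine_similarity

nltk.download("punkt")                        # tokenizer models
nltk.download("averaged_perceptron_tagger")   # POS tagger models

sentence = "The movie was surprisingly good."

# 1. Tokenize and POS-tag the sentence.
tokens = nltk.word_tokenize(sentence)
pos_tags = nltk.pos_tag(tokens)   # e.g. [('The', 'DT'), ('movie', 'NN'), ...]

# 2. Score sentiment polarity with Sentence-BERT by comparing the sentence
#    embedding to embeddings of positive/negative anchor sentences.
encoder = SentenceTransformer("bert-base-nli-mean-tokens")
sent_emb, pos_emb, neg_emb = encoder.encode(
    [sentence, "This is great.", "This is terrible."]
)
polarity = (cosine_similarity([sent_emb], [pos_emb])[0][0]
            - cosine_similarity([sent_emb], [neg_emb])[0][0])
print(pos_tags, polarity)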

Pre-training

If you want to conduct pre-training by yourself instead of directly using the checkpoint we provide, this part may help you pre-process the pre-training dataset and run the pre-training scripts.

Dataset

We use the Yelp Dataset Challenge 2019 as our pre-training dataset. According to the Terms of Use of the Yelp dataset, you should download the dataset on your own.
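
For reference, the Yelp dataset distributes reviews as JSON Lines. A sketch of extracting review text and star ratings might look like the following (the file and field names follow the public Yelp Open Dataset release and may differ in your download):

# Sketch: iterate over the Yelp review file (JSON Lines) and extract the
# review text and star rating. File and field names follow the public Yelp
# Open Dataset release and are assumptions about your local copy.
import json

with open("yelp_academic_dataset_review.json", encoding="utf-8") as f:
    for line in f:
        review = json.loads(line)
        text, stars = review["text"], review["stars"]
        # ... feed (text, stars) into the POS tagging / polarity pipeline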

POS Tagging and Polarity Acquisition for Pre-training Dataset

Similar to fine-tuning, we also conduct part-of-speech tagging and sentiment polarity acquisition on the pre-training dataset. Note that the pre-training dataset is quite large, so pre-processing may take a long time: Sentence-BERT has to compute representation vectors for every sentence in the dataset.
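
If you run this step yourself, encoding the sentences in batches and caching the resulting vectors to disk keeps the cost to a single pass. Here is a rough sketch; the file names and batch size are arbitrary placeholders, not the repository's actual settings.

# Sketch: batch-encode all pre-training sentences with Sentence-BERT once and
# cache the vectors, so the expensive encoding pass is not repeated.
# File names and batch size are arbitrary placeholders.
import numpy as np
from sentence_transformers import SentenceTransformer

encoder = SentenceTransformer("bert-base-nli-mean-tokens")

with open("pretrain_sentences.txt", encoding="utf-8") as f:
    sentences = [line.strip() for line in f]

# encode() batches internally; a larger batch_size amortizes GPU overhead.
embeddings = encoder.encode(sentences, batch_size=64, show_progress_bar=True)
np.save("pretrain_sentence_vectors.npy", np.asarray(embeddings))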

Pre-training

Refer to pretrain/README.MD for more implementation details about pre-training.

Citation

@inproceedings{ke-etal-2020-sentilare,
    title = "{S}enti{LARE}: Sentiment-Aware Language Representation Learning with Linguistic Knowledge",
    author = "Ke, Pei  and Ji, Haozhe  and Liu, Siyang  and Zhu, Xiaoyan  and Huang, Minlie",
    booktitle = "Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP)",
    month = nov,
    year = "2020",
    address = "Online",
    publisher = "Association for Computational Linguistics",
    pages = "6975--6988",
}

Please cite our paper if you find the paper or the code helpful.

Thanks

Many thanks to the GitHub repositories of Transformers and BERT-PT. Part of our code is modified from their implementations.


sentilare's Issues

Tensor dimension

A dimension problem was encountered while running prep_sgni.py:
[Screenshot of the error message omitted.]

Pretraining code of Label-aware MLM

I have read your paper and found the label-aware masked language model pre-training with SentiWordNet integration very interesting. I am curious to see how it is implemented in practice, for my research.

I was wondering if you had any plans to publish the code of this part of the research?

The config.json should have a `model_type` key

ValueError: Unrecognized model in ../SentiLARE. Should have a model_type key in its config.json, or contain one of the following strings in its name: fnet, gptj, layoutlmv2, beit, rembert, visual_bert, canine, roformer, clip, bigbird_pegasus, deit, luke, detr, gpt_neo, big_bird, speech_to_text_2, speech_to_text, vit, wav2vec2, m2m_100, convbert, led, blenderbot-small, retribert, ibert, mt5, t5, mobilebert, distilbert, albert, bert-generation, camembert, xlm-roberta, pegasus, marian, mbart, megatron-bert, mpnet, bart, blenderbot, reformer, longformer, roberta, deberta-v2, deberta, flaubert, fsmt, squeezebert, hubert, bert, openai-gpt, gpt2, transfo-xl, xlnet, xlm-prophetnet, prophetnet, xlm, ctrl, electra, speech-encoder-decoder, encoder-decoder, funnel, lxmert, dpr, layoutlm, rag, tapas, splinter

I ask because I want to use it to get sentence embeddings.

Question about the data pre-processing step

Hello, I have a question about data pre-processing. My datasets are IMDB, yelp2013, and yelp2014, which are 10-, 5-, and 5-class classification problems respectively. Should the data be organized like the data under raw_data/sent/imdb, i.e. sentence + label? (Do multi-class problems require changing the corresponding code?) Should I then use preprocess/prep_sent.py for pre-processing?
Thank you for reading; I look forward to your reply!

Fine-tuning on the IMDB and SST downstream tasks gives poor performance

Hi,

I have downloaded the pre-trained model and the pre-processed data (*_newpos.txt) from Google Drive.
For SST, the reported validation result is 55.04% in Table 15 and I get 56.04%. However, I can only get 84.92% on IMDB, while the reported performance is 95.96% in Table 15.
Also, when I fine-tune the released model on downstream emotion recognition tasks, the results are worse than the BERT-base and RoBERTa-base models.

Are there any suggestions for these problems? Thanks!

for lr in 2e-5
do
  data_dir=/data7/emobert/resources/pretrained/sentilare/raw_data/sent/sst/
  CUDA_VISIBLE_DEVICES=${gpuid} python run_sent_sentilr_roberta.py \
            --cvNo 1 \
            --data_dir ${data_dir} \
            --model_type roberta \
            --model_name_or_path ${pretrain_model_dir} \
            --task_name sst \
            --do_train \
            --do_eval \
            --max_seq_length 50 \
            --per_gpu_train_batch_size 32 \
            --per_gpu_eval_batch_size 32 \
            --learning_rate ${lr} \
            --num_train_epochs 8 \
            --warmup_steps 100 \
            --eval_all_checkpoints \
            --overwrite_output_dir \
            --seed 42 \
            --output_dir ${output_dir}/${corpus_name}_roberta_base_finetune_lr${lr}_bs32_m2
done
