Unlike cross-encoders, bi-encoders embed each sentence independently and compute similarity between the resulting embeddings.
This makes it practical to deploy deep models in real-world applications, because the representations of candidate sentences can be pre-computed and cached.
These days, bi-encoders trained via unsupervised contrastive learning are widely used.
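The caching idea looks roughly like this (a minimal sketch with mean pooling over `transformers` outputs; the base model and pooling choice are illustrative assumptions, not this repo's exact code):

```python
import torch
from transformers import AutoModel, AutoTokenizer

# Illustrative off-the-shelf encoder; this repo trains its own.
tokenizer = AutoTokenizer.from_pretrained("roberta-base")
model = AutoModel.from_pretrained("roberta-base").eval()

@torch.no_grad()
def embed(sentences):
    # Mean-pool token embeddings into one fixed-size vector per sentence.
    batch = tokenizer(sentences, padding=True, truncation=True, return_tensors="pt")
    hidden = model(**batch).last_hidden_state
    mask = batch["attention_mask"].unsqueeze(-1)
    return (hidden * mask).sum(1) / mask.sum(1)

# Embed the candidates ONCE and cache them;
# only the query is encoded at request time.
candidates = ["A man is playing guitar.", "The weather is nice today."]
cache = embed(candidates)
query = embed(["Someone plays a musical instrument."])
print(torch.nn.functional.cosine_similarity(query, cache))
```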
I have implemented several bi-encoder models with contrastive learning, referencing the following papers:
- SimCSE: Simple Contrastive Learning of Sentence Embeddings (Gao et al., 2021)
- Text and Code Embeddings by Contrastive Pre-Training (Neelakantan et al., 2022)
- Deep Continuous Prompt for Contrastive Learning of Sentence Embeddings (Jiang and Wang, 2022)
- Prefix-Tuning: Optimizing Continuous Prompts for Generation (Li and Liang, 2021)
Paper reviews are available on my blog (in Korean).
The models are NOT exactly the same as those in the original papers.
- Train BERT or RoBERTa with contrastive loss
- Unsupervised SimCSE uses dropout to obtain positive pairs (see the sketch after this list)
- Train SimCSE with Prefix-Tuning, which enables memory- and time-efficient training
- DCPCSE reports a small performance gain with a similar approach, but in my experiments it did not work well, especially in the unsupervised setting
- Train GPT-2 with contrastive loss
- The original CPT does not support a fully-unsupervised setting the way SimCSE does with dropout; it uses weak supervision from noisy Internet documents
- However, in this repo, unsupervised CPT is implemented in the same dropout-based way as SimCSE
- Train CPT with Prefix-Tuning
- Training of unsupervised CPT is very unstable, and the early stages of training largely determine the final model's performance
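The dropout trick amounts to encoding the same batch twice in train mode and treating the two views of each sentence as a positive pair, with the rest of the batch as negatives. A minimal sketch of that loss ([CLS] pooling and the temperature value are assumptions in the spirit of SimCSE, not necessarily what the training scripts do):

```python
import torch
import torch.nn.functional as F

def unsup_simcse_loss(encoder, input_ids, attention_mask, temp=0.05):
    """Unsupervised SimCSE loss: dropout provides the positive pairs."""
    # Two forward passes in train mode apply two different dropout masks.
    z1 = encoder(input_ids, attention_mask=attention_mask).last_hidden_state[:, 0]
    z2 = encoder(input_ids, attention_mask=attention_mask).last_hidden_state[:, 0]
    # Pairwise cosine similarities: [batch, batch].
    sim = F.cosine_similarity(z1.unsqueeze(1), z2.unsqueeze(0), dim=-1) / temp
    # Diagonal entries (same sentence, different dropout) are the positives;
    # every other entry in a row acts as an in-batch negative.
    labels = torch.arange(sim.size(0), device=sim.device)
    return F.cross_entropy(sim, labels)
```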
- pandas
- numpy
- scipy
- torch (1.11.0)
- transformers (4.18.0)
- tensorboard
Download the datasets for training and evaluation.
I used the official SimCSE training sets for all models and the STS Benchmark dataset for evaluation.
git clone https://github.com/ChainsmokersAI/Contrastive-Sentence-Encoder.git
cd Contrastive-Sentence-Encoder/
# download datasets and create the directory where trained models will be saved
sh download_dataset.sh
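Once downloaded, the files can be inspected as below (a sketch; the column names follow the official SimCSE release of `nli_for_simcse.csv` and are an assumption here):

```python
import pandas as pd

# Supervised set: (premise, entailment, contradiction) triplets from NLI.
nli = pd.read_csv("./dataset/nli_for_simcse.csv")
print(nli.columns.tolist())  # expected: ['sent0', 'sent1', 'hard_neg']

# Unsupervised set: one raw Wikipedia sentence per line.
with open("./dataset/wiki1m_for_simcse.txt") as f:
    sentences = [line.strip() for line in f]
print(len(sentences), sentences[0])
```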
Example 1) Supervised SimCSE
# nli_for_simcse.csv is the dataset for supervised models
python train.py --model=simcse-sup \
--base=roberta-base \
--dataset=./dataset/nli_for_simcse.csv \
--ddp=True \
--batch=32 \
--accum=2 \
--lr=5e-5 \
--epochs=3
Example 2) Unsupervised CPT with Prefix-Tuning
# wiki1m_for_simcse.txt is the dataset for unsupervised models;
# --preseqlen is the sequence length of the prefix, --hidden its hidden dimension
python train.py --model=cpt-unsup-prefix \
--base=gpt2 \
--dataset=./dataset/wiki1m_for_simcse.txt \
--ddp=True \
--preseqlen=5 \
--hidden=512
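In Prefix-Tuning, `--preseqlen` and `--hidden` describe a small trainable module that produces per-layer key/value prefixes, fed to the frozen LM as `past_key_values`. A minimal sketch for GPT-2 (the module and its MLP shape follow the spirit of Li and Liang, 2021, and are assumptions, not this repo's exact implementation):

```python
import torch
import torch.nn as nn
from transformers import GPT2Model

class PrefixEncoder(nn.Module):
    """Maps trainable prefix embeddings to per-layer past key/values."""
    def __init__(self, config, preseqlen=5, hidden=512):
        super().__init__()
        self.preseqlen = preseqlen
        self.n_layer, self.n_head = config.n_layer, config.n_head
        self.head_dim = config.n_embd // config.n_head
        self.embedding = nn.Embedding(preseqlen, config.n_embd)
        # Reparameterization MLP: n_embd -> hidden -> keys+values for all layers.
        self.mlp = nn.Sequential(
            nn.Linear(config.n_embd, hidden),
            nn.Tanh(),
            nn.Linear(hidden, self.n_layer * 2 * config.n_embd),
        )

    def forward(self, batch_size):
        idx = torch.arange(self.preseqlen).unsqueeze(0).expand(batch_size, -1)
        past = self.mlp(self.embedding(idx))  # [B, preseqlen, n_layer*2*n_embd]
        past = past.view(batch_size, self.preseqlen, self.n_layer * 2,
                         self.n_head, self.head_dim)
        # -> n_layer tuples of (key, value), each [B, n_head, preseqlen, head_dim]
        past = past.permute(2, 0, 3, 1, 4).split(2)
        return tuple((kv[0], kv[1]) for kv in past)

gpt2 = GPT2Model.from_pretrained("gpt2")
for p in gpt2.parameters():
    p.requires_grad = False  # only the prefix parameters are trained

prefix = PrefixEncoder(gpt2.config, preseqlen=5, hidden=512)
input_ids = torch.tensor([[464, 3290, 318, 13779]])  # arbitrary token ids
# The attention mask must cover the prefix positions as well as real tokens.
mask = torch.ones(1, prefix.preseqlen + input_ids.size(1))
out = gpt2(input_ids, past_key_values=prefix(input_ids.size(0)), attention_mask=mask)
```

Since only the prefix module's parameters need to be saved, this is presumably what keeps the checkpoint sizes in the results tables below small.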
Evaluate the trained models on the STS Benchmark dataset.
Example 1) Supervised SimCSE
python evaluate.py --model=simcse-sup \
--base=roberta-base \
--path=./model/simcse-sup\(roberta-base\)_batch256_lr5e-05_step250.pth # trained model path
Example 2) Unsupervised CPT with Prefix-Tuning
python evaluate.py --model=cpt-unsup-prefix \
--base=gpt2 \
--path=./model/cpt-unsup-prefix\(gpt2\)_preseqlen5_hidden512_batch512_lr5e-05_step250.pth \
--preseqlen=5 \
--hidden=512
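Evaluation ultimately correlates the cosine similarity of each sentence pair with its gold STS-B score. A minimal sketch of that final step (the `embed` helper is the hypothetical one from the caching sketch above, not a function from this repo):

```python
from scipy.stats import spearmanr
import torch.nn.functional as F

def sts_spearman(embed, pairs, gold_scores):
    """Spearman correlation between cosine similarities and gold ratings.

    pairs: list of (sentence1, sentence2); gold_scores: human ratings (0-5).
    """
    a = embed([s1 for s1, _ in pairs])
    b = embed([s2 for _, s2 in pairs])
    preds = F.cosine_similarity(a, b).tolist()
    return spearmanr(preds, gold_scores).correlation
```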
Models are saved every 250 steps; the best results are shown below.
Model | Base LM | Batch Size (batch × accum) | LR | Epochs | Spearman's ρ (×100) |
---|---|---|---|---|---|
simcse-sup | roberta-base (125M) | 256 (128×2) | 5e-5 | 3 | 84.20 |
simcse-unsup | roberta-base | 256 (128×2) | 5e-5 | 3 | 80.80 |
cpt-sup | gpt2 (117M) | 192 (96×2) | 1e-4 | 10 | 77.50 |
cpt-unsup | gpt2 | 192 (96×2) | 1e-4 | 3 | 66.64 |
with Prefix-Tuning
Model | Base | Prefix (len/hidden) | Batch Size (batch × accum) | LR | Epochs | Spearman's ρ (×100) | Saved Size |
---|---|---|---|---|---|---|---|
simcse-sup-prefix | roberta-base | 10/768 | 128 (128×1) | 5e-5 | 1 | 82.69 | 59.1 MB |
cpt-sup-prefix | gpt2 | 5/512 | 192 (96×2) | 1e-4 | 10 | 74.04 | 41.8 MB |
cpt-unsup-prefix | gpt2 | 5/512 | 192 (96×2) | 1e-4 | 3 | 69.08 | 41.8 MB |