Contrastive Learning for Sentence Embeddings

Unlike cross-encoders, bi-encoders encode each sentence independently and compute similarity between the resulting embeddings.
This makes deep models practical in real-world applications, because the representations of candidate sentences can be precomputed and cached.
These days, bi-encoders trained via unsupervised contrastive learning are widely used.
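
For illustration, here is a minimal bi-encoder sketch using PyTorch/Transformers; the mean pooling used here is an arbitrary choice for the example, not necessarily what this repository's models do.

import torch
import torch.nn.functional as F
from transformers import AutoModel, AutoTokenizer

# any encoder works; mean pooling over the last hidden state is one simple choice
tokenizer = AutoTokenizer.from_pretrained("roberta-base")
model = AutoModel.from_pretrained("roberta-base").eval()

def encode(sentences):
    batch = tokenizer(sentences, padding=True, truncation=True, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**batch).last_hidden_state     # (n, seq_len, dim)
    mask = batch["attention_mask"].unsqueeze(-1)      # (n, seq_len, 1)
    return (hidden * mask).sum(1) / mask.sum(1)       # mean-pooled sentence embeddings

# candidate embeddings are computed once and cached
candidates = ["A man is playing a guitar.", "A dog runs in the park."]
cached = encode(candidates)

# at query time only the query is encoded; scoring against the cache is cheap
query = encode(["Someone plays an instrument."])
scores = F.cosine_similarity(query, cached)
print(candidates[scores.argmax()])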

I have implemented several bi-encoder models with contrastive learning, referencing the following papers:

  • SimCSE: Simple Contrastive Learning of Sentence Embeddings (Gao et al., 2021)
  • Text and Code Embeddings by Contrastive Pre-Training (Neelakantan et al., 2022)
  • Deep Continuous Prompt for Contrastive Learning of Sentence Embeddings (Jiang and Wang, 2022)
  • Prefix-Tuning: Optimizing Continuous Prompts for Generation (Li and Liang, 2021)

Paper reviews are available on my blog (in Korean).

Models

The models are NOT exactly the same as those in the original papers.

SimCSE (sup/unsup)

  • Train BERT or RoBERTa with Contrastive Loss
  • Unsupervised SimCSE uses Dropout to obtain positive pairs (a minimal loss sketch follows this list)
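
As a rough illustration (not this repository's exact code), the unsupervised objective can be written as an in-batch InfoNCE loss in which the positive for each sentence is a second, dropout-perturbed encoding of itself:

import torch
import torch.nn.functional as F

def unsup_simcse_loss(emb1, emb2, temperature=0.05):
    # emb1, emb2: (batch, dim) embeddings of the SAME sentences, encoded twice
    # with dropout active so the two views differ slightly
    emb1 = F.normalize(emb1, dim=-1)
    emb2 = F.normalize(emb2, dim=-1)
    sim = emb1 @ emb2.t() / temperature    # (batch, batch) cosine similarity matrix
    labels = torch.arange(sim.size(0))     # positives lie on the diagonal
    return F.cross_entropy(sim, labels)

# usage sketch with random embeddings
loss = unsup_simcse_loss(torch.randn(8, 768), torch.randn(8, 768))

The supervised variant uses the same loss with real positive (and hard negative) sentences from NLI data instead of dropout-based views.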

SimCSE with Prefix-Tuning (sup)

  • Train SimCSE with Prefix-Tuning, which enables memory- and time-efficient training (a minimal sketch follows this list)
  • DCPCSE reports a small performance gain with a similar approach, but in my experiments this model did not work very well, especially in the unsupervised setting
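
A minimal sketch of the prefix-tuning idea, shown here with GPT-2 because its past_key_values interface makes the injection easy to illustrate; names, shapes, and the MLP reparameterization are illustrative, not this repository's code:

import torch
from transformers import GPT2Model

preseqlen, hidden = 5, 512
model = GPT2Model.from_pretrained("gpt2")
for p in model.parameters():
    p.requires_grad = False                # the base LM stays frozen

cfg = model.config
# a short learned embedding is mapped by an MLP to per-layer key/value prefixes
prefix_emb = torch.nn.Embedding(preseqlen, cfg.n_embd)
prefix_mlp = torch.nn.Sequential(
    torch.nn.Linear(cfg.n_embd, hidden),
    torch.nn.Tanh(),
    torch.nn.Linear(hidden, cfg.n_layer * 2 * cfg.n_embd),
)

def get_prefix(batch_size):
    past = prefix_mlp(prefix_emb(torch.arange(preseqlen)))   # (preseqlen, n_layer*2*n_embd)
    past = past.view(preseqlen, cfg.n_layer * 2, cfg.n_head, cfg.n_embd // cfg.n_head)
    past = past.unsqueeze(1).expand(-1, batch_size, -1, -1, -1)
    past = past.permute(2, 1, 3, 0, 4).split(2)  # n_layer chunks of (2, batch, n_head, preseqlen, head_dim)
    return [(kv[0], kv[1]) for kv in past]

# the prefix is prepended via past_key_values; the attention mask must also cover it
input_ids = torch.randint(0, cfg.vocab_size, (4, 16))
mask = torch.ones(4, preseqlen + 16)
out = model(input_ids, past_key_values=get_prefix(4), attention_mask=mask)

Only prefix_emb and prefix_mlp receive gradients during training, which is why only a few tens of megabytes need to be saved per model (see the Size column in the results below).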

CPT (sup/unsup)

  • Train GPT-2 with Contrastive Loss
  • The original CPT does not support a fully-unsupervised setting the way SimCSE does with Dropout; it uses weak supervision from noisy Internet documents
  • However, in this work, unsupervised CPT is implemented in the same way as unsupervised SimCSE (a pooling sketch follows this list)
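
For illustration, one common way to obtain a sentence embedding from GPT-2 is the hidden state of an appended end-of-text token; the exact pooling in this repository may differ:

import torch
from transformers import GPT2Model, GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
tokenizer.pad_token = tokenizer.eos_token         # GPT-2 has no pad token by default
model = GPT2Model.from_pretrained("gpt2").eval()

def gpt2_embed(sentences):
    # append the EOS token so every sentence ends with the same special token
    batch = tokenizer([s + tokenizer.eos_token for s in sentences],
                      padding=True, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**batch).last_hidden_state  # (n, seq_len, dim)
    last = batch["attention_mask"].sum(1) - 1      # position of the appended EOS token
    return hidden[torch.arange(hidden.size(0)), last]

print(gpt2_embed(["A man is playing a guitar."]).shape)   # torch.Size([1, 768])

These embeddings are then trained with the same contrastive loss as above.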

CPT with Prefix-Tuning (sup/unsup)

  • Train CPT with Prefix-Tuning
  • Training unsupervised CPT is very unstable, and the early stages of training largely determine the final model's performance

Usage (OS: Ubuntu)

Packages

  • pandas
  • numpy
  • scipy
  • torch (1.11.0)
  • transformers (4.18.0)
  • tensorboard

Download Datasets

Download the datasets for training and evaluation.
I have used the official SimCSE training sets for all models and the STS Benchmark dataset for evaluation.

git clone https://github.com/ChainsmokersAI/Contrastive-Sentence-Encoder.git
cd Contrastive-Sentence-Encoder/
# download datasets and make directory where trained models will be saved
sh download_dataset.sh

Training

Example 1) Supervised SimCSE

# nli_for_simcse.csv: dataset for supervised models
python train.py --model=simcse-sup \
--base=roberta-base \
--dataset=./dataset/nli_for_simcse.csv \
--ddp=True \
--batch=32 \
--accum=2 \
--lr=5e-5 \
--epochs=3

Example 2) Unsupervised CPT with Prefix-Tuning

# wiki1m_for_simcse.txt: dataset for unsupervised models
# --preseqlen: sequence length of the prefix, --hidden: hidden dimension size of the prefix
python train.py --model=cpt-unsup-prefix \
--base=gpt2 \
--dataset=./dataset/wiki1m_for_simcse.txt \
--ddp=True \
--preseqlen=5 \
--hidden=512

Evaluation

Evaluate trained models on the STS Benchmark dataset.

Example 1) Supervised SimCSE

python evaluate.py --model=simcse-sup \
--base=roberta-base \
--path=./model/simcse-sup\(roberta-base\)_batch256_lr5e-05_step250.pth # trained model path

Example 2) Unsupervised CPT with Prefix-Tuning

python evaluate.py --model=cpt-unsup-prefix \
--base=gpt2 \
--path=./model/cpt-unsup-prefix\(gpt2\)_preseqlen5_hidden512_batch512_lr5e-05_step250.pth \
--preseqlen=5 \
--hidden=512
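
The Spearmanr column below reports Spearman correlation on STS-B; typically this is computed between the model's cosine similarities and the gold scores, roughly as in the sketch below (the actual loading and pooling live in evaluate.py):

import torch
import torch.nn.functional as F
from scipy.stats import spearmanr

def stsb_spearman(emb1, emb2, gold_scores):
    # emb1, emb2: embeddings of the first and second sentence of each STS-B pair
    pred = F.cosine_similarity(emb1, emb2)        # one similarity score per pair
    return spearmanr(pred.tolist(), gold_scores).correlation

# usage sketch with random embeddings and made-up gold scores
print(stsb_spearman(torch.randn(8, 768), torch.randn(8, 768),
                    [0.0, 1.5, 2.0, 3.2, 4.8, 5.0, 2.5, 3.0]))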

Results

Models are saved every 250 steps, and the best results are shown below.

Model         Base LM              Batch Size               LR    Epochs  Spearmanr
simcse-sup    roberta-base (125M)  256 (batch 128*accum 2)  5e-5  3       84.20
simcse-unsup  roberta-base         256 (128*2)              5e-5  3       80.80
cpt-sup       gpt2 (117M)          192 (96*2)               1e-4  10      77.50
cpt-unsup     gpt2                 192 (96*2)               1e-4  3       66.64

with Prefix-Tuning

Model              Base          Prefix (len/hidden)  Batch        LR    Epochs  Spearmanr  Size
simcse-sup-prefix  roberta-base  10/768               128 (128*1)  5e-5  1       82.69      59.1MB
cpt-sup-prefix     gpt2          5/512                192 (96*2)   1e-4  10      74.04      41.8MB
cpt-unsup-prefix   gpt2          5/512                192 (96*2)   1e-4  3       69.08      41.8MB
