
sentence-similarity's Introduction

Sentence Similarity



1. Datasets

All of the datasets below are in Chinese.

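| Data  | size(train) | size(valid) | size(test) |
|-------|-------------|-------------|------------|
| ATEC  | 62477       | 20000       | 20000      |
| BQ    | 100000      | 10000       | 10000      |
| LCQMC | 238766      | 8802        | 12500      |
| PAWSX | 49401       | 2000        | 2000       |
| STS-B | 5231        | 1458        | 1361       |
| SNLI  | 146828      | 2699        | 2618       |
| MNLI  | 122547      | 2932        | 2397       |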

Training sets: SNLI and MNLI
Test sets: ATEC, BQ, LCQMC, PAWSX, and STS-B


2. Models

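Some datasets have small test splits, which can make the evaluation results unreliable. The evaluation here therefore uses the train, valid, and test splits together, and the final score is their weighted average (w-avg).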
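The weighted average can be sketched as follows. This is a minimal illustration, not the project's evaluation script: it assumes each split is weighted by its number of examples, which the text above does not state explicitly. The function name `weighted_average` and the per-split scores are hypothetical; only the STS-B split sizes come from the dataset table.

```python
def weighted_average(scores, sizes):
    """Average per-split scores, weighting each split by its example count."""
    if len(scores) != len(sizes):
        raise ValueError("scores and sizes must have the same length")
    total = sum(sizes)
    if total <= 0:
        raise ValueError("total size must be positive")
    return sum(score * size for score, size in zip(scores, sizes)) / total


# STS-B split sizes from the dataset table: train, valid, test.
stsb_sizes = [5231, 1458, 1361]

# Hypothetical per-split scores, for illustration only (not measured results).
stsb_scores = [80.0, 82.0, 78.0]

w_avg = weighted_average(stsb_scores, stsb_sizes)
print(f"STS-B w-avg: {w_avg:.2f}")
```

Under this weighting the large train split dominates the result, so a noisy score on a small test split moves the final number only slightly.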

Based on RoBERTa Base

All models in this table use the same language model, RoBERTa Base.

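| Model                    | STS-B | ATEC | BQ | LCQMC | PAWSX | Avg. |
|--------------------------|-------|------|----|-------|-------|------|
| BERT-Whitening           | 65.27 | -    | -  | -     | -     | -    |
| SimBERT                  | 70.01 | -    | -  | -     | -     | -    |
| SBERT-Whitening          | 71.75 | -    | -  | -     | -     | -    |
| BAAI/bge-base-zh         | -     | -    | -  | -     | 78.61 | -    |
| hellonlp/simcse-base-zh  | 80.96 | -    | -  | -     | -     | -    |
| hellonlp/promcse-base-zh | 81.57 | -    | -  | -     | -     | -    |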

Based on RoBERTa Large

All models in this table use the same language model, RoBERTa Large.

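| Model                     | STS-B(w-avg) | ATEC | BQ | LCQMC | PAWSX | Avg. |
|---------------------------|--------------|------|----|-------|-------|------|
| BAAI/bge-large-zh         | 78.61        | -    | -  | -     | -     | -    |
| BAAI/bge-large-zh-v1.5    | 79.07        | -    | -  | -     | -     | -    |
| hellonlp/simcse-large-zh  | 81.32        | -    | -  | -     | -     | -    |
| hellonlp/promcse-large-zh | 81.63        | -    | -  | -     | -     | -    |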

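3. References

Embedding Performance Comparison for RAG
Text Semantic Similarity | PromCSE in Practice
Text Semantic Similarity | SimCSE in Practice
Text Semantic Similarity | Sentence BERT in Practice
Text Semantic Similarity | BERT Whitening in Practice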
