๐ŸŽ๏ธ FlashRank

Ultra-lite & Super-fast Python library to add re-ranking to your existing search & retrieval pipelines. It is based on SoTA cross-encoders.

  1. ⚡ Ultra-lite:

    • No Torch or Transformers needed. Runs on CPU.
    • Boasts the tiniest reranking model in the world, ~4MB.
  2. ⏱️ Super-fast:

    • Rerank speed is a function of the number of tokens in the passages and the query, plus the model depth (layers).
    • To give an idea of latency, a timing sketch for the default model follows this list.
    • Detailed benchmarking, TBD.
  3. 💸 $ conscious:

    • Lowest $ per invocation: Serverless deployments like Lambda are charged by memory & time per invocation*
    • Smaller package size = shorter cold start times, quicker re-deployments for Serverless.
  4. 🎯 Based on SoTA Cross-encoders:

    • Below is the list of models supported as of now:

      • ms-marco-TinyBERT-L-2-v2 (default)
      • ms-marco-MiniLM-L-12-v2
      • ms-marco-MultiBERT-L-12 (Multi-lingual, supports 100+ languages)
    • Why only sleeker models? Reranking is the final leg of larger retrieval pipelines, and the idea is to avoid any extra overhead, especially in user-facing scenarios. To that end, models with a really small footprint that need no specialised hardware yet offer competitive performance are chosen. Feel free to raise issues to request support for new models as you see fit.
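
Since rerank latency depends on token counts and model depth, the quickest way to get a concrete number for your setup is to time a single rerank() call. The snippet below is a minimal, unofficial timing sketch that uses only the Ranker API shown in the Usage section; the passages are illustrative and absolute timings will vary with your hardware.

import time
from flashrank.Ranker import Ranker

# Default model: ms-marco-TinyBERT-L-2-v2 (~4MB), runs on CPU.
ranker = Ranker()

query = "Tricks to accelerate LLM inference"
passages = [
    "vLLM is a fast and easy-to-use library for LLM inference and serving.",
    "Introduce lookahead decoding: a parallel decoding algorithm to accelerate LLM inference.",
    "Medusa removes the draft model needed for speculative decoding while giving a 2x speedup.",
]

start = time.perf_counter()
results = ranker.rerank(query, passages)
elapsed_ms = (time.perf_counter() - start) * 1000
print(f"Reranked {len(passages)} passages in {elapsed_ms:.1f} ms")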

🚀 Installation:

pip install flashrank

Usage:

from flashrank.Ranker import Ranker
# Nano (~4MB), blazing fast model & competitive performance (ranking precision).
ranker = Ranker()

or 

# Small (~34MB), slightly slower & best performance (ranking precision).
ranker = Ranker(model_name="ms-marco-MiniLM-L-12-v2", cache_dir="/opt")

or 

# Medium (~150MB), slower model with best performance (ranking precision) for 100+ languages including en.
ranker = Ranker(model_name="ms-marco-MultiBERT-L-12", cache_dir="/opt")
query = "Tricks to accelerate LLM inference"
passages = [
    "Introduce *lookahead decoding*: - a parallel decoding algo to accelerate LLM inference - w/o the need for a draft model or a data store - linearly decreases # decoding steps relative to log(FLOPs) used per decoding step.",
    "LLM inference efficiency will be one of the most crucial topics for both industry and academia, simply because the more efficient you are, the more $$$ you will save. vllm project is a must-read for this direction, and now they have just released the paper",
    "There are many ways to increase LLM inference throughput (tokens/second) and decrease memory footprint, sometimes at the same time. Here are a few methods Iโ€™ve found effective when working with Llama 2. These methods are all well-integrated with Hugging Face.  This list is far from exhaustive; some of these techniques can be used in combination with each other and there are plenty of others to try. - Bettertransformer (Optimum Library): Simply call `model.to_bettertransformer()` on your Hugging Face model for a modest improvement in tokens per second.  - Fp4 Mixed-Precision (Bitsandbytes): Requires minimal configuration and dramatically reduces the model's memory footprint.  - AutoGPTQ: Time-consuming but leads to a much smaller model and faster inference. The quantization is a one-time cost that pays off in the long run. ",
    "Ever want to make your LLM inference go brrrrr but got stuck at implementing speculative decoding and finding the suitable draft model? No more pain! Thrilled to unveil Medusa, a simple framework that removes the annoying draft model while getting 2x speedup. ",
    "vLLM is a fast and easy-to-use library for LLM inference and serving. vLLM is fast with: State-of-the-art serving throughput Efficient management of attention key and value memory with PagedAttention Continuous batching of incoming requests Optimized CUDA kernels"
]
results = ranker.rerank(query, passages)
print(results)
[{'score': 0.99806124, 'passage': 'Introduce *lookahead decoding*: - a parallel decoding algo to accelerate LLM inference - w/o the need for a draft model or a data store - linearly decreases # decoding steps relative to log(FLOPs) used per decoding step.'}, 
{'score': 0.95966834, 'passage': "There are many ways to increase LLM inference throughput (tokens/second) and decrease memory footprint, sometimes at the same time. Here are a few methods I've found effective when working with Llama 2. These methods are all well-integrated with Hugging Face.  This list is far from exhaustive; some of these techniques can be used in combination with each other and there are plenty of others to try. - Bettertransformer (Optimum Library): Simply call `model.to_bettertransformer()` on your Hugging Face model for a modest improvement in tokens per second.  - Fp4 Mixed-Precision (Bitsandbytes): Requires minimal configuration and dramatically reduces the model's memory footprint.  - AutoGPTQ: Time-consuming but leads to a much smaller model and faster inference. The quantization is a one-time cost that pays off in the long run. "}, 
{'score': 0.620731, 'passage': 'vLLM is a fast and easy-to-use library for LLM inference and serving. vLLM is fast with: State-of-the-art serving throughput Efficient management of attention key and value memory with PagedAttention Continuous batching of incoming requests Optimized CUDA kernels'}, 
{'score': 0.56146526, 'passage': 'LLM inference efficiency will be one of the most crucial topics for both industry and academia, simply because the more efficient you are, the more $$$ you will save. vllm project is a must-read for this direction, and now they have just released the paper'}, 
{'score': 0.098350815, 'passage': 'Ever want to make your LLM inference go brrrrr but got stuck at implementing speculative decoding and finding the suitable draft model? No more pain! Thrilled to unveil Medusa, a simple framework that removes the annoying draft model while getting 2x speedup. '}]
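
Since each result is a dict with 'score' and 'passage' and the list comes back ordered by descending score (as in the output above), post-processing is plain list work. The 0.5 threshold below is an arbitrary illustrative cut-off, not a library recommendation.

# Results are already sorted by descending relevance score.
best_passage = results[0]["passage"]
top_3 = results[:3]
# Keep only passages above an arbitrary score threshold (illustrative value).
confident = [r for r in results if r["score"] > 0.5]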

You can use it with any search & retrieval pipeline (a minimal integration sketch follows this list):

  1. Lexical Search (regular DBs that support full-text search, or an inverted index)

  2. Semantic Search / RAG use cases (Vector DBs)

  3. Hybrid Search
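
As a rough illustration of that pattern (not a prescribed integration), the sketch below uses a hypothetical first_stage_search stand-in for whatever lexical, vector, or hybrid retriever you already have; a naive keyword-overlap scorer keeps the snippet self-contained, and only the Ranker calls come from this library's documented API.

from flashrank.Ranker import Ranker

ranker = Ranker()  # default ~4MB model, as in the Usage section

def first_stage_search(query, corpus, k=50):
    # Hypothetical stand-in for a real retriever (BM25, vector DB, hybrid):
    # ranks passages by naive keyword overlap just to keep this sketch runnable.
    q_terms = set(query.lower().split())
    ranked = sorted(corpus, key=lambda p: len(q_terms & set(p.lower().split())), reverse=True)
    return ranked[:k]

def search(query, corpus, k=50, top_n=5):
    candidates = first_stage_search(query, corpus, k=k)  # cheap, recall-oriented stage
    reranked = ranker.rerank(query, candidates)          # precise cross-encoder pass
    return reranked[:top_n]                              # hand the best few to the user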


Deployment patterns

How to use it in an AWS Lambda function?

In AWS and other serverless environments the runtime filesystem is largely read-only, so you may have to point the library at your own directory for loading the models (which also serves as a cache between warm calls). You can do this during init with the cache_dir parameter.

ranker = Ranker(model_name="ms-marco-MiniLM-L-12-v2", cache_dir="/opt")
