speculative_sampling's Introduction

🎛️ Accelerating Large Language Model Decoding with Speculative Sampling

Annotated Paper Notes: llm_decoding.pdf

Abstract

  • The paper introduces a sampling algorithm called speculative sampling, which according to their results speeds up LLM decoding by 2-2.5x without compromising sample quality
  • The basic idea is to use two models together: a smaller, faster model called the draft model, and a bigger model called the target model
  • The draft model is used to generate the “easier” tokens, whereas the target model fills in the rest (the positions where it disagrees with the draft model)

Introduction

  • Typically, the time a transformer model takes to generate a single token is roughly proportional to its number of parameters
  • Each new token depends on the previous ones, so the transformer has to do a full forward pass over the prior context for every token, and this takes longer for larger models with more parameters
  • The proposed algorithm preserves accuracy whilst speeding up this decoding process
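
To make the bottleneck concrete, here is a minimal sketch of plain autoregressive decoding. It is not code from the paper: `toy_model` is a hypothetical stand-in that returns a fixed next-token distribution, and `autoregressive_decode` is an illustrative name. The point is only that every generated token costs one sequential call to the model.

```python
import random


def toy_model(context):
    # Hypothetical stand-in for a language model: ignores the context and
    # returns a fixed next-token distribution over a 3-token vocabulary.
    return [0.5, 0.3, 0.2]


def autoregressive_decode(model, prefix, num_tokens, seed=0):
    """Sample num_tokens tokens one at a time, counting model calls."""
    rng = random.Random(seed)
    tokens = list(prefix)
    forward_passes = 0
    for _ in range(num_tokens):
        # One full forward pass per new token; the next pass cannot start
        # until this token has been sampled and appended.
        dist = model(tokens)
        forward_passes += 1
        next_token = rng.choices(range(len(dist)), weights=dist, k=1)[0]
        tokens.append(next_token)
    return tokens, forward_passes


tokens, passes = autoregressive_decode(toy_model, prefix=[0], num_tokens=5)
print(len(tokens), passes)
```

With a large target model, each of those calls is expensive, which is the cost speculative sampling tries to amortize over several tokens.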

Methodology

  • Setup: You have two models, a main one (q) that's the target model, and a smaller secondary one (p) that's the draft model.
  • Predict in advance: Given the prefix, the auto-regressive draft model thinks ahead and generates K tokens (K is the lookahead)
  • Check each prediction: You then run the target model once on the prefix + the K drafted tokens. This produces a probability distribution for each of the K drafted positions and, at the same time, one for a (K+1)-th token in case it is needed later
  • Decision Time (Rejection Sampling): For each drafted token x, in order:
    • Case 1: If the draft model's word is good enough, meaning the target model gives it at least as much probability as the draft model did (q(x) >= p(x)), you accept this draft word
    • Case 2: If it's not good enough, i.e. q(x) < p(x), draw r from a uniform distribution on [0, 1] and accept the word only if r < q(x)/p(x)
      • Case 2.5: If it is rejected in this case, stop checking the remaining drafts and sample a replacement from an adjusted distribution: (q(x) - p(x))+, i.e. max(0, q(x) - p(x)) renormalized to sum to 1
  • Loop Until Complete: If all K drafts are accepted (you never run into Case 2.5), the extra (K+1)-th token is sampled from the target model. Either way, a new round starts from the extended prefix, and you keep doing this until you've built a sentence that's as long as you wanted it to be
  • Per round, the best case for the algorithm is K+1 new tokens; the worst case is 1 token (the first draft is rejected and only the replacement is kept)
  • In essence, this algorithm tries to 'speculate' or predict a few words ahead and checks if these predictions are good enough to keep. If they are not, it corrects them on the spot.
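
The steps above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the paper's implementation: `toy_draft` and `toy_target` are hypothetical models that return fixed distributions over a 3-token vocabulary, `speculative_step` and `generate` are illustrative names, and the target model is called in a plain loop where a real system would score all K+1 positions in one batched forward pass.

```python
import random


def toy_draft(context):
    # Hypothetical draft model p: fixed distribution, ignores context.
    return [0.2, 0.3, 0.5]


def toy_target(context):
    # Hypothetical target model q: fixed distribution, ignores context.
    return [0.6, 0.3, 0.1]


def sample(dist, rng):
    # Draw a token index from a list of non-negative weights.
    return rng.choices(range(len(dist)), weights=dist, k=1)[0]


def speculative_step(prefix, draft_model, target_model, K, rng):
    """Run one round and return the 1 to K+1 newly emitted tokens."""
    # Predict in advance: draft K tokens autoregressively with p.
    drafted, p_dists = [], []
    context = list(prefix)
    for _ in range(K):
        p = draft_model(context)
        x = sample(p, rng)
        drafted.append(x)
        p_dists.append(p)
        context.append(x)

    # Check each prediction: target distributions for all K+1 positions.
    # (Assumption: looped here for clarity; batched in a real system.)
    q_dists = [target_model(list(prefix) + drafted[:i]) for i in range(K + 1)]

    # Decision time: accept x with probability min(1, q(x) / p(x)).
    # p[x] > 0 always holds because x was sampled from p.
    out = []
    for x, p, q in zip(drafted, p_dists, q_dists):
        if rng.random() < min(1.0, q[x] / p[x]):
            out.append(x)  # Case 1, or Case 2 with a lucky draw
        else:
            # Case 2.5: resample from (q - p)+ and end the round.
            # rng.choices accepts unnormalized weights, so no division needed.
            residual = [max(0.0, qi - pi) for qi, pi in zip(q, p)]
            out.append(sample(residual, rng))
            return out

    # All K drafts accepted: take the bonus (K+1)-th token from q.
    out.append(sample(q_dists[K], rng))
    return out


def generate(prefix, draft_model, target_model, K, max_new_tokens, seed=0):
    """Repeat rounds until max_new_tokens tokens have been produced."""
    rng = random.Random(seed)
    tokens = list(prefix)
    while len(tokens) - len(prefix) < max_new_tokens:
        tokens += speculative_step(tokens, draft_model, target_model, K, rng)
    return tokens[: len(prefix) + max_new_tokens]


print(generate([0], toy_draft, toy_target, K=3, max_new_tokens=10))
```

A rejection can only happen when q(x) < p(x), and in that case some other token must have q > p, so the residual weights in Case 2.5 are never all zero. When the draft and target distributions are identical, every draft is accepted and each round yields K+1 tokens.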

speculative_sampling's People

Contributors

yogesh914
