
LLM Inference Mocking

Overview

Most of this repo is borrowed from the vLLM project, following the server architecture shown below:

[Figure: vLLM server architecture]

This repo mocks the engine with

  • a simple first-come-first-served scheduler with a fixed batch size (a minimal sketch follows this list)
  • a mock of the compute time
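
A minimal sketch of how such a fixed-batch-size FCFS scheduler could look; the `Request` and `FCFSScheduler` names and the decode-first admission order are illustrative assumptions, not the repo's actual API:

```python
from collections import deque
from dataclasses import dataclass, field
from typing import Deque, List

@dataclass
class Request:
    prompt_tokens: List[int]
    generated_tokens: List[int] = field(default_factory=list)

class FCFSScheduler:
    """Illustrative fixed-batch-size, first-come-first-served scheduler."""

    def __init__(self, batch_size: int) -> None:
        self.batch_size = batch_size
        self.waiting: Deque[Request] = deque()  # arrived, prefill not done yet
        self.running: List[Request] = []        # currently decoding

    def add_request(self, req: Request) -> None:
        self.waiting.append(req)

    def schedule_batch(self) -> List[Request]:
        # Keep already-running (decode) requests, then admit waiting (prefill)
        # requests in arrival order until the fixed batch size is reached.
        batch = self.running[: self.batch_size]
        while len(batch) < self.batch_size and self.waiting:
            req = self.waiting.popleft()
            self.running.append(req)
            batch.append(req)
        return batch
```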

Compute Time Mock

Prefills and decodes may be placed in the same batch by the scheduler. When this happens, the whole batch is computed as if it were a prefill batch. More specifically, for a batch of size $b$, let the $i$-th sequence be $seq_i$, let its length be $s_i$, and let the number of its tokens whose KV values still need to be computed be $c_i$ (for a decode, $c_i = 1$); the compute complexity is then $O(b\cdot\max_{i}s_i\cdot\max_{i}c_i)$.

The runtime is then mocked as the combined time of compute and memory access:

$$ \frac{b\cdot\max_{i}s_i\cdot\max_{i}c_i}{c_{tflops}\cdot\alpha}+\frac{s_{model}}{m_{bandwidth}} $$

In the above expression, $\alpha$ is a coefficient to match the compute scale, $s_{model}$ is the model size, and $m_{bandwidth}$ is the GPU memory bandwidth.

In the following experiments, to match the characteristics of an A100 GPU, $\alpha=1500$, $c_{tflops}=312$, $s_{model}=120$ GB, and $m_{bandwidth}=2000$ GB/s. The model size is chosen to mock the llama-65B model.
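
A minimal sketch of the mocked runtime formula above; the function name is hypothetical, and the default values simply mirror the A100 settings listed in the text:

```python
from typing import Sequence

def mock_runtime(
    seq_lens: Sequence[int],       # s_i: total length of each sequence in the batch
    compute_lens: Sequence[int],   # c_i: tokens whose KV values must be computed (1 for a decode)
    alpha: float = 1500.0,         # coefficient to match the compute scale
    c_tflops: float = 312.0,       # A100 peak compute (TFLOPS)
    s_model_gb: float = 120.0,     # model size in GB (mocking llama-65B)
    m_bandwidth_gbps: float = 2000.0,  # GPU memory bandwidth (GB/s)
) -> float:
    """Mocked batch runtime = compute term + memory-access term (see formula above)."""
    b = len(seq_lens)
    compute_time = (b * max(seq_lens) * max(compute_lens)) / (c_tflops * alpha)
    memory_time = s_model_gb / m_bandwidth_gbps
    return compute_time + memory_time

# Example: one prefill of 512 new tokens batched with three decodes.
t = mock_runtime(seq_lens=[512, 300, 700, 450], compute_lens=[512, 1, 1, 1])
```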

Benchmarking

To benchmark the API server, this repo also borrows from the vLLM project, with the following modifications to remove randomness:

  • requests to send are selected deterministically
  • the engine generates responses of a deterministic (though not constant) length

To benchmark, first start the server via python api_server.py, then download the ShareGPT dataset and split the first 1K conversations and the 1K-6K conversations into small-1k.json and medium-5k.json (a sketch of this split follows the command below), and finally run

python benchmark_serving.py --dataset small-1k.json --num-prompts 50
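
A minimal sketch of the dataset split, assuming the downloaded ShareGPT file (named sharegpt.json here as a placeholder) is a JSON list of conversation objects:

```python
import json

# "sharegpt.json" is a placeholder name for the downloaded ShareGPT file,
# assumed to be a JSON list of conversation objects.
with open("sharegpt.json") as f:
    conversations = json.load(f)

with open("small-1k.json", "w") as f:
    json.dump(conversations[:1000], f)       # first 1K conversations

with open("medium-5k.json", "w") as f:
    json.dump(conversations[1000:6000], f)   # 1K-6K conversations
```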

Prefix Cache

Although the specs ask for a radix tree, a trie is a better fit as a cache for requests. For an incoming request, the cached prefix does not need recomputation; the engine only needs to compute the KV values of the remaining tokens. The cache therefore needs to efficiently find the longest matching prefix of the request. Another consideration is that the actual KV cache is stored on GPUs, so the extra memory overhead of a trie compared with a radix tree is minimal.

This repo uses a static prefix cache built from 5K non-overlapping conversations in the ShareGPT dataset. The cache is built as follows:

  1. add all prompts to the trie
  2. greedily select the CACHE_SIZE most common edges in the trie, and prune the other edges

When a new prompt arrives, the scheduler queries the cache for the longest matching prefix and tells the engine the index at which to start computing KV values for this sequence.
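
A minimal sketch of such a token-level trie; the PrefixCache class, its method names, and the tie-breaking toward shallower edges during pruning are illustrative assumptions rather than the repo's actual implementation:

```python
from typing import Dict, List

class TrieNode:
    def __init__(self) -> None:
        self.children: Dict[int, "TrieNode"] = {}  # token id -> child node
        self.count: int = 0                        # prompts passing through this edge

class PrefixCache:
    """Token-level trie used as a static prefix cache (illustrative sketch)."""

    def __init__(self) -> None:
        self.root = TrieNode()

    def insert(self, tokens: List[int]) -> None:
        node = self.root
        for tok in tokens:
            node = node.children.setdefault(tok, TrieNode())
            node.count += 1

    def prune(self, cache_size: int) -> None:
        # Greedily keep the cache_size most common edges. A child edge's count
        # never exceeds its parent's, so sorting by (count desc, depth asc)
        # keeps a subtree that stays connected to the root.
        edges = []  # (count, depth, child node)
        stack = [(self.root, 0)]
        while stack:
            parent, depth = stack.pop()
            for child in parent.children.values():
                edges.append((child.count, depth + 1, child))
                stack.append((child, depth + 1))
        edges.sort(key=lambda e: (-e[0], e[1]))
        keep = {id(e[2]) for e in edges[:cache_size]}
        stack = [self.root]
        while stack:
            node = stack.pop()
            node.children = {t: c for t, c in node.children.items() if id(c) in keep}
            stack.extend(node.children.values())

    def match_len(self, tokens: List[int]) -> int:
        """Length of the longest cached prefix; KV computation starts at this index."""
        node, n = self.root, 0
        for tok in tokens:
            child = node.children.get(tok)
            if child is None:
                break
            node, n = child, n + 1
        return n
```

Building the cache then amounts to inserting the 5K prompts, calling prune(CACHE_SIZE), and calling match_len on each incoming prompt to get the starting index for KV computation.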

Experiment Results

[Figure]

Left: request throughput and matched prefix length with respect to the CACHE_SIZE; Right: 99% TTFT and 99% TPOT latency with respect to the CACHE_SIZE.

The improvement from the prefix cache is minimal because the matched prefix length is small (~1.5 tokens). The TTFT latency decreases because the sequence length to be computed drops by around one token. The TPOT latency also decreases because in some batches decodes are mixed with prefills; such a batch has a significantly larger runtime than a pure decode batch, so reducing the compute by one token shortens the decode time as well.

Effect of Batch Size

[Figure]

Left: the 99% TTFT latency and the improvement of 99% TTFT latency compared with no caching with respect to the CACHE_SIZE; Right: the 99% TPOT latency and the improvement of 99% TPOT latency compared with no caching with respect to the CACHE_SIZE.

The TTFT and TPOT both increase with larger batch sizes. This is because, for the simple FCFS scheduler, a larger batch means more chances to mix prefills with decodes, which wastes a large amount of computation. The improvement from the prefix cache is also more significant at lower batch sizes. This may be because with a larger batch size the TPOT of the no-cache baseline is larger: although the prefix cache reduces the runtime of mixed batches by the equivalent of 1~2 tokens, this improvement is insignificant compared to the larger baseline.
