Comments (2)
I'm not sure what you mean by "shared by multiple concurrent inferences".
If you run with --enable-prefix-caching, then vLLM keeps KV cache entries around in an LRU cache and tries to share them across prompts.
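To make the idea concrete, here is a toy sketch of prefix sharing with LRU eviction. It is not vLLM's implementation: the class `ToyPrefixCache`, the method `process`, and `BLOCK_SIZE` are hypothetical names made up for illustration, and real KV tensors are replaced by placeholders. The assumption it illustrates is that cache entries are keyed by token content only, so two requests with the same leading tokens can hit the same entries no matter who sent them.

```python
from collections import OrderedDict

BLOCK_SIZE = 4  # tokens per cached block (illustrative value, not vLLM's default)


class ToyPrefixCache:
    """Toy model of a prefix cache with LRU eviction (not vLLM code)."""

    def __init__(self, capacity_blocks):
        self.capacity = capacity_blocks
        # Maps a full token prefix (as a tuple) to a placeholder for KV data.
        self.blocks = OrderedDict()

    def process(self, token_ids):
        """Return how many prompt tokens were served from the cache."""
        reused_blocks = 0
        prefix_broken = False
        for i in range(len(token_ids) // BLOCK_SIZE):
            # The key covers every token up to the end of block i, so a hit
            # means the whole prefix matches, not just this block.
            key = tuple(token_ids[: (i + 1) * BLOCK_SIZE])
            if not prefix_broken and key in self.blocks:
                self.blocks.move_to_end(key)  # mark as most recently used
                reused_blocks += 1
            else:
                # After the first miss, the rest must be recomputed.
                prefix_broken = True
                self.blocks[key] = object()  # stand-in for computed KV data
                self.blocks.move_to_end(key)
                while len(self.blocks) > self.capacity:
                    self.blocks.popitem(last=False)  # evict least recently used
        return reused_blocks * BLOCK_SIZE


cache = ToyPrefixCache(capacity_blocks=8)
shared_prefix = list(range(8))                    # e.g. a common system prompt
request_a = shared_prefix + [100, 101, 102, 103]  # sent by one user
request_b = shared_prefix + [200, 201, 202, 203]  # sent by a different user

print(cache.process(request_a))  # 0: nothing is cached yet
print(cache.process(request_b))  # 8: the shared prefix is reused
```

Nothing in the key identifies the sender, which is why the second request reuses the first request's prefix in this sketch. Whether a real deployment gets the same hit depends on the entries not having been evicted in the meantime.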
from vllm.
If multiple users send concurrent inference requests to the same physical machine, can those requests share the KV cache when they have the same prefix, even though they come from different users?