Comments (11)
The turbomind.chat interface is not batched.
If you want batched processing, construct a tm.TurboMind object, then call create_instance from different threads to build per-thread instances. The weights are shared, and inputs are automatically assembled into a batch inside TurboMind.
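A minimal sketch of that pattern: one tm.TurboMind object holds the weights, and each thread calls create_instance() for its own generator. The lmdeploy calls are commented out because the exact constructor and inference arguments here are assumptions; a placeholder worker keeps the sketch runnable.

```python
# Sketch: per-thread TurboMind instances sharing one set of weights.
# Concurrent requests are batched automatically inside TurboMind.
from concurrent.futures import ThreadPoolExecutor

# from lmdeploy import turbomind as tm
# tm_model = tm.TurboMind(model_path="/workspace/llama")  # weights loaded once

def worker(prompt_ids):
    # generator = tm_model.create_instance()  # cheap: shares the weights
    # outputs = generator.stream_infer(...)   # hypothetical call shape
    return len(prompt_ids)  # placeholder so the sketch runs without lmdeploy

prompts = [[1, 2], [3, 4, 5], [6]]
with ThreadPoolExecutor(max_workers=4) as pool:
    results = list(pool.map(worker, prompts))
print(results)
```

The key point is that the expensive object (the weights) is built once, while the per-thread instances are lightweight handles.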
from lmdeploy.
Thanks. How do I test batch inference, or reproduce your benchmark results?
https://github.com/InternLM/lmdeploy/blob/main/benchmark/profile_generation.py
You can use this tool. It works on the same principle as described above.
If you want to build an application, use the serving approach; serving supports multi-batch.
Suggestions on how the Python interface could support multi-batch are also welcome.
Thanks. What do concurrency and session_len mean?
concurrency is the degree of concurrency; you can think of it as the maximum batch size during profiling. session_len is the maximum length of a session. It is actually fixed at deploy time, so the parameter has no effect here.
OK. In https://github.com/InternLM/lmdeploy/blob/main/benchmark/profile_generation.py#L40, what does token on that line mean?
It is the number of tokens the current thread outputs in a single inference pass; it is used to compute token/s.
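Concretely, per-thread throughput is just output tokens divided by wall-clock latency. Using 128 tokens and the ~1.52 s per-request latency reported later in this thread:

```python
# token/s for one thread: output tokens divided by wall-clock latency
tokens = 128          # tokens produced in one inference pass
latency_s = 1.522     # seconds for that pass (sample value from this thread)
throughput = tokens / latency_s
print(round(throughput, 1))  # ≈ 84.1 token/s for this thread
```

With N concurrent threads saturating the batch, aggregate throughput is roughly N times this per-thread figure.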
[[1.5223611899418756, 128, 1.5223611899418756], [1.522044335026294, 128, 1.522044335026294], [1.5217270280700177, 128, 1.5217270280700177], [1.5217442339053378, 128, 1.5217442339053378], [1.522081473027356, 128, 1.522081473027356], [1.5219722180627286, 128, 1.5219722180627286], [1.5227985020028427, 128, 1.5227985020028427], [1.521835200022906, 128, 1.521835200022906], [1.5213377749314532, 128, 1.5213377749314532], [1.5221316709648818, 128, 1.5221316709648818]]
Another very strange thing: why is first_token_latency almost identical to token_latency? The command I ran was python3 lmdeploy/benchmark/profile_generation.py /workspace llama 1 2056 128 128 10
We disable streaming during benchmarking because it hurts performance (in particular, the Python bindings run into GIL issues). As a result, inference actually emits all tokens at once, which is the result you are seeing.
When building an application, streaming is enabled so the user experience is better.
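A toy timing sketch of why the two latencies coincide when streaming is off; the single-chunk behaviour below is an assumption that mirrors the explanation above, not real inference code.

```python
import time

t0 = time.perf_counter()
# With streaming disabled, the backend yields one chunk that already
# contains every output token (toy stand-in for a real inference call).
chunks = [list(range(128))]

first_token_latency = None
for i, chunk in enumerate(chunks):
    elapsed = time.perf_counter() - t0
    if first_token_latency is None:
        first_token_latency = elapsed  # timestamp of the first chunk
    total_latency = elapsed            # timestamp of the last chunk

# Only one chunk arrives, so first-token latency equals total latency.
print(first_token_latency == total_latency)
```

With streaming enabled, the first chunk would arrive much earlier than the last, and the two measurements would diverge.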
How did you benchmark llama-65b? profile_generation.py runs out of memory (OOM) for me.
65B is served with tensor parallelism (tp). profile_generation.py originally benchmarked serving and was later changed to use the Python FFI; you can dig up the old profile_generation.py and try that, or try the branch that adds tp, #82.
Related Issues (20)
- Batch infer seems no speed up HOT 8
- [Bug] In a WSL2 environment, disk write speed is extremely low when quantizing InternVL with 0.4.2 HOT 6
- Does TurboMind support the cogvlm series? HOT 1
- [Bug] UnboundLocalError: local variable 'head_num' referenced before assignment HOT 4
- [feature] need rope_scaling_factor args in benchmark/profile_generation.py to enable dynamic NTK.
- [Feature] Calibration dataset for AWQ quantization
- [Feature] ModuleNotFoundError: No module named 'timm' with Internal Vision model HOT 2
- [Feature] health endpoint HOT 2
- [Feature] model name should be settable or follow original full HF link name, not random new name HOT 6
- [Bug] lmdeploy lite auto_awq: RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cpu and cuda:0! HOT 1
- [Bug] result of W4A16 quantized Qwen1.5-1.8B-Chat model not correct HOT 1
- Support for SWIFT finetuned models HOT 2
- Keep seeing a "using default GEMM algo" WARNING; does falling back to the default GEMM hurt speed or throughput?
- [Bug] Error importing a local model when running internvl-v1.5 quantization from the downloaded code HOT 11
- [Feature] The peft<=0.9.0 version requirement is too low and conflicts with many environments that require peft>0.10; can it be changed?
- [Feature] Support for LLaVA-NeXT
- [Docs] How are multiple images handled? HOT 4
- [Bug] output diff when temperature set zero HOT 3
- batch inference
- [Feature] InternVL-Chat-V1-5-AWQ merge LoRA adapter HOT 3