
Comments (11)

grimoire commented on May 31, 2024

The turbomind.chat interface does not do batching.
If you want batched processing, build one tm.TurboMind object, then call create_instance in different threads to create per-thread instances. The weights are shared, and the inputs are automatically assembled into a batch inside turbomind.
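
For concreteness, a minimal sketch of that pattern (the constructor and stream_infer parameters follow that era's Python API and may differ by version; '/workspace' and the token-id lists are placeholders):

```python
import threading

from lmdeploy import turbomind as tm

# Weights are loaded once in the TurboMind object and shared by all instances.
tm_model = tm.TurboMind(model_path='/workspace')

prompts = [[1, 2023, 338], [1, 3575, 294]]  # pre-tokenized inputs (placeholders)

def worker(session_id, input_ids):
    # Each thread creates its own instance; turbomind gathers the
    # concurrent requests into a batch internally.
    generator = tm_model.create_instance()
    for outputs in generator.stream_infer(session_id=session_id,
                                          input_ids=input_ids,
                                          request_output_len=128):
        pass  # consume outputs here

threads = [threading.Thread(target=worker, args=(i, ids))
           for i, ids in enumerate(prompts, start=1)]
for t in threads:
    t.start()
for t in threads:
    t.join()
```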


leizhao1234 commented on May 31, 2024

OK, thanks. Then how should batch inference be tested, or how can I reproduce your benchmark results?


grimoire commented on May 31, 2024

https://github.com/InternLM/lmdeploy/blob/main/benchmark/profile_generation.py
You can use this tool; the principle is the same as described above.
If you are building an application, you can use the serving approach; serving supports multi-batch.
Suggestions on how the Python interface should support multi-batch are also welcome.
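
A run might look like the line below (my reading of the positional arguments is model_path, model_name, concurrency, session_len, input tokens, output tokens, and test rounds; check the script for the exact order):

```
python3 benchmark/profile_generation.py /workspace llama 8 2056 128 128 10
```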


leizhao1234 commented on May 31, 2024

OK, thanks. What do concurrency and session_len each mean?


grimoire commented on May 31, 2024

concurrency is the degree of concurrency; you can think of it as the maximum batch size during profiling. session_len is the maximum length of a session; it is actually fixed at deploy time, so the parameter has no effect here.


leizhao1234 commented on May 31, 2024

OK. In https://github.com/InternLM/lmdeploy/blob/main/benchmark/profile_generation.py#L40, what does token on that line mean?


grimoire commented on May 31, 2024

It is the number of tokens the current thread outputs in one inference pass; it is used to compute token/s.
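
For illustration, given per-thread records shaped like [first_token_latency, token_count, total_latency] (the shape of the stats pasted below; the field order is my reading of the script), the rate works out as:

```python
# Each record: [first_token_latency_s, generated_tokens, total_latency_s]
# (field order is an assumption based on the stats pasted below)
stats = [[1.5224, 128, 1.5224], [1.5220, 128, 1.5220]]

for first_token_latency, tokens, total_latency in stats:
    print(f'{tokens / total_latency:.1f} token/s, '
          f'first token after {first_token_latency:.3f}s')
```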


leizhao1234 commented on May 31, 2024

[[1.5223611899418756, 128, 1.5223611899418756], [1.522044335026294, 128, 1.522044335026294], [1.5217270280700177, 128, 1.5217270280700177], [1.5217442339053378, 128, 1.5217442339053378], [1.522081473027356, 128, 1.522081473027356], [1.5219722180627286, 128, 1.5219722180627286], [1.5227985020028427, 128, 1.5227985020028427], [1.521835200022906, 128, 1.521835200022906], [1.5213377749314532, 128, 1.5213377749314532], [1.5221316709648818, 128, 1.5221316709648818]]

Also, it is very strange that first_token_latency is almost identical to token_latency. My command is python3 lmdeploy/benchmark/profile_generation.py /workspace llama 1 2056 128 128 10


grimoire commented on May 31, 2024

We turn off streaming during benchmarking because it hurts performance (in particular, the Python bindings run into GIL-type issues). As a result, each inference actually returns all of its tokens at once, which is what you are seeing.
When building an application, streaming is turned on for a better user experience.
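
For illustration, a minimal sketch of the difference (reusing the stream_infer generator API from the example earlier; show_partial and prompt_ids are hypothetical):

```python
# Sketch: `generator` comes from tm_model.create_instance() as above;
# prompt_ids is a placeholder list of token ids.
# Streaming on: the generator yields incrementally, so the UI can show
# tokens as they arrive.
for outputs in generator.stream_infer(session_id=1,
                                      input_ids=prompt_ids,
                                      request_output_len=128):
    show_partial(outputs)  # hypothetical UI callback

# Streaming off (the benchmark setting): the same loop body runs once,
# with all requested tokens in a single yield, which is why
# first_token_latency matches the total latency in the stats above.
```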


leizhao1234 commented on May 31, 2024

How did you benchmark llama-65b's performance? profile_generation.py OOMs for me.


grimoire commented on May 31, 2024

The 65b model is served with tp (tensor parallelism). profile_generation.py originally benchmarked serving and was later switched to the Python FFI. You can dig up the old profile_generation.py and try that, or try the branch that adds tp: #82

