
Comments (11)

grimoire commented on May 31, 2024

The turbomind.chat interface does not do batching.
If you want batched processing, build one tm.TurboMind object, then call create_instance in different threads to create per-thread instances. The weights are shared, and the inputs are automatically assembled into a batch inside turbomind.
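
For concreteness, a minimal sketch of that pattern (the constructor and stream_infer parameters follow that era's Python API and may differ by version; '/workspace' and the token-id lists are placeholders):

```python
import threading

from lmdeploy import turbomind as tm

# Weights are loaded once in the TurboMind object and shared by all instances.
tm_model = tm.TurboMind(model_path='/workspace')

prompts = [[1, 2023, 338], [1, 3575, 294]]  # pre-tokenized inputs (placeholders)

def worker(session_id, input_ids):
    # Each thread creates its own instance; turbomind gathers the
    # concurrent requests into a batch internally.
    generator = tm_model.create_instance()
    for outputs in generator.stream_infer(session_id=session_id,
                                          input_ids=input_ids,
                                          request_output_len=128):
        pass  # consume outputs here

threads = [threading.Thread(target=worker, args=(i, ids))
           for i, ids in enumerate(prompts, start=1)]
for t in threads:
    t.start()
for t in threads:
    t.join()
```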


leizhao1234 commented on May 31, 2024

OK, thanks. Then how should batch inference be tested, or how can I reproduce your benchmark results?


grimoire commented on May 31, 2024

https://github.com/InternLM/lmdeploy/blob/main/benchmark/profile_generation.py
You can use this tool; the principle is the same as described above.
If you are building an application, you can use the serving approach; serving supports multi-batch.
Suggestions on how the Python interface should support multi-batch are also welcome.
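
A run might look like the line below (my reading of the positional arguments is model_path, model_name, concurrency, session_len, input tokens, output tokens, and test rounds; check the script for the exact order):

```
python3 benchmark/profile_generation.py /workspace llama 8 2056 128 128 10
```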


leizhao1234 commented on May 31, 2024

OK, thanks. What do concurrency and session_len each mean?


grimoire commented on May 31, 2024

concurrency is the degree of concurrency; you can think of it as the maximum batch size during profiling. session_len is the maximum length of a session; it is actually fixed at deploy time, so the parameter has no effect here.


leizhao1234 commented on May 31, 2024

OK. In https://github.com/InternLM/lmdeploy/blob/main/benchmark/profile_generation.py#L40, what does token on that line mean?


grimoire commented on May 31, 2024

It is the number of tokens the current thread outputs in one inference pass; it is used to compute token/s.
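
For illustration, given per-thread records shaped like [first_token_latency, token_count, total_latency] (the shape of the stats pasted below; the field order is my reading of the script), the rate works out as:

```python
# Each record: [first_token_latency_s, generated_tokens, total_latency_s]
# (field order is an assumption based on the stats pasted below)
stats = [[1.5224, 128, 1.5224], [1.5220, 128, 1.5220]]

for first_token_latency, tokens, total_latency in stats:
    print(f'{tokens / total_latency:.1f} token/s, '
          f'first token after {first_token_latency:.3f}s')
```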


leizhao1234 commented on May 31, 2024

[[1.5223611899418756, 128, 1.5223611899418756], [1.522044335026294, 128, 1.522044335026294], [1.5217270280700177, 128, 1.5217270280700177], [1.5217442339053378, 128, 1.5217442339053378], [1.522081473027356, 128, 1.522081473027356], [1.5219722180627286, 128, 1.5219722180627286], [1.5227985020028427, 128, 1.5227985020028427], [1.521835200022906, 128, 1.521835200022906], [1.5213377749314532, 128, 1.5213377749314532], [1.5221316709648818, 128, 1.5221316709648818]]

Also, it is very strange that first_token_latency is almost identical to token_latency. My command is python3 lmdeploy/benchmark/profile_generation.py /workspace llama 1 2056 128 128 10


grimoire commented on May 31, 2024

We turn off streaming during benchmarking because it hurts performance (in particular, the Python bindings run into GIL-type issues). As a result, each inference actually returns all of its tokens at once, which is what you are seeing.
When building an application, streaming is turned on for a better user experience.
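
For illustration, a minimal sketch of the difference (reusing the stream_infer generator API from the example earlier; show_partial and prompt_ids are hypothetical):

```python
# Sketch: `generator` comes from tm_model.create_instance() as above;
# prompt_ids is a placeholder list of token ids.
# Streaming on: the generator yields incrementally, so the UI can show
# tokens as they arrive.
for outputs in generator.stream_infer(session_id=1,
                                      input_ids=prompt_ids,
                                      request_output_len=128):
    show_partial(outputs)  # hypothetical UI callback

# Streaming off (the benchmark setting): the same loop body runs once,
# with all requested tokens in a single yield, which is why
# first_token_latency matches the total latency in the stats above.
```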


leizhao1234 commented on May 31, 2024

How did you benchmark llama-65b's performance? profile_generation.py OOMs for me.


grimoire commented on May 31, 2024

The 65b model is served with tp (tensor parallelism). profile_generation.py originally benchmarked serving and was later switched to the Python FFI. You can dig up the old profile_generation.py and try that, or try the branch that adds tp: #82

