P40 int8 inference is too slow · baichuan-13b · 12 comments · CLOSED

baichuan-inc commented on May 17, 2024

int8 inference on the P40 is too slow.

from baichuan-13b.

Comments (12)

shesung commented on May 17, 2024 2

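int8 compute is not being used at all. Here, quantization only compresses how the parameters are stored; the computation still runs in fp16/fp32. Most acceleration libraries today, LLM.int8() for example, are built on tensor cores. The P40 accelerates int8 through the DP4A instruction, which belongs to a completely different instruction family from tensor cores, so I doubt these libraries will support Pascal GPUs in the future either. Better to switch to a 20-series or newer card sooner rather than later...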
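To make the point about storage versus compute concrete, here is a minimal pure-Python sketch of weight-only int8 quantization. It is not Baichuan's implementation; the function names and the per-row absmax scheme are assumptions chosen for illustration. The weights are stored as int8, but every multiply-add still happens in floating point after a dequantization step.

```python
# Hypothetical helpers: a sketch of weight-only int8 quantization, not the project's code.

def quantize_row(row):
    """Symmetric absmax quantization: one weight row -> int8 values plus a float scale."""
    absmax = max(abs(w) for w in row)
    scale = absmax / 127.0 if absmax > 0 else 1.0
    q = [max(-127, min(127, round(w / scale))) for w in row]
    return q, scale

def dequantize_row(q, scale):
    """Expand int8 values back to floats before they can be used."""
    return [v * scale for v in q]

def linear_weight_only_int8(x, q_rows, scales):
    """Compute y = W x where W is stored as int8 but the arithmetic is floating point."""
    out = []
    for q, scale in zip(q_rows, scales):
        w = dequantize_row(q, scale)  # int8 -> float: extra work on every call
        out.append(sum(wi * xi for wi, xi in zip(w, x)))  # float multiply-adds, no int8 math
    return out

weights = [[0.5, -1.0, 0.25], [2.0, 0.0, -0.5]]
packed = [quantize_row(r) for r in weights]
q_rows = [q for q, _ in packed]
scales = [s for _, s in packed]
y = linear_weight_only_int8([1.0, 2.0, 4.0], q_rows, scales)
```

The result is close to the unquantized product, with a small rounding error. Storage shrinks to one byte per weight, while the arithmetic cost stays the same and a conversion step is added on top.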

luanshaotong commented on May 17, 2024 1

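> May I ask how much physical memory you have? With a single P40 and 32 GB I still run out of memory, and I have never gotten it to start. Also, the P40 shouldn't support fp16, right?

The host has 96 GB of RAM. During my test, loading used roughly 40 GB of memory.

The P40 is compatible with fp16; it runs at the same speed as fp32 (and memory use may also be the same as fp32).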
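As a rough sanity check on the memory figures in this exchange, the sketch below estimates weight storage alone at different precisions. The parameter count of 13 billion is an assumption taken from the model name, and the estimate ignores activations, the KV cache, and any temporary copies made while loading.

```python
# Back-of-the-envelope estimate of weight storage only (assumed figures, not measurements).
BYTES_PER_PARAM = {"fp32": 4, "fp16": 2, "int8": 1}

def weight_gib(num_params, dtype):
    """Return the size of the weights in GiB for the given storage dtype."""
    return num_params * BYTES_PER_PARAM[dtype] / 2**30

NUM_PARAMS = 13_000_000_000  # assumption: roughly 13 billion parameters

for dtype in ("fp32", "fp16", "int8"):
    print(f"{dtype}: {weight_gib(NUM_PARAMS, dtype):.1f} GiB")
```

Under these assumptions, halving the bytes per parameter halves the weight footprint, which is why int8 storage is attractive on a memory-limited card even when it does not speed anything up.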

luanshaotong commented on May 17, 2024

~~Also, it looks like the cuDNN library is not installed; does that matter?~~ That turned out to be a misunderstanding: cuDNN is installed.

jianghaiqun commented on May 17, 2024

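May I ask how much physical memory you have? With a single P40 and 32 GB I still run out of memory, and I have never gotten it to start.
Also, the P40 shouldn't support fp16, right?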

jameswu2014 commented on May 17, 2024

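The slow speed is expected: the current implementation uses mixed precision, and its main goal is to save GPU memory. If host memory is insufficient, try adjusting the swap area and see whether that helps.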

luanshaotong commented on May 17, 2024

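> The slow speed is expected: the current implementation uses mixed precision, and its main goal is to save GPU memory. If host memory is insufficient, try adjusting the swap area and see whether that helps.

@jameswu2014 Thank you very much, that makes it clear. Are there plans to compute directly in int8 later, or to adopt another acceleration approach such as FasterTransformer?

We are iterating on this. Please stay tuned, thanks.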

mynewstart commented on May 17, 2024

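> The slow speed is expected: the current implementation uses mixed precision, and its main goal is to save GPU memory. If host memory is insufficient, try adjusting the swap area and see whether that helps.

> @jameswu2014 Thank you very much, that makes it clear. Are there plans to compute directly in int8 later, or to adopt another acceleration approach such as FasterTransformer?

Is it slow because the intermediate computation is still stored in fp16 and only the model parameters were converted to int8? And if the intermediate results are stored in fp16, why is the speed not roughly the same as the unquantized model? Where is the time mainly being lost?

The time is lost in the int8 -> fp16 conversion, that is, dequantization. We will keep iterating on this. Please stay tuned, thanks.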
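A back-of-the-envelope way to see why dequantization can dominate is sketched below. It assumes the whole weight matrix of a linear layer is converted from int8 to fp16 once per forward call, which the thread does not spell out, and the layer size of 5120 is only an illustrative value. The function name is hypothetical.

```python
def dequant_ops_per_mac(out_features, in_features, tokens):
    """Ratio of int8 -> fp16 conversions to multiply-accumulates for one linear layer."""
    conversions = out_features * in_features   # one conversion per stored weight
    macs = out_features * in_features * tokens  # multiply-accumulates in the matmul
    return conversions / macs

# Token-by-token decoding processes a single token per forward call.
decode = dequant_ops_per_mac(5120, 5120, tokens=1)
# Processing a long prompt amortizes the same conversions over many tokens.
prefill = dequant_ops_per_mac(5120, 5120, tokens=512)
```

Under these assumptions, single-token decoding pays one conversion for every multiply-accumulate, so the conversion work is on the same order as the matrix multiply itself, whereas a long prompt spreads that cost across many tokens.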

Qbuer commented on May 17, 2024

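> int8 compute is not being used at all. Here, quantization only compresses how the parameters are stored; the computation still runs in fp16/fp32. Most acceleration libraries today, LLM.int8() for example, are built on tensor cores. The P40 accelerates int8 through the DP4A instruction, which belongs to a completely different instruction family from tensor cores, so I doubt these libraries will support Pascal GPUs in the future either. Better to switch to a 20-series or newer card sooner rather than later...

@shesung What do you mean by "int8 compute is not being used"? Do you mean int8 optimizations at the level of the GPU instruction set?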

shesung commented on May 17, 2024

@Qbuer Yes. The int8 acceleration instruction on 10-series cards is DP4A, and most LLM acceleration libraries do not support it.
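For readers unfamiliar with DP4A, the following pure-Python sketch emulates what the instruction computes: a dot product of four signed 8-bit integers packed into each 32-bit operand, added to a 32-bit accumulator. The helper names are hypothetical, and the emulation does not model 32-bit overflow wrap-around.

```python
import struct

def pack_int8x4(values):
    """Pack four signed 8-bit integers into one signed 32-bit word (little-endian)."""
    return struct.unpack("<i", struct.pack("<4b", *values))[0]

def unpack_int8x4(word):
    """Split a signed 32-bit word back into its four signed 8-bit lanes."""
    return list(struct.unpack("<4b", struct.pack("<i", word)))

def dp4a(a_word, b_word, acc):
    """Emulate DP4A: 4-way int8 dot product accumulated into an integer (no overflow wrap)."""
    a = unpack_int8x4(a_word)
    b = unpack_int8x4(b_word)
    return acc + sum(x * y for x, y in zip(a, b))

a = pack_int8x4([1, -2, 3, -4])
b = pack_int8x4([5, 6, -7, 8])
result = dp4a(a, b, 10)  # 5 - 12 - 21 - 32 + 10
```

A library has to emit this instruction explicitly in its kernels to benefit from it, which is a separate code path from the tensor-core matrix instructions discussed in this thread.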

mynewstart commented on May 17, 2024

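> int8 compute is not being used at all. Here, quantization only compresses how the parameters are stored; the computation still runs in fp16/fp32. Most acceleration libraries today, LLM.int8() for example, are built on tensor cores. The P40 accelerates int8 through the DP4A instruction, which belongs to a completely different instruction family from tensor cores, so I doubt these libraries will support Pascal GPUs in the future either. Better to switch to a 20-series or newer card sooner rather than later...

@shesung Could you explain a bit more? Does the current A100 support int8 compute? Is the issue that the instruction set the acceleration libraries target differs from what the A100 supports?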

shesung commented on May 17, 2024

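@mynewstart The A100 supports it. Mainstream LLM acceleration libraries are almost all built on tensor cores, so nearly every card from the V100 onward supports int8 acceleration.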

mynewstart commented on May 17, 2024

@shesung Thanks for the answer! I previously ran inference on an A100 with AutoModelForCausalLM.from_pretrained(model_name, device_map="auto", load_in_8bit=True). Why did it not feel any faster, and in fact turn out slower? What causes this?
