Comments (7)
@SunMarc I think there might still be some gaps in how the kv-cache is handled during inference. Specifically, the link you sent is about vision models, not text generation.
We should chat more about this; I'd love to see the techniques here integrated.
from gpt-fast.
Thanks for the interest! We already support most of the optimizations described here:
- torch.compile (see the PyTorch blog post)
- 4-bit quantization with GPTQ, and more recently AWQ, which is faster
- Speculative decoding.
Yes, absolutely! cc @younesbelkada for visibility
Most of these features are already supported in Lit-GPT (if you're looking to finetune LLMs), and more will be supported soon. You can use LLMs from the HF model hub.
These optimizations should already be in HF. Moreover, hardware-specific optimizations, like writing custom CUDA kernels for GPTQ, and paged attention (e.g. flash_attn2), would make inference even faster.
https://github.com/turboderp/exllamav2 has benchmarked llama-7b at 190+ t/s on a single 3090 Ti, which matches this repo on 8xA100, yet a 3090 Ti has only about 1/3 the FLOPS of a single A100. So hardware optimization is another driver.
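One reason a 3090 Ti can get close despite far fewer FLOPS is that single-stream decode is memory-bandwidth bound, not compute bound: each generated token reads roughly all model weights once. A back-of-envelope roofline, using approximate public spec figures (not measurements):

```python
# Roofline for decode throughput: tokens/s <= bandwidth / weight_bytes,
# since every token streams (roughly) all weights from memory.
# All numbers are approximate public specs, used for illustration.

PARAMS = 7e9                 # llama-7b parameter count
BYTES_PER_PARAM = 0.5        # 4-bit weights (ignoring group scales)
BANDWIDTH_3090TI = 1008e9    # RTX 3090 Ti memory bandwidth, bytes/s

weight_bytes = PARAMS * BYTES_PER_PARAM        # ~3.5 GB
ceiling = BANDWIDTH_3090TI / weight_bytes      # ~288 t/s
utilization = 190 / ceiling                    # observed vs. ceiling

print(f"ceiling ~{ceiling:.0f} t/s; 190 t/s is {utilization:.0%} of it")
```

By this estimate the reported 190 t/s is around two thirds of the card's bandwidth ceiling, which is why kernel quality (fused dequant, attention kernels) matters more than raw FLOPS here.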
Hi, does torch.compile work with AWQ?
(HF seems to support AWQ already, but its quantization scheme might not be the same as this repo's.)
How do I enable speculative decoding in HF?
https://github.com/turboderp/exllamav2 has benchmarked llama-7b at 190+ t/s on a single 3090 Ti, which matches this repo on 8xA100, yet a 3090 Ti has only about 1/3 the FLOPS of a single A100.
To be clear, the benchmark in this repo is 197 t/s on a single A100 with a group size of 32, while exllamav2 is running on a single 4090 with a group size of 128.
Still certainly very good results from exllamav2 :)
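The group-size difference is not free: with int4 quantization, each group of weights carries its own scale, so smaller groups mean more bytes per weight and a lower bandwidth-bound throughput ceiling, in exchange for accuracy. A rough calculation, assuming one fp16 scale per group and ignoring zero-points (real formats vary):

```python
# Storage cost of the quantization group size: 4 bits per weight
# plus one fp16 (16-bit) scale per group. Zero-points are ignored,
# so this understates real formats slightly.

def bits_per_weight(group_size, weight_bits=4, scale_bits=16):
    return weight_bits + scale_bits / group_size

g32 = bits_per_weight(32)    # 4.5 bits/weight
g128 = bits_per_weight(128)  # 4.125 bits/weight
print(f"{g32} vs {g128}: {g32 / g128 - 1:.1%} more bytes at group size 32")
```

So the group-size-32 run reads roughly 9% more weight bytes per token than the group-size-128 run, which makes the two benchmarks not directly comparable.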
Related Issues (20)
- Questions on Speculative Decoding in gpt-fast generate.py HOT 2
- What happens to bias during int8 quantization? HOT 3
- Try Tensor Parallel on a server equipped with two V100 linked by NVLINK, but got a performance degradation HOT 8
- batching/dynamic batching HOT 1
- Question about the generated code of `WeightOnlyInt8Linear` HOT 5
- AMD RX 7900 XTX Wrong outputs
- Speculative decoding with draft model:TinyLlama-1.1B
- Can't quantize to int4 and can't compile on RTX2080Ti HOT 2
- Int4 perplexity
- index out of range: No transformer config could be loaded HOT 1
- Reducing Latency in Application with Torch Compilation: Initialization and Inference Optimization
- int4/int4-gptq support in Mixtral 8x7B HOT 2
- CUDA error if enabling compile_prefill for quantization model (int8) HOT 3
- RuntimeError: CUDA error: named symbol not found HOT 1
- Size mismatch error occurs when loading models quantized by GPTQ HOT 1
- `eval.py` uses older version of lm_eval HOT 1
- Can GPT-Fast support larger batch sizes HOT 3
- I try to speed up llava, but it is slower than eager mode, why?
- pass@1 score extremely low using GPT-fast API HOT 2
- Bandwidth achieved for INT8 is much smaller than FP16 HOT 3