
Comments (6)

pommedeterresautee commented on August 20, 2024

Thank you @kobzaond for your report, it's very interesting.

Below is what comes to my mind, don't hesitate to let me know what you think about it.

Regarding the generate vs fixed input comparison, it's an interesting experiment. With the generate function you have smaller inputs at the beginning and larger ones at the end, which implies a different trade-off between cache overhead and computation-graph optimization. In the end, the input size (6 or 500 tokens) and the output size lead to different profiles. I thought that using a fixed size would make things clearer.

The reason why I didn't include it is that on the machine/GPU where I made the measures... there is no difference on Hugging Face with / without cache (and I am not alone in this situation).
The most obvious reason for that is that the T4 is an old, cheap and not very performant cloud GPU. Don't get me wrong, it's an awesome GPU when you take the price into account, a much better bet than doing inference on CPU. The measures have been done on a 3090 RTX, as displayed at the beginning of the notebook.
I suspect that on last-generation GPUs the cache overhead (copying/concatenating arrays, etc.) is not especially improved, but the computation is.
Regarding num_beams=5: it basically implies that inference is done with a batch size of 5 (one per sequence path), so I imagine it just increases the discrepancy you see with greedy search.
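
To make the comparison concrete, here is a rough timing sketch using the Hugging Face generate function. The checkpoint, prompt, generation length and run count below are placeholder assumptions, not the notebook's exact benchmark setup:

```python
import time

import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

# Assumptions: a CUDA GPU is available and the plain "gpt2" checkpoint with greedy
# search is representative enough; the notebook uses its own model and parameters.
device = "cuda"
tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2").eval().to(device)
inputs = tokenizer("The quick brown fox", return_tensors="pt").to(device)


def seconds_per_sequence(use_cache: bool, new_tokens: int = 256, runs: int = 5) -> float:
    # Warm-up run so lazy CUDA initialization does not pollute the measure.
    model.generate(**inputs, max_new_tokens=new_tokens, do_sample=False, use_cache=use_cache)
    torch.cuda.synchronize()
    start = time.perf_counter()
    for _ in range(runs):
        model.generate(**inputs, max_new_tokens=new_tokens, do_sample=False, use_cache=use_cache)
    torch.cuda.synchronize()
    return (time.perf_counter() - start) / runs


print(f"with cache:    {seconds_per_sequence(True):.2f} s/sequence")
print(f"without cache: {seconds_per_sequence(False):.2f} s/sequence")
```

Passing num_beams=5 in the same generate calls would show the batch-of-5 effect mentioned above.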

To finish, something to keep in mind:
https://developer.nvidia.com/blog/optimizing-t5-and-gpt-2-for-real-time-inference-with-tensorrt/

TensorRT optimizes the self-attention block by pointwise layer fusion:

  • Reduction is fused with power ops (for LayerNorm and residual-add layer).
  • Scale is fused with softmax.
  • GEMM is fused with ReLU/GELU activations.

Additionally, TensorRT also optimizes the network for inference:

  • Eliminating transpose ops.
  • Fusing the three KQV projections into a single GEMM.
  • When FP16 mode is specified, controlling layer-wise precisions to preserve accuracy while running the most compute-intensive ops in FP16.

Some fusions, like fusing the three KQV projections into a single GEMM, won't work with cache.
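
Just to illustrate what that particular fusion means, here is a toy PyTorch sketch of the idea (purely illustrative, not TensorRT's actual kernel): the three per-layer projection GEMMs are replaced by a single wider GEMM whose output is split into Q, K and V.

```python
import torch
import torch.nn as nn

hidden = 768  # GPT-2 small hidden size

# Unfused: three separate GEMMs per attention layer.
q_proj = nn.Linear(hidden, hidden)
k_proj = nn.Linear(hidden, hidden)
v_proj = nn.Linear(hidden, hidden)

# Fused: one GEMM producing Q, K and V in a single pass.
qkv_proj = nn.Linear(hidden, 3 * hidden)

x = torch.randn(1, 16, hidden)  # (batch, seq_len, hidden)
q, k, v = qkv_proj(x).chunk(3, dim=-1)
```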

I will add something to the notebook (code + explanation) regarding that point: 1/ to show that on a fast GPU the cache has no effect (with the generate function, so it's a real use case), and 2/ that depending on your hardware it may be different.


kobzaond commented on August 20, 2024

Thank you for such a quick answer!

Yeah, it could be because of the GPU. I am using AWS, and the T4 seems like one of the best GPUs for inference they offer. You are lucky with your GPU, because without past_key_values things are much easier :D

If you could add some more benchmarks of the generate method with and without cache, that would be awesome!

There is another thing I've found in the demo: when you use TensorRT with cache, you are not building a new TensorRT engine, so I suppose it is the same engine that doesn't work with cache. I don't know what is really happening here, but to me it seems that the engine cannot take advantage of the precomputed keys and values.

In the demo, you don't optimize the ONNX model that uses cache. I tried it but it failed, and I guess that is the reason why it is not in the demo. So I would like to ask whether you think there is a way to speed up GPT-2 generation (w.r.t. PyTorch) on the T4 GPU. When I run the model with CUDAExecutionProvider the performance is roughly the same as PyTorch, and when I use it without cache the performance goes down.
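
For reference, this is roughly how such a session is created; the model path and input name below are placeholders, and a cache-enabled export would additionally expect past_key_values tensors:

```python
import numpy as np
import onnxruntime as ort

# Assumption: "gpt2.onnx" is a GPT-2 export whose inputs include input_ids
# (and possibly attention_mask / past_key_values, depending on how it was exported).
session = ort.InferenceSession(
    "gpt2.onnx",
    providers=["CUDAExecutionProvider", "CPUExecutionProvider"],
)

input_ids = np.array([[464, 2068, 7586, 21831]], dtype=np.int64)
logits = session.run(None, {"input_ids": input_ids})[0]
print(logits.shape)  # (batch, seq_len, vocab_size)
```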


pommedeterresautee commented on August 20, 2024

Both the TensorRT and ONNX Runtime kernel fusion patterns are hard-coded.
The ONNX Runtime ones are specific to each model (you can see them in the fusion_* files in https://github.com/microsoft/onnxruntime/tree/master/onnxruntime/python/tools/transformers/), and for TensorRT it's a bit hard to know what it does because its core component is closed source.

I have not tried to optimize the model with cache because the self-attention-related optimizations will not work, for obvious reasons, and the performance without cache was surprisingly similar.

To accelerate things on a T4, you can try INT8 quantization on the TensorRT side (the T4 natively supports quantization with its tensor cores, but older GPUs like the V100 or P100 do not). On the ONNX Runtime side, I would not optimize the graph and would just check the results with FP16 and cache support; usually that alone gives half of the total performance boost. If that is not enough, you can disable each optimization one by one to find the culprit.
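
A rough sketch of the FP16 path on the ONNX Runtime side, assuming the onnxruntime.transformers optimizer API and GPT-2 small dimensions (the file names are placeholders and exact argument names can vary between onnxruntime versions):

```python
from onnxruntime.transformers.optimizer import optimize_model

# Assumptions: "gpt2.onnx" is the exported model and GPT-2 small dimensions apply.
opt = optimize_model("gpt2.onnx", model_type="gpt2", num_heads=12, hidden_size=768)

# FP16 conversion alone often accounts for a large share of the speed-up on tensor-core GPUs.
opt.convert_float_to_float16()
opt.save_model_to_file("gpt2_fp16.onnx")
```

Disabling individual fusions to track down a culprit goes through the optimizer's fusion options; the exact flags depend on the onnxruntime version.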


kobzaond commented on August 20, 2024

I'm reading your update in the GPT-2 demo notebook, and I find it very interesting that for GPT-2 (PyTorch) you get

Pytorch with cache: 2.35s/sequence
Pytorch without cache: 2.32s/sequence

while I get

Pytorch with cache: 2.01s/sequence
Pytorch without cache: 3.41s/sequence

The surprise is that I got slightly faster inference with cache than you got without cache; I'd have thought that with a faster GPU you would get lower numbers here. Could it be because of memory bandwidth? Would you mind sharing your GPU model?


pommedeterresautee commented on August 20, 2024

It's a 3090; all details are at the beginning of the notebook:
https://github.com/ELS-RD/transformer-deploy/blob/main/demo/generative-model/gpt2.ipynb



pommedeterresautee commented on August 20, 2024

Closing because of no activity.
Moreover, in the T5 notebook the cache is used in a generative model, and it can serve as a model for the GPT-2 case.

