
Comments (6)

pommedeterresautee commented on August 20, 2024

Thank you @kobzaond for your report, it's very interesting.

Below is what comes to my mind, don't hesitate to let me know what you think about it.

Regarding the generate vs fixed input comparison, it's an interesting experiment. With the generate function you have smaller inputs at the beginning and larger ones at the end, which implies a different trade-off between cache overhead and computation-graph optimization. In the end, the input size (6 or 500 tokens) and the output size lead to different profiles. I thought that using a fixed size would make things clearer.

The reason why I didn't include it is that on the machine/GPU where I made the measures... there is no difference on Hugging Face with / without cache (and I am not alone in this situation).
The most obvious reason for that is that the T4 is an old, cheap and not very performant cloud GPU. Don't get me wrong, it's an awesome GPU when you take the price into account, a much better bet than doing inference on CPU. The measures have been done on a 3090 RTX, as displayed at the beginning of the notebook.
I suspect that on last-generation GPUs the cache overhead (copying/concatenating arrays, etc.) is not especially improved, but the computation is.
Regarding num_beams=5: it basically implies that inference is done with a batch size of 5 (one per sequence path), so I imagine it just increases the discrepancy you see with greedy search.
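
To make the comparison concrete, here is a rough timing sketch using the Hugging Face generate function. The checkpoint, prompt, generation length and run count below are placeholder assumptions, not the notebook's exact benchmark setup:

```python
import time

import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

# Assumptions: a CUDA GPU is available and the plain "gpt2" checkpoint with greedy
# search is representative enough; the notebook uses its own model and parameters.
device = "cuda"
tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2").eval().to(device)
inputs = tokenizer("The quick brown fox", return_tensors="pt").to(device)


def seconds_per_sequence(use_cache: bool, new_tokens: int = 256, runs: int = 5) -> float:
    # Warm-up run so lazy CUDA initialization does not pollute the measure.
    model.generate(**inputs, max_new_tokens=new_tokens, do_sample=False, use_cache=use_cache)
    torch.cuda.synchronize()
    start = time.perf_counter()
    for _ in range(runs):
        model.generate(**inputs, max_new_tokens=new_tokens, do_sample=False, use_cache=use_cache)
    torch.cuda.synchronize()
    return (time.perf_counter() - start) / runs


print(f"with cache:    {seconds_per_sequence(True):.2f} s/sequence")
print(f"without cache: {seconds_per_sequence(False):.2f} s/sequence")
```

Passing num_beams=5 in the same generate calls would show the batch-of-5 effect mentioned above.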

To finish, something to keep in mind:
https://developer.nvidia.com/blog/optimizing-t5-and-gpt-2-for-real-time-inference-with-tensorrt/

TensorRT optimizes the self-attention block by pointwise layer fusion:

  • Reduction is fused with power ops (for LayerNorm and residual-add layer).
  • Scale is fused with softmax.
  • GEMM is fused with ReLU/GELU activations.

Additionally, TensorRT also optimizes the network for inference:

  • Eliminating transpose ops.
  • Fusing the three KQV projections into a single GEMM.
  • When FP16 mode is specified, controlling layer-wise precisions to preserve accuracy while running the most compute-intensive ops in FP16.

Some fusions, like fusing the three KQV projections into a single GEMM, won't work with cache.
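
Just to illustrate what that particular fusion means, here is a toy PyTorch sketch of the idea (purely illustrative, not TensorRT's actual kernel): the three per-layer projection GEMMs are replaced by a single wider GEMM whose output is split into Q, K and V.

```python
import torch
import torch.nn as nn

hidden = 768  # GPT-2 small hidden size

# Unfused: three separate GEMMs per attention layer.
q_proj = nn.Linear(hidden, hidden)
k_proj = nn.Linear(hidden, hidden)
v_proj = nn.Linear(hidden, hidden)

# Fused: one GEMM producing Q, K and V in a single pass.
qkv_proj = nn.Linear(hidden, 3 * hidden)

x = torch.randn(1, 16, hidden)  # (batch, seq_len, hidden)
q, k, v = qkv_proj(x).chunk(3, dim=-1)
```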

I will add something to the notebook (code + explanation) regarding that point: 1/ to show that on a fast GPU the cache has no effect (with the generate function, so it's a real use case), and 2/ that depending on your hardware it may be different.


kobzaond commented on August 20, 2024

Thank you for such a quick answer!

Yeah, it could be because of the GPU. I am using AWS, and the T4 seems like one of the best GPUs for inference they offer. You are lucky with your GPU, because without past_key_values things are much easier :D

If you could add some more benchmarks of the generate method with and without cache, that would be awesome!

There is another thing I've found in the demo: when you use TensorRT with cache, you are not building a new TensorRT engine, so I suppose it is the same engine that doesn't work with cache. I don't know what is really happening here, but to me it seems that the engine cannot take advantage of the precomputed keys and values.

In the demo, you don't optimize the ONNX model that uses cache. I tried it but it failed, and I guess that is the reason why it is not in the demo. So I would like to ask whether you think there is a way to speed up GPT-2 generation (w.r.t. PyTorch) on the T4 GPU. When I run the model with CUDAExecutionProvider the performance is roughly the same as PyTorch, and when I use it without cache the performance goes down.
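
For reference, this is roughly how such a session is created; the model path and input name below are placeholders, and a cache-enabled export would additionally expect past_key_values tensors:

```python
import numpy as np
import onnxruntime as ort

# Assumption: "gpt2.onnx" is a GPT-2 export whose inputs include input_ids
# (and possibly attention_mask / past_key_values, depending on how it was exported).
session = ort.InferenceSession(
    "gpt2.onnx",
    providers=["CUDAExecutionProvider", "CPUExecutionProvider"],
)

input_ids = np.array([[464, 2068, 7586, 21831]], dtype=np.int64)
logits = session.run(None, {"input_ids": input_ids})[0]
print(logits.shape)  # (batch, seq_len, vocab_size)
```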


pommedeterresautee commented on August 20, 2024

Both the TensorRT and ONNX Runtime kernel fusion patterns are hard-coded.
The ONNX Runtime ones are specific to each model (you can see them in the fusion_* files in https://github.com/microsoft/onnxruntime/tree/master/onnxruntime/python/tools/transformers/), and for TensorRT it's a bit hard to know what it does because its core component is closed source.

I have not tried to optimize the model with cache because the self-attention-related optimizations will not work, for obvious reasons, and the performance without cache was surprisingly similar.

To accelerate things on a T4, you can try INT8 quantization on the TensorRT side (the T4 natively supports quantization with its tensor cores, but older GPUs like the V100 or P100 do not). On the ONNX Runtime side, I would not optimize the graph and would just check the results with FP16 and cache support; usually that alone gives half of the total performance boost. If that is not enough, you can disable each optimization one by one to find the culprit.
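
A rough sketch of the FP16 path on the ONNX Runtime side, assuming the onnxruntime.transformers optimizer API and GPT-2 small dimensions (the file names are placeholders and exact argument names can vary between onnxruntime versions):

```python
from onnxruntime.transformers.optimizer import optimize_model

# Assumptions: "gpt2.onnx" is the exported model and GPT-2 small dimensions apply.
opt = optimize_model("gpt2.onnx", model_type="gpt2", num_heads=12, hidden_size=768)

# FP16 conversion alone often accounts for a large share of the speed-up on tensor-core GPUs.
opt.convert_float_to_float16()
opt.save_model_to_file("gpt2_fp16.onnx")
```

Disabling individual fusions to track down a culprit goes through the optimizer's fusion options; the exact flags depend on the onnxruntime version.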


kobzaond commented on August 20, 2024

I'm reading your update in the GPT-2 demo notebook, and I find it very interesting that for GPT-2 (PyTorch) you get

Pytorch with cache: 2.35s/sequence
Pytorch without cache: 2.32s/sequence

while I get

Pytorch with cache: 2.01s/sequence
Pytorch without cache: 3.41s/sequence

The surprise is that I got slightly faster inference with cache than you got without cache; I'd have thought that with a faster GPU you would get lower numbers here. Could it be because of memory bandwidth? Would you mind sharing your GPU model?


pommedeterresautee commented on August 20, 2024

It's a 3090; all details are at the beginning of the notebook:
https://github.com/ELS-RD/transformer-deploy/blob/main/demo/generative-model/gpt2.ipynb



pommedeterresautee commented on August 20, 2024

Closing because of no activity.
Moreover, in the T5 notebook the cache is used in a generative model, and it can serve as a model for the GPT-2 case.

