els-rd / transformer-deploy

Efficient, scalable and enterprise-grade CPU/GPU inference server for 🤗 Hugging Face transformer models 🚀

Home Page: https://els-rd.github.io/transformer-deploy/

License: Apache License 2.0


transformer-deploy's Introduction

Hugging Face Transformer submillisecond inference️ and deployment to production: 🤗 → 🤯


Optimize and deploy in production 🤗 Hugging Face Transformer models in a single command line.

=> Up to 10X faster inference! <=

Why this tool?

At Lefebvre Dalloz we run semantic search engines in production in the legal domain; in non-marketing language, it's a re-ranker, and ours is based on Transformer models.
In this setup, latency is key to a good user experience, and relevancy inference is done online for hundreds of snippets per user query.
We have tested many solutions, and below is what we found:

Pytorch + FastAPI = 🐢
Most tutorials on Transformer deployment in production are built on Pytorch and FastAPI. Both are great tools, but neither is very performant for inference (actual measurements below).

Microsoft ONNX Runtime + Nvidia Triton inference server = ️🏃💨
Then, if you spend some time, you can build something on top of ONNX Runtime and the Triton inference server. You will usually get 2X to 4X faster inference compared to vanilla Pytorch. It's cool!

Nvidia TensorRT + Nvidia Triton inference server = ⚡️🏃💨💨
However, if you want best-in-class performance on GPU, there is only one possible combination: Nvidia TensorRT and Triton. You will usually get 5X faster inference compared to vanilla Pytorch.
Sometimes it can rise up to 10X faster inference.
Buuuuttt... TensorRT takes some effort to master; it requires tricks that are not easy to come up with, and we implemented them for you!

Detailed tool comparison table

Features

  • Heavily optimize transformer models for inference (CPU and GPU) -> between 5X and 10X speedup
  • deploy models on Nvidia Triton inference servers (enterprise grade), 6X faster than FastAPI
  • add quantization support for both CPU and GPU
  • simple to use: optimization done in a single command line!
  • supported models: any model that can be exported to ONNX (-> most of them)
  • supported tasks: document classification, token classification (NER), feature extraction (aka sentence-transformers dense embeddings), text generation

Want to understand how it works under the hood?
Read 🤗 Hugging Face Transformer inference UNDER 1 millisecond latency 📖

Want to check for yourself in 3 minutes?

To get a rough idea of the kind of acceleration you will get on your own model, you can try the Docker-only runs below. For GPU runs, you need the Nvidia drivers and the NVIDIA Container Toolkit installed on your machine.

3 tasks are covered below:

  • classification,
  • feature extraction (text to dense embeddings),
  • text generation (GPT-2 style).

Moreover, we have added a GPU quantization notebook that you can open directly in Docker and play with.

First, clone the repo as some commands below expect to find the demo folder:

git clone git@github.com:ELS-RD/transformer-deploy.git
cd transformer-deploy
# pulling the docker image may take a few minutes
docker pull ghcr.io/els-rd/transformer-deploy:0.6.0


Classification/reranking (encoder model)

Classification is a common NLP task, and large language models have shown great results on it.
This task is also used in search engines to provide Google-like relevancy (cf. https://arxiv.org/abs/1901.04085).

Optimize existing model

This will optimize models, generate Triton configuration and Triton folder layout in a single command:

docker run -it --rm --gpus all \
  -v $PWD:/project ghcr.io/els-rd/transformer-deploy:0.6.0 \
  bash -c "cd /project && \
    convert_model -m \"philschmid/MiniLM-L6-H384-uncased-sst2\" \
    --backend tensorrt onnx \
    --seq-len 16 128 128"

# output:  
# ...
# Inference done on NVIDIA GeForce RTX 3090
# latencies:
# [Pytorch (FP32)] mean=5.43ms, sd=0.70ms, min=4.88ms, max=7.81ms, median=5.09ms, 95p=7.01ms, 99p=7.53ms
# [Pytorch (FP16)] mean=6.55ms, sd=1.00ms, min=5.75ms, max=10.38ms, median=6.01ms, 95p=8.57ms, 99p=9.21ms
# [TensorRT (FP16)] mean=0.53ms, sd=0.03ms, min=0.49ms, max=0.61ms, median=0.52ms, 95p=0.57ms, 99p=0.58ms
# [ONNX Runtime (FP32)] mean=1.57ms, sd=0.05ms, min=1.49ms, max=1.90ms, median=1.57ms, 95p=1.63ms, 99p=1.76ms
# [ONNX Runtime (optimized)] mean=0.90ms, sd=0.03ms, min=0.88ms, max=1.23ms, median=0.89ms, 95p=0.95ms, 99p=0.97ms
# Each infence engine output is within 0.3 tolerance compared to Pytorch output

It will output the mean latency and other statistics.
Nvidia TensorRT is usually the fastest option, with ONNX Runtime a strong second.
For ONNX Runtime, optimized means that kernel fusion and mixed precision are enabled.
Pytorch is never competitive on transformer inference, even with mixed precision, whatever the model size.
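
You can also run the same ONNX Runtime optimization step from Python instead of the convert_model CLI. Below is a hedged sketch of the library's optimize_onnx helper as it appears in the demo notebooks; the paths and model dimensions are placeholders, and the exact signature may differ between versions.

from transformer_deploy.backends.ort_utils import optimize_onnx

# kernel fusion + mixed precision, i.e. the "optimized" configuration measured above
optimize_onnx(
    onnx_path="model.onnx",                  # raw ONNX export (placeholder path)
    onnx_optim_model_path="model-opt.onnx",  # where the optimized graph is saved
    fp16=True,                               # enable mixed precision
    use_cuda=True,
    num_attention_heads=12,                  # model-specific values (placeholders)
    hidden_size=384,
    architecture="bert",
)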

Run Nvidia Triton inference server

Note that we install transformers at runtime.
For production, it's advisable to build your own 3-line Docker image with transformers pre-installed.

docker run -it --rm --gpus all -p8000:8000 -p8001:8001 -p8002:8002 --shm-size 256m \
  -v $PWD/triton_models:/models nvcr.io/nvidia/tritonserver:22.07-py3 \
  bash -c "pip install transformers && tritonserver --model-repository=/models"

# output:
# ...
# I0207 09:58:32.738831 1 grpc_server.cc:4195] Started GRPCInferenceService at 0.0.0.0:8001
# I0207 09:58:32.739875 1 http_server.cc:2857] Started HTTPService at 0.0.0.0:8000
# I0207 09:58:32.782066 1 http_server.cc:167] Started Metrics Service at 0.0.0.0:8002

Query inference

Query ONNX models (replace transformer_onnx_inference with transformer_tensorrt_inference to query the TensorRT engine):

curl -X POST  http://localhost:8000/v2/models/transformer_onnx_inference/versions/1/infer \
  --data-binary "@demo/infinity/query_body.bin" \
  --header "Inference-Header-Content-Length: 161"

# output:
# {"model_name":"transformer_onnx_inference","model_version":"1","parameters":{"sequence_id":0,"sequence_start":false,"sequence_end":false},"outputs":[{"name":"output","datatype":"FP32","shape":[1,2],"data":[-3.431640625,3.271484375]}]}

The model output is at the end of the JSON (the data field). See the Triton client documentation for how to query the server from Python and other languages.
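
For instance, with the official Triton Python client (pip install tritonclient[http]). This is a hedged sketch: the tensor names TEXT and output are assumptions based on the demo configuration, check the generated triton_models/*/config.pbtxt for the exact names.

import numpy as np
import tritonclient.http as httpclient

client = httpclient.InferenceServerClient(url="127.0.0.1:8000")
text = np.array(["This live event is great. I will sign-up for Infinity."], dtype=object)
text_input = httpclient.InferInput("TEXT", [1], "BYTES")  # input name is an assumption
text_input.set_data_from_numpy(text)
output = httpclient.InferRequestedOutput("output", binary_data=False)
result = client.infer(model_name="transformer_onnx_inference", inputs=[text_input], outputs=[output])
print(result.as_numpy("output"))  # logits, e.g. [[-3.43, 3.27]]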

To get very low latency inference directly in your Python code (no inference server), see the project documentation.
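
A hedged sketch of what that looks like with the library's ONNX Runtime helpers (the helper names come from transformer_deploy.backends.ort_utils; the model path, output name and default arguments below are assumptions, check the documentation for the exact API):

import torch
from transformers import AutoTokenizer
from transformer_deploy.backends.ort_utils import create_model_for_provider, inference_onnx_binding

tokenizer = AutoTokenizer.from_pretrained("philschmid/MiniLM-L6-H384-uncased-sst2")
# path to the optimized ONNX file produced by convert_model (placeholder)
model = create_model_for_provider(path="model-opt.onnx", provider_to_use="CUDAExecutionProvider")

encodings = tokenizer("This live event is great.", return_tensors="pt")
inputs = {k: v.to("cuda") for k, v in encodings.items()}
# some versions take extra arguments (binding, clone_tensor); defaults are assumed here
outputs = inference_onnx_binding(model_onnx=model, inputs=inputs, device="cuda")
print(outputs["output"])  # raw logits as a torch tensor ("output" name is an assumption)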

Token-classification (NER) (encoder model)

Token classification assigns a label to individual tokens in a sentence. One of the most common token classification tasks is Named Entity Recognition (NER). NER attempts to find a label for each entity in a sentence, such as a person, location, or organization.

Optimize existing model

This will optimize models, generate Triton configuration and Triton folder layout in a single command:

docker run -it --rm --gpus all \
  -v $PWD:/project ghcr.io/els-rd/transformer-deploy:0.6.0 \
  bash -c "cd /project && \
    convert_model -m \"kamalkraj/bert-base-cased-ner-conll2003\" \
    --backend tensorrt onnx \
    --seq-len 16 128 128 \
    --task token-classification"

# output:  
# ...
# Inference done on Tesla T4
# latencies:
# [Pytorch (FP32)] mean=8.24ms, sd=0.46ms, min=7.66ms, max=13.91ms, median=8.20ms, 95p=8.38ms, 99p=10.01ms
# [Pytorch (FP16)] mean=6.87ms, sd=0.44ms, min=6.69ms, max=13.05ms, median=6.78ms, 95p=7.33ms, 99p=8.86ms
# [TensorRT (FP16)] mean=2.33ms, sd=0.32ms, min=2.19ms, max=4.18ms, median=2.24ms, 95p=3.00ms, 99p=4.04ms
# [ONNX Runtime (FP32)] mean=8.08ms, sd=0.33ms, min=7.78ms, max=10.61ms, median=8.06ms, 95p=8.18ms, 99p=10.55ms
# [ONNX Runtime (optimized)] mean=2.57ms, sd=0.04ms, min=2.38ms, max=2.83ms, median=2.56ms, 95p=2.68ms, 99p=2.73ms
# Each infence engine output is within 0.3 tolerance compared to Pytorch output

It will output the mean latency and other statistics.
Nvidia TensorRT is usually the fastest option, with ONNX Runtime a strong second.
For ONNX Runtime, optimized means that kernel fusion and mixed precision are enabled.
Pytorch is never competitive on transformer inference, even with mixed precision, whatever the model size.

Run Nvidia Triton inference server

Note that we install transformers at runtime.
For production, it's advisable to build your own 3-line Docker image with transformers pre-installed.

docker run -it --rm --gpus all -p8000:8000 -p8001:8001 -p8002:8002 --shm-size 256m \
  -v $PWD/triton_models:/models nvcr.io/nvidia/tritonserver:22.07-py3 \
  bash -c "pip install transformers torch==1.12.0 -f https://download.pytorch.org/whl/cu116/torch_stable.html && \
  tritonserver --model-repository=/models"

# output:
# ...
# I0207 09:58:32.738831 1 grpc_server.cc:4195] Started GRPCInferenceService at 0.0.0.0:8001
# I0207 09:58:32.739875 1 http_server.cc:2857] Started HTTPService at 0.0.0.0:8000
# I0207 09:58:32.782066 1 http_server.cc:167] Started Metrics Service at 0.0.0.0:8002

Query inference

Query ONNX models (replace transformer_onnx_inference with transformer_tensorrt_inference to query the TensorRT engine):

curl -X POST  http://localhost:8000/v2/models/transformer_onnx_inference/versions/1/infer \
  --data-binary "@demo/infinity/query_body.bin" \
  --header "Inference-Header-Content-Length: 161"

# output:
# {"model_name":"transformer_onnx_inference","model_version":"1","outputs":[{"name":"output","datatype":"BYTES","shape":[],"data":["[{\"entity_group\": \"ORG\", \"score\": 0.9848777055740356, \"word\": \"Infinity\", \"start\": 45, \"end\": 53}]"]}]}

Question Answering (encoder model)

Question Answering models can retrieve the answer to a question from a given text, which is useful for searching for an answer in a document.

Optimize existing model

This will optimize models, generate Triton configuration and Triton folder layout in a single command:

docker run -it --rm --gpus all \
  -v $PWD:/project ghcr.io/els-rd/transformer-deploy:0.6.0 \
  bash -c "cd /project && \
    convert_model -m \"distilbert-base-cased-distilled-squad\" \
    --backend tensorrt onnx \
    --seq-len 16 128 384 \
    --task question-answering"

# output:  
# ...
# Inference done on Tesla T4
# latencies:
# [Pytorch (FP32)] mean=8.24ms, sd=0.46ms, min=7.66ms, max=13.91ms, median=8.20ms, 95p=8.38ms, 99p=10.01ms
# [Pytorch (FP16)] mean=6.87ms, sd=0.44ms, min=6.69ms, max=13.05ms, median=6.78ms, 95p=7.33ms, 99p=8.86ms
# [TensorRT (FP16)] mean=2.33ms, sd=0.32ms, min=2.19ms, max=4.18ms, median=2.24ms, 95p=3.00ms, 99p=4.04ms
# [ONNX Runtime (FP32)] mean=8.08ms, sd=0.33ms, min=7.78ms, max=10.61ms, median=8.06ms, 95p=8.18ms, 99p=10.55ms
# [ONNX Runtime (optimized)] mean=2.57ms, sd=0.04ms, min=2.38ms, max=2.83ms, median=2.56ms, 95p=2.68ms, 99p=2.73ms
# Each infence engine output is within 0.3 tolerance compared to Pytorch output

It will output the mean latency and other statistics.
Nvidia TensorRT is usually the fastest option, with ONNX Runtime a strong second.
For ONNX Runtime, optimized means that kernel fusion and mixed precision are enabled.
Pytorch is never competitive on transformer inference, even with mixed precision, whatever the model size.

Run Nvidia Triton inference server

Note that we install transformers at runtime.
For production, it's advisable to build your own 3-line Docker image with transformers pre-installed.

docker run -it --rm --gpus all -p8000:8000 -p8001:8001 -p8002:8002 --shm-size 1024m \
  -v $PWD/triton_models:/models nvcr.io/nvidia/tritonserver:22.07-py3 \
  bash -c "pip install transformers torch==1.12.0 -f https://download.pytorch.org/whl/cu116/torch_stable.html && \
  tritonserver --model-repository=/models"

# output:
# ...
# I0207 09:58:32.738831 1 grpc_server.cc:4195] Started GRPCInferenceService at 0.0.0.0:8001
# I0207 09:58:32.739875 1 http_server.cc:2857] Started HTTPService at 0.0.0.0:8000
# I0207 09:58:32.782066 1 http_server.cc:167] Started Metrics Service at 0.0.0.0:8002

Query inference

Query ONNX models (replace transformer_onnx_inference with transformer_tensorrt_inference to query the TensorRT engine):

curl -X POST  http://localhost:8000/v2/models/transformer_onnx_inference/versions/1/infer \
  --data-binary "@demo/question-answering/query_body.bin" \
  --header "Inference-Header-Content-Length: 276"

# output:
# {"model_name":"transformer_onnx_inference","model_version":"1","outputs":[{"name":"output","datatype":"BYTES","shape":[],"data":["{\"score\": 0.9925152659416199, \"start\": 34, \"end\": 40, \"answer\": \"Berlin\"}"]}]}

Check out demo/question-answering/query_bin_gen.ipynb to see how to generate the query_body.bin file. More inference examples can be found in demo/question-answering/.
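
For reference, such a binary body is simply a JSON inference header followed by the raw tensor bytes, with Inference-Header-Content-Length set to the JSON length (Triton binary data extension). A hedged sketch follows; the input/output names are placeholders, the notebook has the exact ones used by the demo model.

import json
import struct

def bytes_tensor(values):
    # Triton binary encoding of a BYTES tensor: 4-byte little-endian length + payload, per element
    return b"".join(struct.pack("<I", len(v.encode("utf-8"))) + v.encode("utf-8") for v in values)

question = bytes_tensor(["Where do I live?"])
context = bytes_tensor(["My name is Wolfgang and I live in Berlin"])

header = {
    "inputs": [
        {"name": "QUESTION", "shape": [1], "datatype": "BYTES", "parameters": {"binary_data_size": len(question)}},
        {"name": "CONTEXT", "shape": [1], "datatype": "BYTES", "parameters": {"binary_data_size": len(context)}},
    ],
    "outputs": [{"name": "output", "parameters": {"binary_data": False}}],
}
header_bytes = json.dumps(header).encode("utf-8")
with open("query_body.bin", "wb") as f:
    f.write(header_bytes + question + context)
print(f"Inference-Header-Content-Length: {len(header_bytes)}")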

Feature extraction / dense embeddings

Feature extraction in NLP is the task of converting text into dense embeddings.
It has gained traction as a robust way to improve search engine relevancy (by increasing recall).
This project supports models from sentence-transformers and requires version >= 2.2.0 of the sentence-transformers library.
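
As a reminder of what the optimized model has to reproduce, this is how sentence-transformers itself scores passages against a query with dense embeddings (illustrative example; the sentences are placeholders):

from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("sentence-transformers/msmarco-distilbert-cos-v5")
query = model.encode("how fast is transformer inference on GPU?")
passages = model.encode([
    "TensorRT can bring large speedups to transformer inference on GPU.",
    "The recipe calls for two eggs and a cup of flour.",
])
print(util.cos_sim(query, passages))  # higher score = more relevant passage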

Optimize existing model

docker run -it --rm --gpus all \
  -v $PWD:/project ghcr.io/els-rd/transformer-deploy:0.6.0 \
  bash -c "cd /project && \
    convert_model -m \"sentence-transformers/msmarco-distilbert-cos-v5\" \
    --backend tensorrt onnx \
    --task embedding \
    --seq-len 16 128 128"

# output:
# ...
# Inference done on NVIDIA GeForce RTX 3090
# latencies:
# [Pytorch (FP32)] mean=5.19ms, sd=0.45ms, min=4.74ms, max=6.64ms, median=5.03ms, 95p=6.14ms, 99p=6.26ms
# [Pytorch (FP16)] mean=5.41ms, sd=0.18ms, min=5.26ms, max=8.15ms, median=5.36ms, 95p=5.62ms, 99p=5.72ms
# [TensorRT (FP16)] mean=0.72ms, sd=0.04ms, min=0.69ms, max=1.33ms, median=0.70ms, 95p=0.78ms, 99p=0.81ms
# [ONNX Runtime (FP32)] mean=1.69ms, sd=0.18ms, min=1.62ms, max=4.07ms, median=1.64ms, 95p=1.86ms, 99p=2.44ms
# [ONNX Runtime (optimized)] mean=1.03ms, sd=0.09ms, min=0.98ms, max=2.30ms, median=1.00ms, 95p=1.15ms, 99p=1.41ms
# Each infence engine output is within 0.3 tolerance compared to Pytorch output

Run Nvidia Triton inference server

docker run -it --rm --gpus all -p8000:8000 -p8001:8001 -p8002:8002 --shm-size 256m \
  -v $PWD/triton_models:/models nvcr.io/nvidia/tritonserver:22.07-py3 \
  bash -c "pip install transformers && tritonserver --model-repository=/models"

# output:
# ...
# I0207 11:04:33.761517 1 grpc_server.cc:4195] Started GRPCInferenceService at 0.0.0.0:8001
# I0207 11:04:33.761844 1 http_server.cc:2857] Started HTTPService at 0.0.0.0:8000
# I0207 11:04:33.803373 1 http_server.cc:167] Started Metrics Service at 0.0.0.0:8002

Query inference

curl -X POST  http://localhost:8000/v2/models/transformer_onnx_inference/versions/1/infer \
  --data-binary "@demo/infinity/query_body.bin" \
  --header "Inference-Header-Content-Length: 161"

# output:
# {"model_name":"transformer_onnx_inference","model_version":"1","parameters":{"sequence_id":0,"sequence_start":false,"sequence_end":false},"outputs":[{"name":"output","datatype":"FP32","shape":[1,768],"data":[0.06549072265625,-0.04327392578125,0.1103515625,-0.007320404052734375,...

Generate text (decoder model)

Text generation seems to be the way to go in NLP.
Unfortunately, these models are slow to run; below we will accelerate the most famous of them: GPT-2.

GPT example

We will start with a GPT-2 model example, then in the next section we will use a T5 model.

Optimize existing model

As before, the command below prepares the Triton inference server assets.
One point to keep in mind is that Triton runs:

  • the inference engines (ONNX Runtime and TensorRT),
  • the Python code in charge of the decoding part; this code delegates model management to the Triton server (a simplified sketch is shown below).

The Python code is in ./triton_models/transformer_tensorrt_generate/1/model.py
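
To give an idea of its structure, here is a heavily simplified, hedged sketch of such a model.py: a Triton Python backend that tokenizes, calls the engine hosted by Triton through a BLS request, and decodes greedily. Tensor and model names are illustrative; the real implementation in the repository handles more details (attention masks, batching, error handling).

import numpy as np
import triton_python_backend_utils as pb_utils  # provided inside the Triton container
from transformers import AutoTokenizer


class TritonPythonModel:
    def initialize(self, args):
        self.tokenizer = AutoTokenizer.from_pretrained("gpt2")

    def execute(self, requests):
        responses = []
        for request in requests:
            text = pb_utils.get_input_tensor_by_name(request, "TEXT").as_numpy()[0].decode("utf-8")
            input_ids = self.tokenizer(text, return_tensors="np").input_ids.astype(np.int32)
            for _ in range(64):  # greedy search, 64 new tokens (the default described later)
                # delegate the forward pass to the model managed by Triton (BLS call)
                engine_request = pb_utils.InferenceRequest(
                    model_name="transformer_tensorrt_model",  # illustrative name
                    requested_output_names=["logits"],
                    inputs=[pb_utils.Tensor("input_ids", input_ids)],
                )
                logits = pb_utils.get_output_tensor_by_name(engine_request.exec(), "logits").as_numpy()
                next_token = logits[:, -1, :].argmax(axis=-1).reshape(-1, 1).astype(np.int32)
                input_ids = np.concatenate([input_ids, next_token], axis=1)
            generated = self.tokenizer.decode(input_ids[0], skip_special_tokens=True)
            output = pb_utils.Tensor("output", np.array([generated], dtype=object))
            responses.append(pb_utils.InferenceResponse(output_tensors=[output]))
        return responses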

docker run -it --rm --gpus all \
  -v $PWD:/project ghcr.io/els-rd/transformer-deploy:0.6.0 \
  bash -c "cd /project && \
    convert_model -m gpt2 \
    --backend tensorrt onnx \
    --seq-len 6 256 256 \
    --task text-generation"

# output:
# ...
# Inference done on NVIDIA GeForce RTX 3090
# latencies:
# [Pytorch (FP32)] mean=9.43ms, sd=0.59ms, min=8.95ms, max=15.02ms, median=9.33ms, 95p=10.38ms, 99p=12.46ms
# [Pytorch (FP16)] mean=9.92ms, sd=0.55ms, min=9.50ms, max=15.06ms, median=9.74ms, 95p=10.96ms, 99p=12.26ms
# [TensorRT (FP16)] mean=2.19ms, sd=0.18ms, min=2.06ms, max=3.04ms, median=2.10ms, 95p=2.64ms, 99p=2.79ms
# [ONNX Runtime (FP32)] mean=4.99ms, sd=0.38ms, min=4.68ms, max=9.09ms, median=4.78ms, 95p=5.72ms, 99p=5.95ms
# [ONNX Runtime (optimized)] mean=3.93ms, sd=0.40ms, min=3.62ms, max=6.53ms, median=3.81ms, 95p=4.49ms, 99p=5.79ms
# Each infence engine output is within 0.3 tolerance compared to Pytorch output

Two detailed notebooks are available in the demo/generative-model folder.

Optimize existing large model

To optimize models which typically don't fit twice onto a single GPU, run the script as follows:

docker run -it --rm --shm-size=24g --ulimit memlock=-1 --ulimit stack=67108864 --gpus all \
  -v $PWD:/project ghcr.io/els-rd/transformer-deploy:0.6.0 \
  bash -c "cd /project && \
    convert_model -m gpt2-medium \
    --backend tensorrt onnx \
    --seq-len 6 256 256 \
    --fast \
    --atol 3 \
    --task text-generation"

The larger the model gets, the more likely it is that you need to also increase the absolute tolerance of the script. Additionally, some models may return a message similar to: Converted FP32 value in weights (either FP32 infinity or FP32 value outside FP16 range) to corresponding FP16 infinity. It is best to test and evaluate the model afterwards to understand the implications of this conversion.

Depending on the model size, this may take a really long time. GPT Neo 2.7B can easily take an hour or more to convert.

Run Nvidia Triton inference server

To run the decoding algorithm server side, we need to install Pytorch in the Triton Docker image.

docker run -it --rm --gpus all -p8000:8000 -p8001:8001 -p8002:8002 --shm-size 8g \
  -v $PWD/triton_models:/models nvcr.io/nvidia/tritonserver:22.07-py3 \
  bash -c "pip install transformers torch==1.12.0 -f https://download.pytorch.org/whl/cu116/torch_stable.html && \
  tritonserver --model-repository=/models"

# output:
# ...
# I0207 10:29:19.091191 1 grpc_server.cc:4195] Started GRPCInferenceService at 0.0.0.0:8001
# I0207 10:29:19.091417 1 http_server.cc:2857] Started HTTPService at 0.0.0.0:8000
# I0207 10:29:19.132902 1 http_server.cc:167] Started Metrics Service at 0.0.0.0:8002

Query inference

Replace transformer_onnx_generate with transformer_tensorrt_generate to query the TensorRT engine.

curl -X POST  http://localhost:8000/v2/models/transformer_onnx_generate/versions/1/infer \
  --data-binary "@demo/infinity/query_body.bin" \
  --header "Inference-Header-Content-Length: 161"

# output:
# {"model_name":"transformer_onnx_generate","model_version":"1","outputs":[{"name":"output","datatype":"BYTES","shape":[],"data":["This live event is great. I will sign-up for Infinity.\n\nI'm going to be doing a live stream of the event.\n\nI"]}]}

OK, the output is not very interesting (💩 in -> 💩 out), but you get the idea.
The source code of the generative model is in ./triton_models/transformer_tensorrt_generate/1/model.py.
You may want to tweak it to your needs (the default is greedy search with 64 output tokens).

Python code

You may be interested in running optimized text generation on Python directly, without using any inference server:

docker run -p 8888:8888 -v $PWD/demo/generative-model:/project ghcr.io/els-rd/transformer-deploy:0.6.0 \
  bash -c "cd /project && jupyter notebook --ip 0.0.0.0 --port 8888 --no-browser --allow-root"

T5-small example

In this section we will present the t5-small model conversion.

Optimize existing large model

To optimize the model, run the script as follows:

docker run -it --rm --shm-size=24g --ulimit memlock=-1 --ulimit stack=67108864 --gpus all \
  -v $PWD:/project ghcr.io/els-rd/transformer-deploy:0.6.0 \
  bash -c "cd /project && \
    convert_model -m t5-small \
    --backend onnx \
    --seq-len 16 256 256 \
    --task text-generation \
    --nb-measures 100 \
    --generative-model t5 \
    --output triton_models"

Run Nvidia Triton inference server

To run the decoding algorithm server side, we need to install Pytorch in the Triton Docker image.

docker run -it --rm --gpus all -p8000:8000 -p8001:8001 -p8002:8002 --shm-size 8g \
  -v $PWD/triton_models/:/models nvcr.io/nvidia/tritonserver:22.07-py3 \
  bash -c "pip install onnx onnxruntime-gpu transformers==4.21.3 git+https://github.com/ELS-RD/transformer-deploy torch==1.12.0 -f https://download.pytorch.org/whl/cu116/torch_stable.html && \
  tritonserver --model-repository=/models"

To test text generation, you can try this request:

curl -X POST http://localhost:8000/v2/models/t5_model_generate/versions/1/infer --data-binary "@demo/generative-model/t5_query_body.bin" --header "Inference-Header-Content-Length: 181"

# output:
# {"model_name":"t5_model_generate","model_version":"1","outputs":[{"name":"OUTPUT_TEXT","datatype":"BYTES","shape":[],"data":["Mein Name mein Wolfgang Wolfgang und ich wohne in Berlin."]}]}

Query inference

Replace transformer_onnx_generate with transformer_tensorrt_generate to query the TensorRT engine.

curl -X POST  http://localhost:8000/v2/models/transformer_onnx_inference/versions/1/infer \
  --data-binary "@demo/infinity/seq2seq_query_body.bin" \
  --header "Inference-Header-Content-Length: 176"

Model quantization on GPU

Quantization is a generic method to get a 2X speedup on top of the other inference optimizations.
GPU quantization of transformers is almost never used because it requires modifying the model source code.

We have implemented in this library a mechanism that patches the Hugging Face transformers library to support quantization.
It makes it easy to use.
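
The calibration flow looks roughly like the hedged sketch below, based on the QATCalibrate helper used in the quantization notebook; the import path, model and calibration data are assumptions, the notebook has the authoritative version.

import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer
from transformer_deploy.QDQModels.calibration_utils import QATCalibrate  # import path may vary by version

model = AutoModelForSequenceClassification.from_pretrained("roberta-base").cuda()
tokenizer = AutoTokenizer.from_pretrained("roberta-base")
calibration_texts = ["replace me with a few hundred sentences from your training data"]

with QATCalibrate(method="histogram", percentile=99.999) as qat:
    qat.setup_model_qat(model)  # patches the model so QDQ (quantize/dequantize) nodes are inserted
    with torch.no_grad():
        for text in calibration_texts:
            inputs = tokenizer(text, return_tensors="pt", padding=True, truncation=True, max_length=512)
            model(**{k: v.cuda() for k, v in inputs.items()})  # forward passes collect activation ranges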

To play with it, open this notebook:

docker run -p 8888:8888 -v $PWD/demo/quantization:/project ghcr.io/els-rd/transformer-deploy:0.6.0 \
  bash -c "cd /project && jupyter notebook --ip 0.0.0.0 --port 8888 --no-browser --allow-root"

See our documentation for detailed instructions on how to use the package, including setup, GPU quantization support and Nvidia Triton inference server deployment.

transformer-deploy's People

Contributors

averkij, ayoub-louati, fursovia, jonathlela, kamalkraj, pommedeterresautee, sam-writer, thytu

transformer-deploy's Issues

Build OnnxRuntime Error

Hi, I was following the compilation steps in the t5 notebook to build OnnxRuntime. But when I run
git checkout -b fix_if e1c04eed29d48f295de1cfbd48713158537cdaa7, the output is:
fatal: reference is not a tree: e1c04eed29d48f295de1cfbd48713158537cdaa7.

Previously I ran git clone --recursive https://github.com/Microsoft/onnxruntime followed by cd onnxruntime, as suggested.

I wonder what the reason is?

big performance difference on tensorRT

Hi, I just tried the demo code below. In your results, the [TensorRT (FP16)] latency is much better than the others. However, the results I got are quite different: there is not such a big difference between [TensorRT (FP16)] and the others (the output is attached). I wonder if you know what happened or how I can figure out the reason for that. Thank you.

docker run -it --rm --gpus all \
  -v $PWD:/project ghcr.io/els-rd/transformer-deploy:0.4.0 \
  bash -c "cd /project && \
    convert_model -m \"philschmid/MiniLM-L6-H384-uncased-sst2\" \
    --backend tensorrt onnx \
    --seq-len 16 128 128"
Inference done on Tesla M60
latencies:
[Pytorch (FP32)] mean=6.31ms, sd=1.32ms, min=4.48ms, max=10.75ms, median=6.39ms, 95p=8.63ms, 99p=9.33ms
[Pytorch (FP16)] mean=8.81ms, sd=2.02ms, min=6.59ms, max=55.42ms, median=8.70ms, 95p=11.20ms, 99p=12.16ms
[TensorRT (FP16)] mean=4.59ms, sd=1.97ms, min=2.27ms, max=10.38ms, median=4.47ms, 95p=8.02ms, 99p=8.86ms
[ONNX Runtime (FP32)] mean=5.03ms, sd=2.00ms, min=2.64ms, max=10.45ms, median=5.16ms, 95p=8.37ms, 99p=9.17ms
[ONNX Runtime (optimized)] mean=5.19ms, sd=2.04ms, min=2.80ms, max=10.59ms, median=5.25ms, 95p=8.67ms, 99p=9.40ms

"faster inference compared to vanilla Pytorch"-- Pytorch CPU or GPU?

Then, if you spend some time, you can build something over ONNX Runtime and Triton inference server. You will usually get from 2X to 4X faster inference compared to vanilla Pytorch. It's cool!

When you say "faster compared to vanilla Pytorch", are you saying faster than vanilla Pytorch in CPU or GPU?

convert to onnx model takes more time on CPU

Converting the model to ONNX with the device set to CPU takes around 8 minutes:

docker run -it --rm --gpus all \
  -v $PWD:/project ghcr.io/els-rd/transformer-deploy:0.4.0 \
  bash -c "cd /project && \
    convert_model -m \"distilbert-base-uncased-finetuned-sst-2-english\" \
    --backend onnx \
    --seq-len 16 128 128 --device cpu"

The same model is converted to ONNX using the GPU in less than 2 minutes:

docker run -it --rm --gpus all \
  -v $PWD:/project ghcr.io/els-rd/transformer-deploy:0.4.0 \
  bash -c "cd /project && \
    convert_model -m \"distilbert-base-uncased-finetuned-sst-2-english\" \
    --backend onnx \
    --seq-len 16 128 128"

Using the Hugging Face default ONNX converter only takes ~20s:

python -m transformers.onnx --model=distilbert-base-uncased-finetuned-sst-2-english --feature sequence-classification    onnx/

@pommedeterresautee

error in triton config files

After running convert.py on a sentence-transformers model, the Triton config along with the ONNX model gets generated inside the triton_models directory.

However, while serving the model using the same config, the error below is observed.

UNAVAILABLE: Invalid argument: model 'transformer_onnx_model', tensor 'output': the model expects 2 dimensions (shape [-1,-1]) but the model configuration specifies 2

quantized model inference slower with Triton server than inference directly in python code (notebook)

I followed the guide quantization_end_to_end.ipynb to quantize my custom classification model based on Roberta-base (12 layers) and finally got the following latency measures from different backends; the results look good:
batch size = 1
seq_length = 128
device = GPU Tesla T4

[Pytorch (FP32)] mean=9.66ms, sd=0.19ms, min=9.38ms, max=10.08ms, median=9.60ms, 95p=10.07ms, 99p=10.08ms
[Pytorch (FP16)] mean=9.84ms, sd=0.25ms, min=9.77ms, max=12.27ms, median=9.80ms, 95p=9.93ms, 99p=10.01ms

[ONNX (FP32)] mean=8.41ms, sd=0.70ms, min=7.92ms, max=10.01ms, median=8.08ms, 95p=10.00ms, 99p=10.01ms
[ONNX (FP16)] mean=2.57ms, sd=0.07ms, min=2.40ms, max=3.10ms, median=2.57ms, 95p=2.62ms, 99p=2.63ms

[TensorRT (FP16)] mean=2.54ms, sd=0.46ms, min=2.32ms, max=3.94ms, median=2.39ms, 95p=3.92ms, 99p=3.93ms
[TensorRT (INT-8)] mean=1.98ms, sd=0.17ms, min=1.87ms, max=3.57ms, median=1.95ms, 95p=2.06ms, 99p=2.30ms

But after I deployed the ONNX FP16 model and the TensorRT models (both FP16 and INT-8) with Triton Server and then did a stress test with JMeter, the results showed that the TensorRT INT-8 model is not faster than the FP16 model:
batch size = 1
seq_length = 128
threads = 20
GPU utilization between 93%~95%

Triton server with TensorRT INT-8 model, throughput = 398.7/sec
Triton server with TensorRT FP16 model, throughput = 399.1/sec
Triton server with ONNX FP16 model, throughput = 363.3/sec

I just wonder what's wrong with the Triton Server; does it have an "int8 inference" option that I didn't turn on?

Export of Large Models Fails: onnx2trt_utils.cpp:1571

This issue is a direct consequence of: onnx/onnx-tensorrt#818

/usr/local/lib/python3.8/dist-packages/transformers/models/gpt2/modeling_gpt2.py:196: TracerWarning: Converting a tensor to a Python float might cause the trace to be incorrect. We can't record the data flow of Python values, so this value will be treated as a constant in the future. This means that the trace might not generalize to other inputs!
  attn_weights = attn_weights / (float(value.size(-1)) ** 0.5)
[03/14/2022-09:03:29] [TRT] [E] onnx2trt_utils.cpp:1571: Failed to open file: transformer.wte.weight
[03/14/2022-09:03:29] [TRT] [E] 4: [network.cpp::validate::2633] Error Code 4: Internal Error (Network must have at least one output)
[03/14/2022-09:03:29] [TRT] [E] 2: [builder.cpp::buildSerializedNetwork::609] Error Code 2: Internal Error (Assertion enginePtr != nullptr failed. )
Traceback (most recent call last):
  File "/usr/local/bin/convert_model", line 8, in <module>
    sys.exit(entrypoint())
  File "/usr/local/lib/python3.8/dist-packages/transformer_deploy/convert.py", line 357, in entrypoint
    main(commands=args)
  File "/usr/local/lib/python3.8/dist-packages/transformer_deploy/convert.py", line 262, in main
    engine: ICudaEngine = build_engine(
  File "/usr/local/lib/python3.8/dist-packages/transformer_deploy/backends/trt_utils.py", line 126, in build_engine
    engine: ICudaEngine = runtime.deserialize_cuda_engine(trt_engine)
TypeError: deserialize_cuda_engine(): incompatible function arguments. The following argument types are supported:
    1. (self: tensorrt.tensorrt.Runtime, serialized_engine: buffer) -> tensorrt.tensorrt.ICudaEngine

Invoked with: <tensorrt.tensorrt.Runtime object at 0x7fe431b73a30>, None

Steps to reproduce:

Works:

docker run -it --rm --gpus device=6 \
    --shm-size=1g --ulimit memlock=-1 --ulimit stack=67108864 \
    -v $PWD:/project ghcr.io/els-rd/transformer-deploy:0.4.0 \
    bash -c "cd /project && \
        convert_model -m distilgpt2 \
        --backend tensorrt \
        --seq-len 1 128 128 \
        --task text-generation"

Doesn't work:

docker run -it --rm --gpus device=6 \
    --shm-size=1g --ulimit memlock=-1 --ulimit stack=67108864 \
    -v $PWD:/project ghcr.io/els-rd/transformer-deploy:0.4.0 \
    bash -c "cd /project && \
        convert_model -m gpt2-large \
        --backend tensorrt \
        --seq-len 1 128 128 \
        --task text-generation"

Speed difference ONNX vs TensorRT with samples sorted by sequence length

I noticed something unexpected when comparing two scenarios for a model converted via ONNX and TensorRT (distilroberta with classification head):

  1. Scenario: I use a dataset with varying sentence lengths (~20-60 tokens) and run it randomly sampled through both models
  2. Scenario: I use the same dataset but sort the sentences by sentence length (decreasing) before running it through both models

Result: The TensorRT model does not seem to care about the sequence lengths and keeps the same speed for both scenarios. The ONNX model, however, gets almost twice as fast when I use the second scenario.

I was wondering if TensorRT's optimization somehow requires padding to the max length internally. I searched for a parameter or a reason for this behavior but couldn't find anything useful. For conversion, I set the seq-len parameter to 1 60 60.

I was wondering if perhaps someone else has already observed this and knows the reason / a solution.

GPT-2 pipeline?

Hello,
thank you for this wonderful implementation.

Do you have any plans to implement a notebook with GPT-2 support?
It seems that there would be huge speed benefit, especially with smaller sequence lengths and higher batches.

got an error in optimize_onnx when running the gpt2 file from demo/generative-model

Getting an error when running this code:
logging.basicConfig()
logging.getLogger().setLevel(logging.INFO)
num_attention_heads, hidden_size = get_model_size(path=model_name)
optimize_onnx(
    onnx_path="test-gpt2.onnx",
    onnx_optim_model_path="test-gpt2-opt.onnx",
    fp16=True,
    use_cuda=True,
    num_attention_heads=num_attention_heads,
    hidden_size=hidden_size,
    architecture='gpt2'
)

INFO:fusion_base:Fused LayerNormalization count: 25
INFO:fusion_base:Fused FastGelu count: 12

failed in shape inference <class 'AssertionError'>
failed in shape inference <class 'AssertionError'>
failed in shape inference <class 'AssertionError'>

INFO:onnx_model:Graph pruned: 0 inputs, 0 outputs and 720 nodes are removed
INFO:onnx_model_gpt2:postprocess: remove Reshape count:72
INFO:fusion_base:Fused FastGelu(add bias) count: 12
INFO:onnx_model_bert:opset verion: 13


AssertionError Traceback (most recent call last)

in ()
9 num_attention_heads=num_attention_heads,
10 hidden_size=hidden_size,
---> 11 architecture='gpt2'
12 )

7 frames

/usr/local/lib/python3.7/dist-packages/onnxruntime/transformers/../tools/symbolic_shape_infer.py in add_suggested_merge(self, symbols, apply)
    209
    210     def add_suggested_merge(self, symbols, apply=False):
--> 211         assert all([(type(s) == str and s in self.symbolic_dims_) or is_literal(s) for s in symbols])
    212         symbols = set(symbols)
    213         for k, v in self.suggested_merge_.items():

AssertionError:

ONNX opset mismatch error.

Full error message

+--------------------------------+---------+----------------------------------------------------------------------------------------------------------------------------------+
| Model                          | Version | Status                                                                                                                           |
+--------------------------------+---------+----------------------------------------------------------------------------------------------------------------------------------+
| transformer_onnx_model         | 1       | UNAVAILABLE: Internal: onnx runtime error 1: Load model from /models/transformer_onnx_model/1/model.bin failed:/workspace/onnxru |
|                                |         | ntime/onnxruntime/core/graph/model_load_utils.h:47 void onnxruntime::model_load_utils::ValidateOpsetForDomain(const std::unorder |
|                                |         | ed_map<std::__cxx11::basic_string<char>, int>&, const onnxruntime::logging::Logger&, bool, const string&, int) ONNX Runtime only |
|                                |         |  *guarantees* support for models stamped with official released onnx opset versions. Opset 3 is under development and support fo |
|                                |         | r this is limited. The operator schemas and or other functionality may change before next ONNX release and in this case ONNX Run |
|                                |         | time will not guarantee backward compatibility. Current official support for domain ai.onnx.ml is till opset 2.                  |
| transformer_onnx_tokenize      | 1       | READY                                                                                                                            |
| transformer_tensorrt_inference | 1       | READY                                                                                                                            |
| transformer_tensorrt_model     | 1       | READY                                                                                                                            |
| transformer_tensorrt_tokenize  | 1       | READY                                                                                                                            |
+--------------------------------+---------+----------------------------------------------------------------------------------------------------------------------------------+

cmd to reproduce

git clone git@github.com:ELS-RD/transformer-deploy.git
cd transformer-deploy
# Build
make docker_build
# Generate model
docker run -it --rm --gpus all \
  -v $PWD:/project ghcr.io/els-rd/transformer-deploy:0.4.0 \
  bash -c "cd /project && pip install protobuf==3.20.1 && \
    convert_model -m \"philschmid/MiniLM-L6-H384-uncased-sst2\" \
    --backend tensorrt onnx \
    --seq-len 16 128 128"
# Run model
docker run -it --rm --gpus all -p8000:8000 -p8001:8001 -p8002:8002 --shm-size 256m \
  -v $PWD/triton_models:/models nvcr.io/nvidia/tritonserver:22.02-py3 \
  bash -c "pip install transformers && tritonserver --model-repository=/models"

@pommedeterresautee

Support for gpt2 quantization

I tried to quantize (add QDQ layers) the gpt2 model:

batch_size=8
        with QATCalibrate(method="histogram", percentile=99.999) as qat:
            model_q = self.model.cuda()
            qat.setup_model_qat(model_q)  # prepare quantizer to any model

            with torch.no_grad():
                for start_index in range(0, 650, batch_size):
                    end_index = start_index + batch_size
                    data = self.data[start_index:end_index]
                    data = self.tokenizer(data, return_tensors='pt', padding=True, truncation=True, max_length=512)
                    input_torch = {
                        k: torch.tensor(v, dtype=torch.long, device="cuda")
                        for k, v in data.items()
                        if k in ["input_ids", "attention_mask", "token_type_ids"]
                    }
                    model_q(**input_torch)

but no QDQ layers were inserted - I assume that you don't support GPT-2 yet. Do you plan to add it?

Installation instructions

The line

pip3 install .[GPU] -f https://download.pytorch.org/whl/cu113/torch_stable.html

did not work for me as written, but it did with quotes:

pip3 install ".[GPU]" -f https://download.pytorch.org/whl/cu113/torch_stable.html

Also, the instructions say

make docker_build

but the Makefile does not contain such a rule, only build_docker, so the command should probably be

make build_docker.

Installing transformers inside nvidia docker container

Trying to run triton inference server using

docker run --rm -p8005:8005 -p8003:8003 -p8004:8004 -v/home/test/triton-serve/server/docs/examples/model_repository/triton_models:/models nvcr.io/nvidia/tritonserver:21.12-py3 tritonserver --model-repository=/models

This gives the error below:
UNAVAILABLE: Internal: ModuleNotFoundError: No module named 'transformers'

'assert num_heads > 0' error with DistilBert

I get the following error when I try to optimize distilbert:

AssertionError                            Traceback (most recent call last)
<timed eval> in <module>

/opt/conda/lib/python3.7/site-packages/transformer_deploy/convert.py in main(input_args)
    245             onnx_path=onnx_model_path,
    246             onnx_optim_fp16_path=onnx_optim_fp16_path,
--> 247             use_cuda=True,
    248         )
    249         onnx_model = create_model_for_provider(path=onnx_optim_fp16_path, provider_to_use="CUDAExecutionProvider")

/opt/conda/lib/python3.7/site-packages/transformer_deploy/backends/ort_utils.py in optimize_onnx(onnx_path, onnx_optim_fp16_path, use_cuda)
     72         num_heads=0,  # automatic detection don't work with opset 13
     73         hidden_size=0,  # automatic detection
---> 74         optimization_options=optimization_options,
     75     )
     76 

/opt/conda/lib/python3.7/site-packages/onnxruntime/transformers/optimizer.py in optimize_model(input, model_type, num_heads, hidden_size, optimization_options, opt_level, use_gpu, only_onnxruntime)
    289 
    290     if not only_onnxruntime:
--> 291         optimizer.optimize(optimization_options)
    292 
    293     # Remove the temporary model.

/opt/conda/lib/python3.7/site-packages/onnxruntime/transformers/onnx_model_bert.py in optimize(self, options, add_dynamic_axes)
    317             if options is not None:
    318                 self.attention_mask.set_mask_format(options.attention_mask_format)
--> 319             self.fuse_attention()
    320 
    321         self.fuse_shape()

/opt/conda/lib/python3.7/site-packages/onnxruntime/transformers/onnx_model_bert.py in fuse_attention(self)
     52 
     53     def fuse_attention(self):
---> 54         self.attention_fusion.apply()
     55 
     56     def fuse_gelu(self):

/opt/conda/lib/python3.7/site-packages/onnxruntime/transformers/fusion_base.py in apply(self)
     41                     raise Exception("Can not find node in any graphs")
     42                 self.this_graph_name = graph.name
---> 43                 self.fuse(node, input_name_to_nodes, output_name_to_node)
     44 
     45         op_list = [node.op_type for node in self.nodes_to_add]

/opt/conda/lib/python3.7/site-packages/onnxruntime/transformers/fusion_attention.py in fuse(self, normalize_node, input_name_to_nodes, output_name_to_node)
    444             new_node = self.create_attention_node(mask_index, matmul_q, matmul_k, matmul_v, add_q, add_k, add_v,
    445                                                   q_num_heads, self.hidden_size, root_input,
--> 446                                                   attention_last_node.output[0], add_qk_str)
    447             if new_node is None:
    448                 return

/opt/conda/lib/python3.7/site-packages/onnxruntime/transformers/fusion_attention.py in create_attention_node(self, mask_index, q_matmul, k_matmul, v_matmul, q_add, k_add, v_add, num_heads, hidden_size, input, output, add_qk_str)
    161             Union[NodeProto, None]: the node created or None if failed.
    162         """
--> 163         assert num_heads > 0
    164 
    165         if hidden_size > 0 and (hidden_size % num_heads) != 0:

AssertionError: 

While trying to resolve the issue, I observed that it did not occur when the optimizer from onnxruntime-tools was used with opt_level 99 (instead of the one in onnxruntime.transformers). But the code then threw exceptions due to some skip layer normalization issues.

nvidia-pyindex installation

Installation of packages which depend on nvidia-pyindex fails if nvidia-pyindex is not installed before installing transformer_deploy.
My initial guess was that this occurs because setuptools does not install packages in the order specified in the extras_require argument. I tried adding the package to the setup_requires & install_requires arguments of setuptools.setup in setup.py, but it did not help.

T5 demo breaking

Hi, thank you for the recent updates on T5!

I have been testing the T5 demo notebook and I noticed a part breaking. The bitwise operator does not work with floating points. (torch.any(tensor < resolution & tensor > -resolution & tensor != 0))

Environment: Dockerfile master + pip install nvtx seaborn

  • Additional information:
    • nvidia-tensorrt==8.2.4.2
    • nvtx==0.2.5
    • onnx==1.11.0
    • nvidia-cublas-cu11==2022.4.8
    • nvidia-cublas-cu117==11.10.1.25
    • nvidia-cuda-runtime-cu11==2022.4.25
    • nvidia-cuda-runtime-cu117==11.7.60
    • nvidia-cudnn-cu11==2022.5.19
    • nvidia-cudnn-cu116==8.4.0.27

Steps to reproduce: Run T5 demo notebook to cell In[5]:

def get_random_input_encoder() -> Dict[str, torch.Tensor]:
    max_seq = 512
    seq_len = random.randint(a=1, b=max_seq)
    batch = max_seq // seq_len
    random_input_ids = torch.randint(
        low=0, high=tokenizer.vocab_size, size=(batch, seq_len), dtype=torch.int32, device="cuda"
    )
    inputs = {"input_ids": random_input_ids}
    return inputs


keep_fp32_encoder = get_keep_fp32_nodes(onnx_model_path=encoder_model_path, get_input=get_random_input_encoder)
assert len(keep_fp32_encoder) > 0
enc_model_onnx = convert_fp16(onnx_model=encoder_model_path, nodes_to_exclude=keep_fp32_encoder)
save_onnx(proto=enc_model_onnx, model_path=encoder_fp16_model_path)

del enc_model_onnx
torch.cuda.empty_cache()
gc.collect()

Result: Error

RuntimeError                              Traceback (most recent call last)
Input In [12], in <cell line: 12>()
      8     inputs = {"input_ids": random_input_ids}
      9     return inputs
---> 12 keep_fp32_encoder = get_keep_fp32_nodes(onnx_model_path=encoder_model_path, get_input=get_random_input_encoder)
     13 assert len(keep_fp32_encoder) > 0
     14 enc_model_onnx = convert_fp16(onnx_model=encoder_model_path, nodes_to_exclude=keep_fp32_encoder)

File /usr/local/lib/python3.8/dist-packages/transformer_deploy/backends/ort_utils.py:372, in get_keep_fp32_nodes(onnx_model_path, get_input, early_stop, device)
    368 inputs = get_input()
    369 outputs: Dict[str, torch.Tensor] = inference_onnx_binding(
    370     model_onnx=ort_model_fp32_all_nodes, inputs=inputs, device=device, binding=ort_binding, clone_tensor=False
    371 )
--> 372 keep_node_io = find_node_fp32(graph=output_mapping, output_nodes=outputs)
    374 nodes_to_add = [n for n in keep_node_io if n not in keep_fp32_nodes]
    375 keep_fp32_nodes += nodes_to_add

File /usr/local/lib/python3.8/dist-packages/transformer_deploy/backends/ort_utils.py:304, in find_node_fp32(graph, output_nodes)
    299     # out of FP16 range
    300     print("Tensor is ", tensor)
    301     if (
    302         torch.any(tensor > max_float16)
    303         or torch.any(tensor < min_float16)
--> 304         or (torch.any(tensor < resolution & tensor > -resolution & tensor != 0))  # limited memory footprint
    305     ):
    306         keep_fp32.append(graph[k])
    307 return keep_fp32

RuntimeError: "bitwise_and_cuda" not implemented for 'Float'
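
For context, this is standard Python operator precedence: & binds more tightly than the comparisons, so the expression attempts a bitwise AND on float tensors. A hedged illustration of the kind of parenthesized fix (the repository may have resolved it differently):

import torch

tensor = torch.tensor([0.0, 1e-8, 2.0])
resolution = 1e-4  # illustrative threshold for "too small to represent in FP16"
# parenthesize each comparison so & combines boolean masks instead of float tensors
keep_fp32 = torch.any((tensor < resolution) & (tensor > -resolution) & (tensor != 0))
print(keep_fp32)  # tensor(True): 1e-8 is non-zero but below the resolution threshold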

Calibration failure occurred with no scaling factors detected

Hey,

first of all, thanks a lot for your great work. This repo was already a great help to me.

With your quantization update for INT8, however, I ran into a problem. As soon as I activate --quantization, I get the following error:

[01/14/2022-11:18:37] [TRT] [W] Calibrator is not being used. Users must provide dynamic range for all tensors that are not Int32 or Bool.
[01/14/2022-11:18:37] [TRT] [E] 4: [standardEngineBuilder.cpp::initCalibrationParams::1402] Error Code 4: Internal Error (Calibration failure occurred with no scaling factors detected. This could be due to no int8 calibrator or insufficient custom scales for network layers. Please see int8 sample to setup calibration correctly.)
[01/14/2022-11:18:37] [TRT] [E] 2: [builder.cpp::buildSerializedNetwork::609] Error Code 2: Internal Error (Assertion enginePtr != nullptr failed. )

Traceback (most recent call last):
  File "/data/repos/transformer-deploy/src/transformer_deploy/convert.py", line 326, in <module>
    entrypoint()
  File "/data/repos/transformer-deploy/src/transformer_deploy/convert.py", line 322, in entrypoint
    main(commands=args)
  File "/data/repos/transformer-deploy/src/transformer_deploy/convert.py", line 216, in main
    engine: ICudaEngine = build_engine(
  File "/data/repos/transformer-deploy/src/transformer_deploy/backends/trt_utils.py", line 181, in build_engine
    engine: ICudaEngine = runtime.deserialize_cuda_engine(trt_engine)
TypeError: deserialize_cuda_engine(): incompatible function arguments. The following argument types are supported:
    1. (self: tensorrt.tensorrt.Runtime, serialized_engine: buffer) -> tensorrt.tensorrt.ICudaEngine

Invoked with: <tensorrt.tensorrt.Runtime object at 0x7feb14128e30>, None

The problem in the traceback is then just that the trt_engine will be None. I don't get any other warnings or errors, so I'm a bit at a loss. I've tried with distilroberta-base and also with bert-base-uncased, but I get the same error each time. Did you, by any chance, run into the same problem at some point in time or do you see what the issue may be?

Thanks a lot in advance!

Dynamic batching does not give better latency for Roberta running on TensorRT.

Hi, I used your build_engine API to convert the Roberta model. While building, if I use a constant batch size for the input shapes, i.e. (min, optimal, max) -> (1, 1, 1) or (4, 4, 4), the model yields good results (faster than ORT and torch).

But when I convert it with a dynamic batch size, i.e. (min, optimal, max) -> (1, 4, 4), the model performs really slowly compared to ORT or torch.

code to understand the problem better:

# fast inference but constrained to use always 4 batches during inferencing
tensor_shapes = list(zip([4, 4, 4], [1, 128, 128]))

# slow inference
tensor_shapes = list(zip([1, 4, 4], [1, 128, 128]))

engine: ICudaEngine = build_engine(
    runtime=runtime,
    onnx_file_path=onnx_model_path,
    logger=trt_logger,
    min_shape=tensor_shapes[0],
    optimal_shape=tensor_shapes[1],
    max_shape=tensor_shapes[2],
    workspace_size=workspace_size * 1024**3,
    fp16=not quantization,
    int8=quantization,
    profiling=True,
)

save_engine(engine=engine, engine_file_path=tensorrt_path)

the complete build and inference logs for slow inference case (when converting with dynamic batch)

[06/02/2022-03:19:09] [TRT] [I] [MemUsageChange] Init CUDA: CPU +312, GPU +0, now: CPU 3789, GPU 2470 (MiB)
[06/02/2022-03:19:09] [TRT] [I] [MemUsageChange] Init CUDA: CPU +0, GPU +0, now: CPU 3790, GPU 2470 (MiB)
[06/02/2022-03:19:09] [TRT] [I] [MemUsageSnapshot] Begin constructing builder kernel library: CPU 3790 MiB, GPU 2470 MiB
[06/02/2022-03:19:09] [TRT] [I] [MemUsageSnapshot] End constructing builder kernel library: CPU 3924 MiB, GPU 2504 MiB
[06/02/2022-03:19:09] [TRT] [I] parsing TensorRT model
[libprotobuf WARNING google/protobuf/io/coded_stream.cc:604] Reading dangerously large protocol message.  If the message turns out to be larger than 2147483647 bytes, parsing will be halted for security reasons.  To increase the limit (or to disable these warnings), see CodedInputStream::SetTotalBytesLimit() in google/protobuf/io/coded_stream.h.
[libprotobuf WARNING google/protobuf/io/coded_stream.cc:81] The total number of bytes read was 1418322027
[06/02/2022-03:19:22] [TRT] [W] onnx2trt_utils.cpp:366: Your ONNX model has been generated with INT64 weights, while TensorRT does not natively support INT64. Attempting to cast down to INT32.
[06/02/2022-03:19:39] [TRT] [W] Output type must be INT32 for shape outputs
[06/02/2022-03:19:39] [TRT] [W] Output type must be INT32 for shape outputs
[06/02/2022-03:19:39] [TRT] [W] Output type must be INT32 for shape outputs
[06/02/2022-03:19:43] [TRT] [I] [MemUsageChange] Init cuBLAS/cuBLASLt: CPU +512, GPU +226, now: CPU 5802, GPU 2730 (MiB)
[06/02/2022-03:19:43] [TRT] [I] [MemUsageChange] Init cuDNN: CPU +116, GPU +52, now: CPU 5918, GPU 2782 (MiB)
[06/02/2022-03:19:43] [TRT] [I] Timing cache disabled. Turning it on will improve builder speed.
[06/02/2022-03:19:43] [TRT] [W] Myelin graph with multiple dynamic values may have poor performance if they differ. Dynamic values are: 
[06/02/2022-03:19:43] [TRT] [W]  (# 1 (SHAPE input_ids))
[06/02/2022-03:19:43] [TRT] [W]  (# 0 (SHAPE attention_mask))
Warning: Slice op (Unnamed Layer_ 90) [Slice]_slice cannot slice along a uniform dimension.
Warning: Slice op (Unnamed Layer_ 90) [Slice]_slice cannot slice along a uniform dimension.
Warning: Slice op (Unnamed Layer_ 90) [Slice]_slice cannot slice along a uniform dimension.
Warning: Slice op (Unnamed Layer_ 90) [Slice]_slice cannot slice along a uniform dimension.
Warning: Slice op (Unnamed Layer_ 90) [Slice]_slice cannot slice along a uniform dimension.
Warning: Slice op (Unnamed Layer_ 90) [Slice]_slice cannot slice along a uniform dimension.
Warning: Slice op (Unnamed Layer_ 90) [Slice]_slice cannot slice along a uniform dimension.
[06/02/2022-03:25:32] [TRT] [W] Myelin graph with multiple dynamic values may have poor performance if they differ. Dynamic values are: 
[06/02/2022-03:25:32] [TRT] [W]  (# 1 (SHAPE input_ids))
[06/02/2022-03:25:32] [TRT] [W]  (# 0 (SHAPE attention_mask))
Warning: Slice op (Unnamed Layer_ 90) [Slice]_slice cannot slice along a uniform dimension.
Warning: Slice op (Unnamed Layer_ 90) [Slice]_slice cannot slice along a uniform dimension.
Warning: Slice op (Unnamed Layer_ 90) [Slice]_slice cannot slice along a uniform dimension.
Warning: Slice op (Unnamed Layer_ 90) [Slice]_slice cannot slice along a uniform dimension.
Warning: Slice op (Unnamed Layer_ 90) [Slice]_slice cannot slice along a uniform dimension.
Warning: Slice op (Unnamed Layer_ 90) [Slice]_slice cannot slice along a uniform dimension.
Warning: Slice op (Unnamed Layer_ 90) [Slice]_slice cannot slice along a uniform dimension.
[06/02/2022-03:30:10] [TRT] [I] Detected 2 inputs and 1 output network tensors.
[06/02/2022-03:30:10] [TRT] [W] Myelin graph with multiple dynamic values may have poor performance if they differ. Dynamic values are: 
[06/02/2022-03:30:10] [TRT] [W]  (# 1 (SHAPE input_ids))
[06/02/2022-03:30:10] [TRT] [W]  (# 0 (SHAPE attention_mask))
[06/02/2022-03:30:32] [TRT] [I] Total Host Persistent Memory: 208
[06/02/2022-03:30:32] [TRT] [I] Total Device Persistent Memory: 0
[06/02/2022-03:30:32] [TRT] [I] Total Scratch Memory: 442827264
[06/02/2022-03:30:32] [TRT] [I] [MemUsageStats] Peak memory usage of TRT CPU/GPU memory allocators: CPU 774 MiB, GPU 2058 MiB
[06/02/2022-03:30:32] [TRT] [I] [BlockAssignment] Algorithm ShiftNTopDown took 0.038945ms to assign 4 blocks to 4 nodes requiring 443041280 bytes.
[06/02/2022-03:30:32] [TRT] [I] Total Activation Memory: 443041280
[06/02/2022-03:30:32] [TRT] [I] [MemUsageChange] Init cuBLAS/cuBLASLt: CPU +0, GPU +8, now: CPU 5993, GPU 4298 (MiB)
[06/02/2022-03:30:32] [TRT] [I] [MemUsageChange] Init cuDNN: CPU +0, GPU +8, now: CPU 5993, GPU 4306 (MiB)
[06/02/2022-03:30:32] [TRT] [I] [MemUsageChange] TensorRT-managed allocation in building engine: CPU +0, GPU +1353, now: CPU 0, GPU 1353 (MiB)
[06/02/2022-03:30:33] [TRT] [I] Loaded engine size: 1364 MiB
[06/02/2022-03:30:33] [TRT] [I] [MemUsageChange] Init cuBLAS/cuBLASLt: CPU +0, GPU +10, now: CPU 7354, GPU 4282 (MiB)
[06/02/2022-03:30:33] [TRT] [I] [MemUsageChange] Init cuDNN: CPU +1, GPU +8, now: CPU 7355, GPU 4290 (MiB)
[06/02/2022-03:30:33] [TRT] [I] [MemUsageChange] TensorRT-managed allocation in engine deserialization: CPU +0, GPU +1352, now: CPU 0, GPU 1352 (MiB)
[06/02/2022-03:30:38] [TRT] [I] Loaded engine size: 1364 MiB
[06/02/2022-03:30:38] [TRT] [I] [MemUsageChange] Init cuBLAS/cuBLASLt: CPU +0, GPU +10, now: CPU 7366, GPU 5636 (MiB)
[06/02/2022-03:30:38] [TRT] [I] [MemUsageChange] Init cuDNN: CPU +1, GPU +8, now: CPU 7367, GPU 5644 (MiB)
[06/02/2022-03:30:38] [TRT] [I] [MemUsageChange] TensorRT-managed allocation in engine deserialization: CPU +0, GPU +1352, now: CPU 0, GPU 2704 (MiB)
[06/02/2022-03:30:38] [TRT] [I] [MemUsageChange] Init cuBLAS/cuBLASLt: CPU +0, GPU +10, now: CPU 6002, GPU 5636 (MiB)
[06/02/2022-03:30:38] [TRT] [I] [MemUsageChange] Init cuDNN: CPU +0, GPU +8, now: CPU 6002, GPU 5644 (MiB)
[06/02/2022-03:30:43] [TRT] [I] [MemUsageChange] TensorRT-managed allocation in IExecutionContext creation: CPU +0, GPU +423, now: CPU 0, GPU 3127 (MiB)

latencies in ms
--------------------------------------------------
Pytorch 
--------------------------------------------------
[93.5968, 94.0308, 94.8224, 93.6746, 94.5972, 94.0188, 92.3105, 93.6535, 92.4908, 91.4413]
--------------------------------------------------
Onnxruntime 
 --------------------------------------------------
[81.445, 81.3684, 80.2145, 81.5339, 82.9578, 83.6845, 83.6738, 82.6652, 81.5462, 82.8237]
--------------------------------------------------
TensorRT (FP16) 
 --------------------------------------------------
[426.353, 425.1992, 426.0317, 425.8226, 426.8828, 428.0485, 426.3119, 426.4556, 425.4863, 426.0393]
--------------------------------------------------

Is this the expected behavior?

I want to convert the model to use dynamic batches. When inferencing, the model should be able to handle a variable batch size and perform faster. How can I achieve that?

Any help would be greatly appreciated, thank you in advance.

TRT error on fresh install

Running the example command on a fresh install, I get:

$ convert_model -m roberta-large-mnli --backend tensorrt onnx pytorch --seq-len 16 128 128 --batch-size 1 32 32

...
    engine = build_engine(
  File "/home/sam_havens/transformer-deploy/venv/lib/python3.8/site-packages/transformer_deploy/backends/trt_utils.py", line 181, in build_engine
    engine: ICudaEngine = runtime.deserialize_cuda_engine(trt_engine)
TypeError: deserialize_cuda_engine(): incompatible function arguments. The following argument types are supported:
    1. (self: tensorrt.tensorrt.Runtime, serialized_engine: buffer) -> tensorrt.tensorrt.ICudaEngine

Invoked with: <tensorrt.tensorrt.Runtime object at 0x7f5df2f18bb0>, None
Full error
$ convert_model -m roberta-large-mnli --backend tensorrt onnx pytorch --seq-len 16 128 128 --batch-size 1 32 32

Some weights of the model checkpoint at roberta-large-mnli were not used when initializing RobertaForSequenceClassification: ['roberta.pooler.dense.bias', 'roberta.pooler.dense.weight']
- This IS expected if you are initializing RobertaForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing RobertaForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
[libprotobuf WARNING google/protobuf/io/coded_stream.cc:604] Reading dangerously large protocol message.  If the message turns out to be larger than 2147483647 bytes, parsing will be halted for security reasons.  To increase the limit (or to disable these warnings), see CodedInputStream::SetTotalBytesLimit() in google/protobuf/io/coded_stream.h.
[libprotobuf WARNING google/protobuf/io/coded_stream.cc:81] The total number of bytes read was 1421594533
[12/08/2021-00:18:52] [TRT] [W] onnx2trt_utils.cpp:366: Your ONNX model has been generated with INT64 weights, while TensorRT does not natively support INT64. Attempting to cast down to INT32.
[12/08/2021-00:19:08] [TRT] [W] Output type must be INT32 for shape outputs
[12/08/2021-00:19:08] [TRT] [W] Output type must be INT32 for shape outputs
[12/08/2021-00:19:08] [TRT] [W] Output type must be INT32 for shape outputs
[12/08/2021-00:19:12] [TRT] [W] Myelin graph with multiple dynamic values may have poor performance if they differ. Dynamic values are:
[12/08/2021-00:19:12] [TRT] [W]  (# 1 (SHAPE input_ids))
[12/08/2021-00:19:12] [TRT] [W]  (# 0 (SHAPE attention_mask))
Warning: Slice op (Unnamed Layer* 90) [Slice]_slice cannot slice along a uniform dimension.
[12/08/2021-00:19:39] [TRT] [W] Skipping tactic 0 due to Myelin error: No results returned from cublas heuristic search
[12/08/2021-00:19:39] [TRT] [W] Myelin graph with multiple dynamic values may have poor performance if they differ. Dynamic values are:
[12/08/2021-00:19:39] [TRT] [W]  (# 1 (SHAPE input_ids))
[12/08/2021-00:19:39] [TRT] [W]  (# 0 (SHAPE attention_mask))
Warning: Slice op (Unnamed Layer* 90) [Slice]_slice cannot slice along a uniform dimension.
[12/08/2021-00:20:10] [TRT] [W] Skipping tactic 0 due to Myelin error: No results returned from cublas heuristic search
[12/08/2021-00:20:10] [TRT] [E] 10: [optimizer.cpp::computeCosts::2011] Error Code 10: Internal Error (Could not find any implementation for node {ForeignNode[roberta.embeddings.token_type_embeddings.weight...(Unnamed Layer* 3884) [Shuffle]]}.)
[12/08/2021-00:20:10] [TRT] [E] 2: [builder.cpp::buildSerializedNetwork::609] Error Code 2: Internal Error (Assertion enginePtr != nullptr failed. )
Traceback (most recent call last):
  File "/home/sam_havens/transformer-deploy/venv/bin/convert_model", line 11, in <module>
    load_entry_point('transformer-deploy==0.1.1', 'console_scripts', 'convert_model')()
  File "/home/sam_havens/transformer-deploy/venv/lib/python3.8/site-packages/transformer_deploy/convert.py", line 129, in main
    engine = build_engine(
  File "/home/sam_havens/transformer-deploy/venv/lib/python3.8/site-packages/transformer_deploy/backends/trt_utils.py", line 181, in build_engine
    engine: ICudaEngine = runtime.deserialize_cuda_engine(trt_engine)
TypeError: deserialize_cuda_engine(): incompatible function arguments. The following argument types are supported:
    1. (self: tensorrt.tensorrt.Runtime, serialized_engine: buffer) -> tensorrt.tensorrt.ICudaEngine

Invoked with: <tensorrt.tensorrt.Runtime object at 0x7f5df2f18bb0>, None

If you have any ideas, I'd appreciate it! Thank you.

EDIT:

nvidia-smi
$ nvidia-smi
Wed Dec  8 00:35:05 2021
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 460.91.03    Driver Version: 460.91.03    CUDA Version: 11.2     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  Tesla T4            Off  | 00000000:00:04.0 Off |                    0 |
| N/A   41C    P8    10W /  70W |    105MiB / 15109MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|    0   N/A  N/A      1030      G   /usr/lib/xorg/Xorg                 95MiB |
|    0   N/A  N/A      1213      G   /usr/bin/gnome-shell                8MiB |
+-----------------------------------------------------------------------------+
pip freeze output
$ pip freeze
anyio==3.4.0
appdirs==1.4.4
asgiref==3.4.1
attrs==21.2.0
black==21.12b0
Brotli==1.0.9
certifi==2021.10.8
cffi==1.15.0
charset-normalizer==2.0.9
click==8.0.3
colored==1.4.3
coloredlogs==15.0.1
cryptography==36.0.0
cycler==0.11.0
distro==1.6.0
docker==5.0.3
fastapi==0.70.0
filelock==3.4.0
flake8==4.0.1
flatbuffers==2.0
fonttools==4.28.3
gevent==21.8.0
geventhttpclient==1.5.3
greenlet==1.1.2
grpcio==1.42.0
gunicorn==20.1.0
h11==0.12.0
httplib2==0.20.2
huggingface-hub==0.2.1
humanfriendly==10.0
idna==3.3
iniconfig==1.1.1
isort==5.10.1
joblib==1.1.0
kiwisolver==1.3.2
llvmlite==0.37.0
Mako==1.1.6
MarkupSafe==2.0.1
matplotlib==3.5.0
mccabe==0.6.1
mpmath==1.2.1
mypy-extensions==0.4.3
numba==0.54.1
numpy==1.21.4
nvidia-cublas-cu11==2021.10.25
nvidia-cublas-cu115==11.7.3.1
nvidia-cuda-runtime-cu11==2021.10.25
nvidia-cuda-runtime-cu115==11.5.50
nvidia-cudnn-cu11==2021.11.18
nvidia-cudnn-cu115==8.3.0.98
nvidia-pyindex==1.0.9
nvidia-tensorrt==8.2.1.8
onnx==1.10.2
onnx-graphsurgeon==0.3.14
onnxruntime-gpu==1.10.0
packaging==21.3
pathspec==0.9.0
pdfkit==1.0.0
Pillow==8.4.0
platformdirs==2.4.0
pluggy==1.0.0
polygraphy==0.33.2
prometheus-client==0.12.0
protobuf==3.19.1
psutil==5.8.0
py==1.11.0
pycodestyle==2.8.0
pycparser==2.21
pycuda==2021.1
pydantic==1.8.2
pyflakes==2.4.0
pyparsing==3.0.6
pytest==6.2.5
python-dateutil==2.8.2
python-rapidjson==1.5
pytools==2021.2.9
PyYAML==6.0
regex==2021.11.10
requests==2.26.0
sacremoses==0.0.46
sentencepiece==0.1.96
setuptools-scm==6.3.2
six==1.16.0
sniffio==1.2.0
starlette==0.16.0
sympy==1.9
tokenizers==0.10.3
toml==0.10.2
tomli==1.2.2
torch==1.10.0+cu113
tqdm==4.62.3
transformer-deploy==0.1.1
transformers==4.12.5
triton-model-analyzer==1.10.0
tritonclient==2.16.0
typing-extensions==4.0.1
urllib3==1.26.7
uvicorn==0.15.0
websocket-client==1.2.3
zope.event==4.5.0
zope.interface==5.4.0

Support private HuggingFace Hub models?

I think that, in order to support private HF Hub models, invocations of .from_pretrained (e.g. here) would need to take a use_auth_token parameter. It defaults to None; setting it to True uses the locally cached auth token (from running $ transformers-cli login), and it can also be set to a string, the API key found at https://huggingface.co/settings/token (or https://huggingface.co/organizations/ORG_NAME/settings/token for organizations).

Would you be open to a PR which adds this? I did something similar in fastT5.
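
For reference, a minimal sketch of what this looks like at the call sites, using the standard transformers from_pretrained API (the repository name and token string below are placeholders):

```python
from transformers import AutoModelForSequenceClassification, AutoTokenizer

# use_auth_token=True reads the token cached by `transformers-cli login`;
# a string passes an explicit API key instead ("my-org/private-model" is a placeholder)
model = AutoModelForSequenceClassification.from_pretrained(
    "my-org/private-model", use_auth_token=True
)
tokenizer = AutoTokenizer.from_pretrained("my-org/private-model", use_auth_token="hf_xxx")
```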

Support other tasks/architectures?

First off: thank you! This is a great project, I'm really grateful you released it publicly.

From what I can tell, this supports encoder-only architectures, and the Sequence Classification task (ex). Am I correct? If so, are there plans to support, or interest in supporting, other architectures (encoder/decoder, decoder-only) and/or tasks (Token Classification and Masked token prediction for encoder-only architectures, or Seq2SeqLM for the other architectures)?

Inference on CPU

I have converted a model as described in the tutorial, and now I have a triton_models folder containing:
model-original.onnx
model.plan
transformer_onnx_model
transformer_tensorrt_inference
transformer_tensorrt_tokenize
model.onnx
transformer_onnx_inference
transformer_onnx_tokenize
transformer_tensorrt_model

When I run

docker run -it --rm -p8000:8000 -p8001:8001 -p8002:8002 --shm-size 256m \
  -v $PWD/triton_models:/models nvcr.io/nvidia/tritonserver:22.01-py3 \
  bash -c "pip install transformers && tritonserver --model-repository=/models"

I get the following log:

WARNING: [Torch-TensorRT] - Unable to read CUDA capable devices. Return status: 35
I0215 08:42:16.993018 1 libtorch.cc:1227] TRITONBACKEND_Initialize: pytorch
I0215 08:42:16.993126 1 libtorch.cc:1237] Triton TRITONBACKEND API version: 1.7
I0215 08:42:16.993133 1 libtorch.cc:1243] 'pytorch' TRITONBACKEND API version: 1.7
2022-02-15 08:42:17.201026: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcudart.so.11.0
2022-02-15 08:42:17.246677: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcudart.so.11.0
I0215 08:42:17.246838 1 tensorflow.cc:2176] TRITONBACKEND_Initialize: tensorflow
I0215 08:42:17.246867 1 tensorflow.cc:2186] Triton TRITONBACKEND API version: 1.7
I0215 08:42:17.246922 1 tensorflow.cc:2192] 'tensorflow' TRITONBACKEND API version: 1.7
I0215 08:42:17.246933 1 tensorflow.cc:2216] backend configuration:
{}
I0215 08:42:17.249114 1 onnxruntime.cc:2232] TRITONBACKEND_Initialize: onnxruntime
I0215 08:42:17.249175 1 onnxruntime.cc:2242] Triton TRITONBACKEND API version: 1.7
I0215 08:42:17.249183 1 onnxruntime.cc:2248] 'onnxruntime' TRITONBACKEND API version: 1.7
I0215 08:42:17.249189 1 onnxruntime.cc:2278] backend configuration:
{}
I0215 08:42:17.271894 1 openvino.cc:1234] TRITONBACKEND_Initialize: openvino
I0215 08:42:17.271952 1 openvino.cc:1244] Triton TRITONBACKEND API version: 1.7
I0215 08:42:17.271959 1 openvino.cc:1250] 'openvino' TRITONBACKEND API version: 1.7
W0215 08:42:17.272063 1 pinned_memory_manager.cc:236] Unable to allocate pinned system memory, pinned memory pool will not be available: CUDA driver version is insufficient for CUDA runtime version
I0215 08:42:17.272140 1 cuda_memory_manager.cc:115] CUDA memory pool disabled
E0215 08:42:17.299306 1 model_repository_manager.cc:1844] Poll failed for model directory 'transformer_onnx_model': instance group transformer_onnx_model_0 of model transformer_onnx_model has kind KIND_GPU but no GPUs are available
E0215 08:42:17.311377 1 model_repository_manager.cc:1844] Poll failed for model directory 'transformer_onnx_tokenize': instance group transformer_onnx_tokenize_0 of model transformer_onnx_tokenize has kind KIND_GPU but no GPUs are available
E0215 08:42:17.326131 1 model_repository_manager.cc:1844] Poll failed for model directory 'transformer_tensorrt_model': instance group transformer_tensorrt_model_0 of model transformer_tensorrt_model has kind KIND_GPU but no GPUs are available
E0215 08:42:17.336737 1 model_repository_manager.cc:1844] Poll failed for model directory 'transformer_tensorrt_tokenize': instance group transformer_tensorrt_tokenize_0 of model transformer_tensorrt_tokenize has kind KIND_GPU but no GPUs are available
E0215 08:42:17.336814 1 model_repository_manager.cc:1332] Invalid argument: ensemble transformer_tensorrt_inference contains models that are not available: transformer_tensorrt_tokenize, transformer_tensorrt_model
E0215 08:42:17.336823 1 model_repository_manager.cc:1332] Invalid argument: ensemble transformer_onnx_inference contains models that are not available: transformer_onnx_tokenize, transformer_onnx_model
I0215 08:42:17.336862 1 server.cc:519]
+------------------+------+
| Repository Agent | Path |
+------------------+------+
+------------------+------+

I0215 08:42:17.336905 1 server.cc:546]
+-------------+-------------------------------------------------------------------------+--------+
| Backend | Path | Config |
+-------------+-------------------------------------------------------------------------+--------+
| pytorch | /opt/tritonserver/backends/pytorch/libtriton_pytorch.so | {} |
| tensorflow | /opt/tritonserver/backends/tensorflow1/libtriton_tensorflow1.so | {} |
| onnxruntime | /opt/tritonserver/backends/onnxruntime/libtriton_onnxruntime.so | {} |
| openvino | /opt/tritonserver/backends/openvino_2021_2/libtriton_openvino_2021_2.so | {} |
+-------------+-------------------------------------------------------------------------+--------+

I0215 08:42:17.336925 1 server.cc:589]
+-------+---------+--------+
| Model | Version | Status |
+-------+---------+--------+
+-------+---------+--------+

I0215 08:42:17.337008 1 tritonserver.cc:1865]
+----------------------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| Option | Value |
+----------------------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| server_id | triton |
| server_version | 2.18.0 |
| server_extensions | classification sequence model_repository model_repository(unload_dependents) schedule_policy model_configuration system_shared_memory cuda_shared_memory binary_tensor_data statistics |
| model_repository_path[0] | /models |
| model_control_mode | MODE_NONE |
| strict_model_config | 1 |
| rate_limit | OFF |
| pinned_memory_pool_byte_size | 268435456 |
| response_cache_byte_size | 0 |
| min_supported_compute_capability | 6.0 |
| strict_readiness | 1 |
| exit_timeout | 30 |
+----------------------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+

I0215 08:42:17.337084 1 server.cc:249] Waiting for in-flight requests to complete.
I0215 08:42:17.337097 1 server.cc:264] Timeout 30: Found 0 live models and 0 in-flight non-inference requests
error: creating server: Internal - failed to load all models

How can I fix that? Thanks!
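
The errors above come from the generated config.pbtxt files requesting GPU instances (kind KIND_GPU) on a machine where Triton sees no GPU. Two possible ways out, neither verified here: regenerate the models with convert_model's -d cpu option, or edit the config.pbtxt of each ONNX model so its instance group targets the CPU, e.g.:

```
instance_group [
    {
        count: 1
        kind: KIND_CPU
    }
]
```

Note that the TensorRT plan models (model.plan) cannot run on CPU at all, so only the ONNX branch of the generated layout would be usable in this setup.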

convert_model: error: argument --task: invalid choice: 'token-classification' (choose from 'classification', 'embedding', 'text-generation')

While running the command

docker run -it --rm --gpus all   -v $PWD:/project ghcr.io/els-rd/transformer-deploy:0.4.0   bash -c "cd /project && \
    convert_model -m \"kamalkraj/bert-base-cased-ner-conll2003\" \
    --backend tensorrt onnx \
    --seq-len 16 128 128 \
    --task token-classification"

I get the following error:

usage: convert_model [-h] -m MODEL [-t TOKENIZER] [--task {classification,embedding,text-generation}] [--auth-token AUTH_TOKEN]
                     [-b BATCH_SIZE BATCH_SIZE BATCH_SIZE] [-s SEQ_LEN SEQ_LEN SEQ_LEN] [-q] [-w WORKSPACE_SIZE] [-o OUTPUT] [-n NAME]
                     [-v] [--backend [{onnx,tensorrt} [{onnx,tensorrt} ...]]] [-d {cpu,cuda}] [--nb-threads NB_THREADS]
                     [--nb-instances NB_INSTANCES] [--warmup WARMUP] [--nb-measures NB_MEASURES] [--seed SEED] [--atol ATOL]
convert_model: error: argument --task: invalid choice: 'token-classification' (choose from 'classification', 'embedding', 'text-generation')

Is token-classification fully supported yet? Thanks!

Installing pytorch-quantization

I get the following error when pytorch-quantization is included in requirements.txt:

Looking in indexes: https://pypi.org/simple, https://pypi.ngc.nvidia.com
Collecting pytorch-quantization
  Downloading pytorch-quantization-0.0.1.dev5.tar.gz (7.9 kB)
  Preparing metadata (setup.py) ... error
  ERROR: Command errored out with exit status 1:
   command: /opt/conda/bin/python -c 'import io, os, sys, setuptools, tokenize; sys.argv[0] = '"'"'/tmp/pip-install-kmmchqi2/pytorch-quantization_5a45d9f46d524a108b6ec28eb5238500/setup.py'"'"'; __file__='"'"'/tmp/pip-install-kmmchqi2/pytorch-quantization_5a45d9f46d524a108b6ec28eb5238500/setup.py'"'"';f = getattr(tokenize, '"'"'open'"'"', open)(__file__) if os.path.exists(__file__) else io.StringIO('"'"'from setuptools import setup; setup()'"'"');code = f.read().replace('"'"'\r\n'"'"', '"'"'\n'"'"');f.close();exec(compile(code, __file__, '"'"'exec'"'"'))' egg_info --egg-base /tmp/pip-pip-egg-info-_58u5ior
       cwd: /tmp/pip-install-kmmchqi2/pytorch-quantization_5a45d9f46d524a108b6ec28eb5238500/
  Complete output (16 lines):
  Traceback (most recent call last):
    File "<string>", line 1, in <module>
    File "/tmp/pip-install-kmmchqi2/pytorch-quantization_5a45d9f46d524a108b6ec28eb5238500/setup.py", line 150, in <module>
      raise RuntimeError(open("ERROR.txt", "r").read())
  RuntimeError:
  ###########################################################################################
  The package you are trying to install is only a placeholder project on PyPI.org repository.
  This package is hosted on NVIDIA Python Package Index.
  
  This package can be installed as:
  ```
  $ pip install nvidia-pyindex
  $ pip install pytorch-quantization
  ```
  ###########################################################################################
  
  ----------------------------------------
WARNING: Discarding https://files.pythonhosted.org/packages/35/ea/c6c4ab73da4e36b9eddea7ff687b98e1bccb59bfb3bd0c24459914fb17f2/pytorch-quantization-0.0.1.dev5.tar.gz#sha256=4702207b088af5a1e58ee31d5ceee14aaa21bc3ef36b39ca996a6ee4d0ffb4dd (from https://pypi.org/simple/pytorch-quantization/). Command errored out with exit status 1: python setup.py egg_info Check the logs for full command output.
  Downloading pytorch-quantization-0.0.1.dev4.tar.gz (4.1 kB)
  Preparing metadata (setup.py) ... error
  ERROR: Command errored out with exit status 1:
   command: /opt/conda/bin/python -c 'import io, os, sys, setuptools, tokenize; sys.argv[0] = '"'"'/tmp/pip-install-kmmchqi2/pytorch-quantization_b0afa379038d4cbc8e3d8b401f5d74f4/setup.py'"'"'; __file__='"'"'/tmp/pip-install-kmmchqi2/pytorch-quantization_b0afa379038d4cbc8e3d8b401f5d74f4/setup.py'"'"';f = getattr(tokenize, '"'"'open'"'"', open)(__file__) if os.path.exists(__file__) else io.StringIO('"'"'from setuptools import setup; setup()'"'"');code = f.read().replace('"'"'\r\n'"'"', '"'"'\n'"'"');f.close();exec(compile(code, __file__, '"'"'exec'"'"'))' egg_info --egg-base /tmp/pip-pip-egg-info-s1sedxj7
       cwd: /tmp/pip-install-kmmchqi2/pytorch-quantization_b0afa379038d4cbc8e3d8b401f5d74f4/
  Complete output (15 lines):
  Traceback (most recent call last):
    File "<string>", line 1, in <module>
    File "/tmp/pip-install-kmmchqi2/pytorch-quantization_b0afa379038d4cbc8e3d8b401f5d74f4/setup.py", line 150, in <module>
      raise RuntimeError(open("ERROR.txt", "r").read())
  RuntimeError:
  ###########################################################################################
  The package you are trying to install is only a placeholder project on PyPI.org repository.
  This package is hosted on NVIDIA Python Package Index.
  
  This package can be installed as:
  ```
  $ pip install nvidia-pyindex
  $ pip install pytorch-quantization
 ```
  ###########################################################################################
  ----------------------------------------
WARNING: Discarding https://files.pythonhosted.org/packages/d4/3e/e891628c040badc4d18ca48a28bf5a991654161fb32ee5f54ec2317e2664/pytorch-quantization-0.0.1.dev4.tar.gz#sha256=6fea1f1ba851353d65f08098fe19041cd045ca9239e98e5f7058cb1872b6ea57 (from https://pypi.org/simple/pytorch-quantization/). Command errored out with exit status 1: python setup.py egg_info Check the logs for full command output.
ERROR: Could not find a version that satisfies the requirement pytorch-quantization (from versions: 0.0.1.dev4, 0.0.1.dev5)
ERROR: No matching distribution found for pytorch-quantization

I am testing this in the pytorch/pytorch:1.10.0-cuda11.3-cudnn8-devel Docker container.

Transforming https://huggingface.co/csarron/roberta-base-squad-v1

Hi,

I wanted to double-check how we can convert question answering models like the one mentioned above (it is a question answering model, i.e. transformers.AutoModelForQuestionAnswering). For example, when using vanilla Python, for the input:

{ "question": "Did the stock come down?", "context": "Some text about stocks, with reference if stocks went up or down..." }

the Q/A pipeline answers with a score, the start and end of the answer span, and the corresponding chunk of text (based on start and end), e.g.

{"score": 0.02739531360566616, "start": 103, "end": 173, "answer": "sent stocks sliding to their worst performance in months on Wednesday."}

After converting this model as a classification model, I am seeing, for the given input:

{"inputs": [{"name": "TEXT", "shape": [2], "datatype": "BYTES", "data": ["Did the stock come down?", "Some text about stocks, with reference if stocks went up or down..."]}]}

following output:

{"model_name":"roberta_onnx_inference","model_version":"1","parameters":{"sequence_id":0,"sequence_start":false,"sequence_end":false},"outputs":[{"name":"output","datatype":"FP32","shape":[1,2],"data":[-0.2091064453125,-0.06671142578125]}]}

As probably expected, it doesn't really output what it should, and I am wondering how we can change/extend the library to generate the correct outputs.
The model itself generates a set of tensors (of the same shape as the encoded input) for the start and end of the answer span (https://github.com/huggingface/transformers/blob/master/src/transformers/pipelines/question_answering.py#L371), whereas here I am given just two numbers, so I cannot use the output to invoke the post-processing logic of the QA pipeline.

Any support is appreciated :).
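
For context, a minimal sketch (standard transformers API, nothing specific to this repository) of what the QA head returns when called directly; this start/end logits pair, shaped like the tokenized input, is what the server would need to expose for the pipeline's post-processing to be applicable:

```python
import torch
from transformers import AutoModelForQuestionAnswering, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("csarron/roberta-base-squad-v1")
model = AutoModelForQuestionAnswering.from_pretrained("csarron/roberta-base-squad-v1")

inputs = tokenizer(
    "Did the stock come down?",
    "Some text about stocks, with reference if stocks went up or down...",
    return_tensors="pt",
)
with torch.inference_mode():
    outputs = model(**inputs)

# two tensors of shape [batch, sequence_length], one logit per token
print(outputs.start_logits.shape, outputs.end_logits.shape)
```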

query_body.bin contents

With an instance of the triton server up, running the test cURL command from the README file

# @ means no data conversion (curl feature)
curl -X POST  http://localhost:8000/v2/models/transformer_onnx_inference/versions/1/infer \
  --data-binary "@demo/query_body.bin" \
  --header "Inference-Header-Content-Length: 160"

gives me the error {"error":"unexpected inference output 'score' for model 'transformer_onnx_inference'"}. Looking at the triton_client.py script, it seems that score should be replaced by model_score in the query_body.bin file? However, after that change I still get an error: {"error":"failed to parse the request JSON buffer: Invalid value. at 160"}.

The triton_client.py script works fine for me, BTW.
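
As a possible workaround while the raw binary body is being debugged, an equivalent request can be built with the tritonclient package instead of hand-editing query_body.bin. A sketch, assuming the ensemble exposes a BYTES input named TEXT and an output named output (adjust to whatever your config.pbtxt declares):

```python
import numpy as np
import tritonclient.http as httpclient

client = httpclient.InferenceServerClient(url="localhost:8000")

# single string input; shape and content depend on your model
text = np.array([b"some text to classify"], dtype=object)
inputs = [httpclient.InferInput("TEXT", list(text.shape), "BYTES")]
inputs[0].set_data_from_numpy(text)
outputs = [httpclient.InferRequestedOutput("output", binary_data=False)]

result = client.infer("transformer_onnx_inference", inputs=inputs, outputs=outputs)
print(result.as_numpy("output"))
```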

How to run inference for T5 tensorrt model deployed on nvidia triton?

I have deployed a T5 TensorRT model on the NVIDIA Triton inference server (config.pbtxt below), but I am facing problems while running inference on the model using the Triton client.

As per the config.pbtxt file, there should be 4 inputs to the TensorRT model, including the decoder ids. But how can we send the decoder ids as input to the model? I think they are supposed to be generated from the model's output.

Is there any way to run inference using the Triton client?

name: "tensorrt_model"
platform: "tensorrt_plan"
max_batch_size: 0
input [
 {
    name: "input_ids"
    data_type: TYPE_INT32
    dims: [ -1, -1  ]
  },

{
    name: "attention_mask"
    data_type: TYPE_INT32
    dims: [-1, -1 ]
},

{
    name: "decoder_input_ids"
    data_type: TYPE_INT32
    dims: [ -1, -1]
},

{
   name: "decoder_attention_mask"
   data_type: TYPE_INT32
   dims: [ -1, -1 ]
}

]
output [
{
    name: "last_hidden_state"
    data_type: TYPE_FP32
    dims: [ -1, -1, 768 ]
  },

{
    name: "input.151"
    data_type: TYPE_FP32
    dims: [ -1, -1, -1 ]
  }

]

instance_group [
    {
        count: 1
        kind: KIND_GPU
    }
]
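
On the decoder_input_ids question: for T5, the first decoder input is the decoder start token (the pad token, id 0, for the stock T5 checkpoints), and later tokens come from the model's own output, so the client has to drive that loop itself. A rough sketch of a single request with tritonclient, using the model and input names from the config above (the tokenizer checkpoint is a placeholder, and this only shows the first decoding step):

```python
import numpy as np
import tritonclient.http as httpclient
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("t5-small")  # replace with your checkpoint
client = httpclient.InferenceServerClient(url="localhost:8000")

enc = tokenizer("translate English to French: The house is wonderful.", return_tensors="np")
# first decoder step: only the decoder start token (the pad token for T5)
decoder_ids = np.array([[tokenizer.pad_token_id]], dtype=np.int32)

inputs = []
for name, array in [
    ("input_ids", enc["input_ids"].astype(np.int32)),
    ("attention_mask", enc["attention_mask"].astype(np.int32)),
    ("decoder_input_ids", decoder_ids),
    ("decoder_attention_mask", np.ones_like(decoder_ids)),
]:
    infer_input = httpclient.InferInput(name, list(array.shape), "INT32")
    infer_input.set_data_from_numpy(array)
    inputs.append(infer_input)

result = client.infer("tensorrt_model", inputs=inputs)
print(result.as_numpy("last_hidden_state").shape)
```

Each subsequent decoding step appends the token picked from the model's output to decoder_input_ids and sends a new request; in practice that loop usually lives on the client side or in a Triton Python/BLS model rather than inside the TensorRT plan itself.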

Out of memory error for batch sizes greater than 1 for T5 models.

hey, first of all, thanks for creating this amazing library!

I'm following your T5 implementation with TensorRT:

input_id_shape = TensorRTShape(min_shape=[5, 1], optimal_shape=[5, 500], max_shape=[5, 500], input_name="input_ids")

I'm trying to convert the ONNX version of the T5 model to a TensorRT engine using your build_engine method.

It works fine for a batch size of 1, but for a batch size > 1 it takes much longer to build (almost an hour just for the t5-small encoder), and even then it does not build the model successfully; I get the following error:

[03/18/2022-12:51:55] [TRT] [E] 2: [virtualMemoryBuffer.cpp::resizePhysical::161] Error Code 2: OutOfMemory (no further information)
[03/18/2022-12:51:55] [TRT] [E] 2: [virtualMemoryBuffer.cpp::resizePhysical::161] Error Code 2: OutOfMemory (no further information)
[03/18/2022-12:51:55] [TRT] [E] 10: [optimizer.cpp::computeCosts::2011] Error Code 10: Internal Error (Could not find any implementation for node {ForeignNode[encoder.embed_tokens.weight...Mul_406]}.)
[03/18/2022-12:51:55] [TRT] [E] 2: [builder.cpp::buildSerializedNetwork::609] Error Code 2: Internal Error (Assertion enginePtr != nullptr failed. )
Traceback (most recent call last):
  File "export_onnx_to_trt.py", line 100, in <module>
    build_t5_engine(onnx_encoder_path, trt_encoder_path, [input_id_shape])
  File "export_onnx_to_trt.py", line 86, in build_t5_engine
    engine: ICudaEngine = build_engine(
  File "/app/utils.py", line 209, in build_engine
    engine: ICudaEngine = runtime.deserialize_cuda_engine(trt_engine)
TypeError: deserialize_cuda_engine(): incompatible function arguments. The following argument types are supported:
    1. (self: tensorrt.tensorrt.Runtime, serialized_engine: buffer) -> tensorrt.tensorrt.ICudaEngine

Invoked with: <tensorrt.tensorrt.Runtime object at 0x7f380bbf8930>, None

some system info if that helps;

  • trt+cuda - 8.2.1-1+cuda11.4
  • os - ubuntu 20.04.3
  • gpu - T4 with 15GB memory

The errors say I need more GPU memory. I was wondering how much GPU memory you used for a batch size of 5? Or maybe I'm missing something?

I would really appreciate any help, thank you!
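
One thing that may be worth experimenting with before adding GPU memory: in the shape above, the batch dimension is pinned to 5 for min, optimal and max, so the builder optimizes every tactic for the largest case. Below is a sketch of an alternative profile using the same TensorRTShape helper, with a dynamic batch axis and a smaller optimum; the values are purely illustrative and it is not guaranteed to avoid the build-time OOM:

```python
# dynamic batch 1..5, dynamic sequence length 1..500, optimum at batch 1 / seq 128
input_id_shape = TensorRTShape(
    min_shape=[1, 1],
    optimal_shape=[1, 128],
    max_shape=[5, 500],
    input_name="input_ids",
)
```

The builder workspace size (the convert_model CLI exposes a -w/--workspace-size flag for this) is another knob that bounds how much GPU memory the build phase may claim.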

Docker image timeout

Error response from daemon: Get https://ghcr.io/v2/: proxyconnect tcp: dial tcp: lookup http.docker.internal on 192.168.65.5:53: read udp 192.168.65.4:63771->192.168.65.5:53: i/o timeout

I cannot download the docker image!

Error in docker container using pip install

What is happening?
I'm using the docker container ghcr.io/els-rd/transformer-deploy:0.4.0 to run the embeddings example in the doc using the following command:

convert_model -m sentence-transformers/msmarco-distilbert-cos-v5 --backend onnx --task embedding --seq-len 128 128 128

When I launch the container and run the above command, I get the expected output without any issues. However, if I update the transformer-deploy package using pip install . and then run the above command, I get the following error:

Traceback (most recent call last):
  File "/usr/local/bin/convert_model", line 8, in <module>
    sys.exit(entrypoint())
  File "/usr/local/lib/python3.8/dist-packages/transformer_deploy/convert.py", line 388, in entrypoint
    main(commands=args)
  File "/usr/local/lib/python3.8/dist-packages/transformer_deploy/convert.py", line 355, in main
    ort_output, time_buffer = launch_inference(
  File "/usr/local/lib/python3.8/dist-packages/transformer_deploy/convert.py", line 113, in launch_inference
    output = infer(batch_input)
  File "/usr/local/lib/python3.8/dist-packages/transformer_deploy/convert.py", line 351, in infer_ort
    results = inference_onnx_binding(model_onnx=ort_model, inputs=inputs, device=commands.device)
  File "/usr/local/lib/python3.8/dist-packages/transformer_deploy/backends/ort_utils.py", line 255, in inference_onnx_binding
    binding.synchronize_inputs()
AttributeError: 'IOBinding' object has no attribute 'synchronize_inputs'

I want to try something which requires changing some lines in the convert.py file. I was hoping to install with pip after making the change and then run the test command as illustrated above. Is there more to installing the package, or is this a bug?

To Reproduce

git clone [email protected]:ELS-RD/transformer-deploy.git
cd transformer-deploy
# docker image may take a few minutes
docker pull ghcr.io/els-rd/transformer-deploy:0.4.0 

# run the docker container
docker run -it --rm --gpus all  -v $PWD:/project ghcr.io/els-rd/transformer-deploy:0.4.0 bash

# run the following commands from inside the container
# run test command. This should run without any issues
convert_model -m sentence-transformers/msmarco-distilbert-cos-v5 --backend onnx --task embedding --seq-len 128 128 128

# install transformer-deploy from the latest commit
pip install .

# run the test command again and it will error out
convert_model -m sentence-transformers/msmarco-distilbert-cos-v5 --backend onnx --task embedding --seq-len 128 128 128

Summary
After installing the latest commit in the Docker container, the convert command errors out. Is there a different installation procedure? Or is this a bug?

GPT2 has slow inference

Hello,

Your wrapper for GPT-2 does not support 'past_key_values' the way Hugging Face transformers does out of the box. I've seen your measurements in the GPT-2 demo and, at least for PyTorch, they are not really correct: instead of simply calling the model with always the same input, you should call the generate method.

I tried to run GPT-2 in PyTorch, both on CPU and GPU (GPU: Tesla T4), with your sample text: "Here is some text to encode Hello World"

Here are my results (vanilla PyTorch):
gpu no cache: 14s/sequence
gpu cache: 3.6s/sequence

cpu no cache: 114s/sequence
cpu cache: 9.8s/sequence

For every measurement, the result is the average of ten runs of the generate method; I used num_beams=5.

When running greedy search, the difference is not as big, but still:
cpu no cache: 29s
cpu cache: 4.8s

CPU: Intel(R) Xeon(R) Platinum 8259CL CPU
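
For context, a minimal sketch of the kind of measurement described above, using the vanilla transformers generate API rather than the repository's wrapper (model name, prompt and num_beams follow the issue; timings will obviously differ per machine):

```python
import time

import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2").eval()

inputs = tokenizer("Here is some text to encode Hello World", return_tensors="pt")


def bench(use_cache: bool, runs: int = 10) -> float:
    timings = []
    with torch.inference_mode():
        for _ in range(runs):
            start = time.perf_counter()
            model.generate(
                **inputs,
                max_length=64,
                num_beams=5,
                use_cache=use_cache,  # toggles past_key_values reuse
            )
            timings.append(time.perf_counter() - start)
    return sum(timings) / len(timings)


print("with cache   :", bench(True))
print("without cache:", bench(False))
```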

Output difference between ONNX and Pytorch in T5 notebook

Hi @pommedeterresautee, I just checked out the latest T5 optimization notebook. Everything seems to work well for the T5-base pretrained model from Hugging Face, but when I try to optimize the model from valhalla/t5-base-qa-qg-hl, I notice a difference in output between the two.

Below is a code snippet modified from the T5 notebook.

input_text = ''' Python is an interpreted, high-level, general-purpose programming language. Created by Guido van Rossum
and first released in 1991, Python's design philosophy emphasizes code
readability with its notable use of significant whitespace.'''
inputs = tokenizer(input_text, return_tensors="pt")
input_ids = inputs.input_ids.cuda()

torch.cuda.synchronize()
with torch.inference_mode():
    print("Onnx:")
    print(
        tokenizer.decode(
            model_gen.generate(
                inputs=input_ids,
                min_length=3,
                max_length=60,
                num_beams=4,
                no_repeat_ngram_size=2,
            )[0],
            skip_special_tokens=True,
        )
    )
    print("Pytorch:")
    print(
        tokenizer.decode(
            pytorch_model.generate(
                input_ids=input_ids,
                min_length=3,
                max_length=60,
                num_beams=4,
                no_repeat_ngram_size=2,
            )[0],
            skip_special_tokens=True,
        )
    )

And below are the outputs:

Onnx:
What is an interpreted, high-level, general-purpose programming language?
Pytorch:
What language was created by Guido van Rossum?

How to convert_model -m "./mycustomodel"

I have a custom model

class BERTClass(torch.nn.Module):
    def __init__(self, num_labels=4):
        super(BERTClass, self).__init__()
        self.l1 = BertModel.from_pretrained('bert-base-multilingual-uncased')
        self.l2 = torch.mean
        self.l3 = torch.nn.Linear(768, num_labels)

    def forward(self, ids, masks, token_type_ids):
        last_hidden_state, _ = self.l1(ids, attention_mask=masks, token_type_ids=token_type_ids)
        avg_pooling = self.l2(last_hidden_state, dim=1)
        output = self.l3(avg_pooling)

        return output

How do I go about saving my custom model so that I can run convert_model -m "./mycustomodel"?

Currently I am saving the model this way

model_2_save = model.module if hasattr(model, "module") else model

checkpoint = {
    'epoch': args.epochs,
    'num_labels': args.num_labels,
    'max_text_length': MAX_TEXT_LENGTH,
    'state_dict': model_2_save.state_dict()
}

torch.save(checkpoint, args.model_dir + "/pt_model.pt")

Is it better to save the pretrained part and convert it separately from the fully connected layer, then combine them after conversion, or do I need to derive my custom class from PreTrainedModel so as to be able to use save_pretrained? Do you happen to have an example I can follow? Thanks for the amazing repo.
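
Not an authoritative answer, but one possible route while waiting for feedback: since the downstream tooling consumes ONNX graphs, the custom module can be exported to ONNX directly with torch.onnx.export and the resulting file reused from there. A rough sketch under those assumptions (checkpoint path, dynamic axes and opset are illustrative; input names follow the forward signature above):

```python
import torch
from transformers import BertTokenizerFast

tokenizer = BertTokenizerFast.from_pretrained("bert-base-multilingual-uncased")
model = BERTClass(num_labels=4)
model.load_state_dict(torch.load("pt_model.pt", map_location="cpu")["state_dict"])
model.eval()

enc = tokenizer("some example text", return_tensors="pt")
dummy = (enc["input_ids"], enc["attention_mask"], enc["token_type_ids"])

torch.onnx.export(
    model,
    dummy,
    "model.onnx",
    input_names=["ids", "masks", "token_type_ids"],
    output_names=["output"],
    dynamic_axes={
        "ids": {0: "batch", 1: "sequence"},
        "masks": {0: "batch", 1: "sequence"},
        "token_type_ids": {0: "batch", 1: "sequence"},
        "output": {0: "batch"},
    },
    opset_version=13,
)
```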
