
onnxruntime_backend's Introduction

ONNX Runtime Backend

The Triton backend for the ONNX Runtime. You can learn more about Triton backends in the backend repo. Ask questions or report problems on the issues page.

Use a recent cmake to build and install in a local directory. Typically you will want to build an appropriate ONNX Runtime implementation as part of the build. You do this by specifying an ONNX Runtime version and a Triton container version that you want to use with the backend. You can find the combination of versions used in a particular Triton release in the TRITON_VERSION_MAP at the top of build.py in the branch matching the Triton release you are interested in. For example, to build the ONNX Runtime backend for Triton 23.04, use the versions from TRITON_VERSION_MAP in the r23.04 branch of build.py.

$ mkdir build
$ cd build
$ cmake -DCMAKE_INSTALL_PREFIX:PATH=`pwd`/install -DTRITON_BUILD_ONNXRUNTIME_VERSION=1.14.1 -DTRITON_BUILD_CONTAINER_VERSION=23.04 ..
$ make install

The resulting install/backends/onnxruntime directory can be added to a Triton installation as /opt/tritonserver/backends/onnxruntime.
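For example, assuming a standard Triton installation under /opt/tritonserver, the built backend could be copied into place roughly as follows (paths are illustrative):

$ mkdir -p /opt/tritonserver/backends
$ cp -r install/backends/onnxruntime /opt/tritonserver/backends/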

The following required Triton repositories will be pulled and used in the build. By default the "main" branch/tag will be used for each repo, but the listed CMake argument can be used to override it.

  • triton-inference-server/backend: -DTRITON_BACKEND_REPO_TAG=[tag]
  • triton-inference-server/core: -DTRITON_CORE_REPO_TAG=[tag]
  • triton-inference-server/common: -DTRITON_COMMON_REPO_TAG=[tag]

You can add TensorRT support to the ONNX Runtime backend by using -DTRITON_ENABLE_ONNXRUNTIME_TENSORRT=ON. You can add OpenVINO support by using -DTRITON_ENABLE_ONNXRUNTIME_OPENVINO=ON -DTRITON_BUILD_ONNXRUNTIME_OPENVINO_VERSION=<version>, where <version> is the OpenVINO version to use and should match the TRITON_VERSION_MAP entry as described above. So, to build with both TensorRT and OpenVINO support:

$ mkdir build
$ cd build
$ cmake -DCMAKE_INSTALL_PREFIX:PATH=`pwd`/install -DTRITON_BUILD_ONNXRUNTIME_VERSION=1.14.1 -DTRITON_BUILD_CONTAINER_VERSION=23.04 -DTRITON_ENABLE_ONNXRUNTIME_TENSORRT=ON -DTRITON_ENABLE_ONNXRUNTIME_OPENVINO=ON -DTRITON_BUILD_ONNXRUNTIME_OPENVINO_VERSION=2021.2.200 ..
$ make install

ONNX Runtime with TensorRT optimization

TensorRT can be used in conjunction with an ONNX model to further optimize performance. To enable TensorRT optimization you must set the model configuration appropriately. There are several optimizations available for TensorRT, like selection of the compute precision and workspace size. The optimization parameters and their descriptions are as follows.

  • precision_mode: The precision used for optimization. Allowed values are "FP32", "FP16" and "INT8". Default value is "FP32".
  • max_workspace_size_bytes: The maximum GPU memory the model can use temporarily during execution. Default value is 1GB.
  • int8_calibration_table_name: Specify the INT8 calibration table name. Applicable when precision_mode=="INT8" and the model does not contain Q/DQ nodes. If a calibration table is provided for a model with Q/DQ nodes, ORT session creation will fail.
  • int8_use_native_calibration_table: Which calibration table to use. Allowed values are 1 (use the native TensorRT-generated calibration table) and 0 (use the ORT-generated calibration table). Default is 0. Note: the latest calibration table file needs to be copied to trt_engine_cache_path before inference. The calibration table is specific to the model and calibration data set; whenever a new calibration table is generated, the old file in the path should be cleaned up or replaced.
  • trt_engine_cache_enable: Enable engine caching.
  • trt_engine_cache_path: Specify engine cache path.

To explore the usage of more parameters, follow the mapping table below and check the ONNX Runtime documentation for details.

Please link against the latest ONNX Runtime binaries in CMake or build from the main branch of ONNX Runtime to enable the latest options.

Parameter mapping between ONNX Runtime and Triton ONNXRuntime Backend

Key in Triton model configuration | Value in Triton model config | Corresponding TensorRT EP option in ONNX Runtime | Type
max_workspace_size_bytes | e.g. "4294967296" | trt_max_workspace_size | int
trt_max_partition_iterations | e.g. "1000" | trt_max_partition_iterations | int
trt_min_subgraph_size | e.g. "1" | trt_min_subgraph_size | int
precision_mode | "FP16" | trt_fp16_enable | bool
precision_mode | "INT8" | trt_int8_enable | bool
int8_calibration_table_name | | trt_int8_calibration_table_name | string
int8_use_native_calibration_table | e.g. "1" or "true", "0" or "false" | trt_int8_use_native_calibration_table | bool
trt_dla_enable | | trt_dla_enable | bool
trt_dla_core | e.g. "0" | trt_dla_core | int
trt_engine_cache_enable | e.g. "1" or "true", "0" or "false" | trt_engine_cache_enable | bool
trt_engine_cache_path | | trt_engine_cache_path | string
trt_engine_cache_prefix | | trt_engine_cache_prefix | string
trt_dump_subgraphs | e.g. "1" or "true", "0" or "false" | trt_dump_subgraphs | bool
trt_force_sequential_engine_build | e.g. "1" or "true", "0" or "false" | trt_force_sequential_engine_build | bool
trt_context_memory_sharing_enable | e.g. "1" or "true", "0" or "false" | trt_context_memory_sharing_enable | bool
trt_layer_norm_fp32_fallback | e.g. "1" or "true", "0" or "false" | trt_layer_norm_fp32_fallback | bool
trt_timing_cache_enable | e.g. "1" or "true", "0" or "false" | trt_timing_cache_enable | bool
trt_timing_cache_path | | trt_timing_cache_path | string
trt_force_timing_cache | e.g. "1" or "true", "0" or "false" | trt_force_timing_cache | bool
trt_detailed_build_log | e.g. "1" or "true", "0" or "false" | trt_detailed_build_log | bool
trt_build_heuristics_enable | e.g. "1" or "true", "0" or "false" | trt_build_heuristics_enable | bool
trt_sparsity_enable | e.g. "1" or "true", "0" or "false" | trt_sparsity_enable | bool
trt_builder_optimization_level | e.g. "3" | trt_builder_optimization_level | int
trt_auxiliary_streams | e.g. "-1" | trt_auxiliary_streams | int
trt_tactic_sources | e.g. "-CUDNN,+CUBLAS" | trt_tactic_sources | string
trt_extra_plugin_lib_paths | | trt_extra_plugin_lib_paths | string
trt_profile_min_shapes | e.g. "input1:dim1xdim2...,input2:dim1xdim2...,..." | trt_profile_min_shapes | string
trt_profile_max_shapes | e.g. "input1:dim1xdim2...,input2:dim1xdim2...,..." | trt_profile_max_shapes | string
trt_profile_opt_shapes | e.g. "input1:dim1xdim2...,input2:dim1xdim2...,..." | trt_profile_opt_shapes | string
trt_cuda_graph_enable | e.g. "1" or "true", "0" or "false" | trt_cuda_graph_enable | bool
trt_dump_ep_context_model | e.g. "1" or "true", "0" or "false" | trt_dump_ep_context_model | bool
trt_ep_context_file_path | | trt_ep_context_file_path | string
trt_ep_context_embed_mode | e.g. "1" | trt_ep_context_embed_mode | int

The section of the model config file specifying these parameters will look like:

.
.
.
optimization { execution_accelerators {
  gpu_execution_accelerator : [ {
    name : "tensorrt"
    parameters { key: "precision_mode" value: "FP16" }
    parameters { key: "max_workspace_size_bytes" value: "1073741824" }
    parameters { key: "trt_engine_cache_enable" value: "1" }
  } ]
}}
.
.
.

ONNX Runtime with CUDA Execution Provider optimization

When the GPU is enabled for ORT, the CUDA execution provider is enabled. If TensorRT is also enabled then the CUDA EP is treated as a fallback option (it is used only for nodes which TensorRT cannot execute). If TensorRT is not enabled then the CUDA EP is the primary EP that executes the models. ORT provides configuration options for the CUDA EP to further optimize based on the specific model and user scenario. To enable CUDA EP optimization you must set the model configuration appropriately. There are several optimizations available, such as selection of the memory limit, the cuDNN convolution algorithm, etc. The optimization parameters and their descriptions are as follows.

  • cudnn_conv_algo_search: CUDA convolution algorithm search configuration. Available options are: 0 - EXHAUSTIVE (expensive exhaustive benchmarking using cudnnFindConvolutionForwardAlgorithmEx), which is also the default option; 1 - HEURISTIC (lightweight heuristic-based search using cudnnGetConvolutionForwardAlgorithm_v7); 2 - DEFAULT (default algorithm using CUDNN_CONVOLUTION_FWD_ALGO_IMPLICIT_PRECOMP_GEMM).

  • gpu_mem_limit: CUDA memory limit. To use all possible memory pass in maximum size_t. Defaults to SIZE_MAX.

  • arena_extend_strategy: Strategy used to grow the memory arena. Available options are: 0 = kNextPowerOfTwo, 1 = kSameAsRequested. Defaults to 0.

  • do_copy_in_default_stream: Flag indicating if copying needs to take place on the same stream as the compute stream in the CUDA EP. Available options are: 0 = Use separate streams for copying and compute, 1 = Use the same stream for copying and compute. Defaults to 1.

The section of the model config file specifying these parameters will look like:

.
.
.
parameters { key: "cudnn_conv_algo_search" value: { string_value: "0" } }
parameters { key: "gpu_mem_limit" value: { string_value: "4294967200" } }
.
.
.
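The remaining CUDA EP options listed above (arena_extend_strategy, do_copy_in_default_stream) are set the same way; a minimal sketch with illustrative values:

parameters { key: "arena_extend_strategy" value: { string_value: "0" } }
parameters { key: "do_copy_in_default_stream" value: { string_value: "1" } }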

ONNX Runtime with OpenVINO optimization

OpenVINO can be used in conjunction with an ONNX model to further optimize performance. To enable OpenVINO optimization you must set the model configuration as shown below.

.
.
.
optimization { execution_accelerators {
  cpu_execution_accelerator : [ {
    name : "openvino"
  } ]
}}
.
.
.

Other Optimization Options with ONNX Runtime

Details regarding when to use these options and what to expect from them can be found here.

Model Config Options

  • intra_op_thread_count: Sets the number of threads used to parallelize execution within nodes. A value of 0 means ORT will pick a default, which is the number of cores.
  • inter_op_thread_count: Sets the number of threads used to parallelize execution of the graph (across nodes). If sequential execution is enabled this value is ignored. A value of 0 means ORT will pick a default, which is the number of cores.
  • execution_mode: Controls whether operators in the graph are executed sequentially or in parallel. Usually when the model has many branches, setting this option to 1, i.e. "parallel", will give you better performance. The default is 0, i.e. "sequential execution".
  • level: Refers to the graph optimization level. By default all optimizations are enabled. Allowed values are -1, 1 and 2. -1 refers to BASIC optimizations, 1 refers to basic plus extended optimizations like fusions, and 2 refers to all optimizations being disabled. Please find the details here.
optimization {
  graph : {
    level : 1
}}

parameters { key: "intra_op_thread_count" value: { string_value: "0" } }
parameters { key: "execution_mode" value: { string_value: "0" } }
parameters { key: "inter_op_thread_count" value: { string_value: "0" } }

  • enable_mem_arena: Use 1 to enable the arena and 0 to disable. See this for more information.
  • enable_mem_pattern: Use 1 to enable memory pattern and 0 to disable. See this for more information.
  • memory.enable_memory_arena_shrinkage: See this for more information.
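These options are also provided through the parameters section of the model config; a minimal sketch with illustrative values (the arena shrinkage value here assumes ORT's device-list format, e.g. "cpu:0"):

parameters { key: "enable_mem_arena" value: { string_value: "1" } }
parameters { key: "enable_mem_pattern" value: { string_value: "1" } }
parameters { key: "memory.enable_memory_arena_shrinkage" value: { string_value: "cpu:0" } }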

Command line options

Thread Pools

When the intra and inter op thread counts are set to 0 or to a value higher than 1, ORT by default creates a thread pool per session. This may not be ideal in every scenario, therefore ORT also supports global thread pools. When global thread pools are enabled, ORT creates one global thread pool which is shared by every session. Use the backend config to enable the global thread pool. When the global thread pool is enabled, the intra and inter op num threads config should also be provided via the backend config; config values provided in the model config will be ignored.

--backend-config=onnxruntime,enable-global-threadpool=<0,1>, --backend-config=onnxruntime,intra_op_thread_count=<int>, --backend-config=onnxruntime,inter_op_thread_count=<int>
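For example, a server launch that enables the global thread pool might look like the following (the model repository path and thread counts are illustrative):

tritonserver --model-repository=/models --backend-config=onnxruntime,enable-global-threadpool=1 --backend-config=onnxruntime,intra_op_thread_count=8 --backend-config=onnxruntime,inter_op_thread_count=2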

Default Max Batch Size

The default-max-batch-size value is used for max_batch_size during Autocomplete when no other value is found. Assuming the server was not launched with the --disable-auto-complete-config command-line option, the onnxruntime backend will set the max_batch_size of the model to this default value under the following conditions:

  1. Autocomplete has determined the model is capable of batching requests.
  2. max_batch_size is 0 in the model configuration or max_batch_size is omitted from the model configuration.

If max_batch_size > 1 and no scheduler is provided, the dynamic batch scheduler will be used.

--backend-config=onnxruntime,default-max-batch-size=<int>

The default value of default-max-batch-size is 4.
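For example, to raise the default to 8 (the model repository path is illustrative):

tritonserver --model-repository=/models --backend-config=onnxruntime,default-max-batch-size=8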


onnxruntime_backend's Issues

Memory Leaks Cause Server OOMs (CPU, TF2/ONNX)

Description
Using the latest Triton v22.01 with the ONNX backend (self-built) or with the TensorFlow backend (image from the NVIDIA container registry) and performing CPU inference, there seems to be a memory leak: memory usage increases with each request, finally leading to an OOM for the Triton Inference Server instance.

Triton Information
What version of Triton are you using? v22.01

Are you using the Triton container or did you build it yourself? Both the Triton container version and custom ONNX build are affected.

To Reproduce
Steps to reproduce the behavior.

  1. Run the Inference server to host the CRAFT model either in a model.savedmodel format (for TensorFlow backend) or model.onnx format.
  2. Perform multiple inferences using tritonclient[all]==2.18.0.
  3. Watch the memory usage of the Triton Inference Server increase (locally using docker stats).

Model configuration file:

name: "craft"
backend: "onnxruntime"
max_batch_size: 1
input [
  {
    name: "input_2"
    data_type: TYPE_FP32
    dims: [-1, -1, 3]
  }
]
output [
  {
    name: "conv_cls.8"
    data_type: TYPE_FP32
    dims: [-1, -1, 2]
  }
]
instance_group {
  kind: KIND_CPU
  count: 1
}
model_warmup {                                                                        
    name: "CRAFT Warmup"
    batch_size: 1                                                                     
    inputs: {                                                                         
        key: "input_2"                                                                  
        value: {                                                                      
            data_type: TYPE_FP32                                                      
            dims: [1024, 1024, 3]                                                                 
            zero_data: false
        }                                                                             
    }                                                                                 
}  

Example tritonclient usage (Python 3.10):

import numpy as np
import tritonclient.grpc as grpcclient
from tritonclient.grpc import InferenceServerClient, InferenceServerException

client = InferenceServerClient(
    url=TRITON_GRPC_SERVICE,
)
TRITON_CRAFT_MODEL_NAME = "craft"
TRITON_CRAFT_MODEL_VERSION = "1"
TRITON_CRAFT_MODEL_INPUT_NAME = "input_2"
TRITON_CRAFT_MODEL_OUTPUT_NAME = "conv_cls.8"

# input_image is an HxWx3 float32 numpy array
input = grpcclient.InferInput(
    name=TRITON_CRAFT_MODEL_INPUT_NAME,
    shape=[1, input_image.shape[0], input_image.shape[1], input_image.shape[2]],
    datatype="FP32",
)
output = grpcclient.InferRequestedOutput(TRITON_CRAFT_MODEL_OUTPUT_NAME)
input.set_data_from_numpy(np.array([input_image]))

response = client.infer(
    model_name=TRITON_CRAFT_MODEL_NAME, inputs=[input], outputs=[output],
    client_timeout=TRITON_GRPC_TIMEOUT
)
result = response.as_numpy(name=TRITON_CRAFT_MODEL_OUTPUT_NAME)

Example output of Triton Inference Server with --log-verbose=1 flag:

I0217 15:49:51.941706 1 grpc_server.cc:3206] New request handler for ModelInferHandler, 89
I0217 15:49:51.941713 1 model_repository_manager.cc:590] GetModel() 'craft' version -1
I0217 15:49:51.941723 1 model_repository_manager.cc:590] GetModel() 'craft' version -1
I0217 15:49:51.941744 1 infer_request.cc:566] prepared: [0x0x7f43c401c490] request id: , model: craft, requested version: -1, actual version: 1, flags: 0x0, correlation id: 0, batch size: 1, priority: 0, timeout (us): 0
original inputs:
[0x0x7f43c401c748] input: input_2, type: FP32, original shape: [1,1024,1024,3], batch + shape: [1,1024,1024,3], shape: [1024,1024,3]
override inputs:
inputs:
[0x0x7f43c401c748] input: input_2, type: FP32, original shape: [1,1024,1024,3], batch + shape: [1,1024,1024,3], shape: [1024,1024,3]
original requested outputs:
conv_cls.8
requested outputs:
conv_cls.8

I0217 15:49:51.941788 1 onnxruntime.cc:2427] model craft, instance craft_0, executing 1 requests
I0217 15:49:51.941796 1 onnxruntime.cc:1334] TRITONBACKEND_ModelExecute: Running craft_0 with 1 requests
2022-02-17 15:49:51.941856332 [I:onnxruntime:log, bfc_arena.cc:306 AllocateRawInternal] Extending BFCArena for Cpu. bin_num:20 (requested) num_bytes: 549453824 (actual) rounded_bytes:549453824
2022-02-17 15:49:51.941888464 [I:onnxruntime:log, bfc_arena.cc:186 Extend] Extended allocation by 1073741824 bytes.
2022-02-17 15:49:51.941898397 [I:onnxruntime:log, bfc_arena.cc:189 Extend] Total allocated bytes: 2281701376
2022-02-17 15:49:51.941904942 [I:onnxruntime:log, bfc_arena.cc:192 Extend] Allocated memory at 0x7f435bffe040 to 0x7f439bffe040
2022-02-17 15:49:51.959883920 [I:onnxruntime:, sequential_executor.cc:155 Execute] Begin execution

The last part is especially important, since it seems that the ONNX Runtime (in this case) allocates an additional 1 GB+ of memory:

2022-02-17 15:49:51.941888464 [I:onnxruntime:log, bfc_arena.cc:186 Extend] Extended allocation by 1073741824 bytes.
2022-02-17 15:49:51.941898397 [I:onnxruntime:log, bfc_arena.cc:189 Extend] Total allocated bytes: 2281701376
2022-02-17 15:49:51.941904942 [I:onnxruntime:log, bfc_arena.cc:192 Extend] Allocated memory at 0x7f435bffe040 to 0x7f439bffe040
2022-02-17 15:49:51.959883920 [I:onnxruntime:, sequential_executor.cc:155 Execute] Begin execution

OOM happens after the last log.

Expected behavior

The NVIDIA Triton Inference Server should not allow the OOM to happen. Memory should be freed after the inference request is performed.

How can I fix this issue?

ORT library build should happen in the onnxruntime_backend repo

Currently the ORT library is built from the "server" repo as part of the build.py script. Because that library is only used by the onnxruntime_backend repo, it should instead be built here. Keeping the build local to this repo will also simplify updating versions, etc.

Allow to specify tensorrt cache path per version

When using ONNX with TensorRT, the TensorRT engine cache path saves a lot of time. The drawback is that onnxruntime is not smart enough to avoid reusing the same cache if the model is different or the TensorRT version changed, causing a lot of errors.

It would be great if it could generate a TensorRT cache path per version of the model; that would at least avoid generating wrong outputs when changing the model version. If the path could also contain the GPU model and TensorRT version that would solve the other case as well, but I think that is less of a problem since it is acceptable to clear the cache when deploying new versions.

The warmup feature solves all these issues but it comes at the cost of a very slow startup; some models can take minutes to generate the TensorRT plan.
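For reference, the engine cache location is currently configured through the trt_engine_cache_path parameter described in the README above; a minimal sketch with an illustrative path:

optimization { execution_accelerators {
  gpu_execution_accelerator : [ {
    name : "tensorrt"
    parameters { key: "trt_engine_cache_enable" value: "1" }
    parameters { key: "trt_engine_cache_path" value: "/tmp/trt_cache" }
  } ]
}}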

Add execution_mode and inter_op_num_threads config options

Is your feature request related to a problem? Please describe.
There is no way to set the execution mode and inter op num threads when using ORT with Triton. These options are required for perf tuning.

Describe the solution you'd like
Add these options just like intra_op_num_threads

Describe alternatives you've considered
No alternatives available

Additional context
This is requested by the Olive team for the Olive + MA integration.
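For reference, these options are exposed as model config parameters in the README section above; a minimal sketch with illustrative values:

parameters { key: "execution_mode" value: { string_value: "1" } }
parameters { key: "inter_op_thread_count" value: { string_value: "4" } }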

Segmentation fault on initialization

Description
I have built the Triton Inference Server from scratch. The server works fine most of the time. Occasionally the server fails to initialize while restarting.

Also, what is the right procedure to stop the server?

Triton Information
2.11.0dev

Logs:
Verbose log:

I1130 13:52:53.786220 84965 metrics.cc:228] Collecting metrics for GPU 0: Quadro RTX 5000
I1130 13:52:53.786708 84965 shared_library.cc:44] OpenLibraryHandle: tritonserver/backends/onnxruntime/libtriton_onnxruntime.so
I1130 13:52:53.794101 84965 onnxruntime.cc:1830] TRITONBACKEND_Initialize: onnxruntime
I1130 13:52:53.794129 84965 onnxruntime.cc:1843] Triton TRITONBACKEND API version: 1.2
I1130 13:52:53.794138 84965 onnxruntime.cc:1849] 'onnxruntime' TRITONBACKEND API version: 1.2
I1130 13:52:54.052467 84965 pinned_memory_manager.cc:206] Pinned memory pool is created at '0x7fb5c6000000' with size 268435456
I1130 13:52:54.053931 84965 cuda_memory_manager.cc:103] CUDA memory pool is created on device 0 with size 67108864
I1130 13:52:54.055750 84965 backend_factory.h:44] Create TritonBackendFactory
I1130 13:52:54.055768 84965 ensemble_backend_factory.cc:47] Create EnsembleBackendFactory
I1130 13:52:54.061321 84965 model_repository_manager.cc:747] AsyncLoad() 'lm_english'
I1130 13:52:54.061375 84965 model_repository_manager.cc:986] TriggerNextAction() 'lm_english' version 1: 1
I1130 13:52:54.061387 84965 model_repository_manager.cc:1024] Load() 'lm_english' version 1
I1130 13:52:54.061393 84965 model_repository_manager.cc:1043] loading: lm_english:1
I1130 13:52:54.161580 84965 model_repository_manager.cc:747] AsyncLoad() 'gector_spanish'
I1130 13:52:54.161638 84965 model_repository_manager.cc:986] TriggerNextAction() 'gector_spanish' version 1: 1
I1130 13:52:54.161648 84965 model_repository_manager.cc:1024] Load() 'gector_spanish' version 1
I1130 13:52:54.161656 84965 model_repository_manager.cc:1043] loading: gector_spanish:1
I1130 13:52:54.161717 84965 model_repository_manager.cc:1103] CreateInferenceBackend() 'lm_english' version 1
I1130 13:52:54.162690 84965 onnxruntime.cc:1891] TRITONBACKEND_ModelInitialize: lm_english (version 1)
I1130 13:52:54.164107 84965 model_config_utils.cc:1521] ModelConfig 64-bit fields:
I1130 13:52:54.164121 84965 model_config_utils.cc:1523] 	ModelConfig::dynamic_batching::default_queue_policy::default_timeout_microseconds
I1130 13:52:54.164127 84965 model_config_utils.cc:1523] 	ModelConfig::dynamic_batching::max_queue_delay_microseconds
I1130 13:52:54.164134 84965 model_config_utils.cc:1523] 	ModelConfig::dynamic_batching::priority_queue_policy::value::default_timeout_microseconds
I1130 13:52:54.164139 84965 model_config_utils.cc:1523] 	ModelConfig::ensemble_scheduling::step::model_version
I1130 13:52:54.164145 84965 model_config_utils.cc:1523] 	ModelConfig::input::dims
I1130 13:52:54.164152 84965 model_config_utils.cc:1523] 	ModelConfig::input::reshape::shape
I1130 13:52:54.164158 84965 model_config_utils.cc:1523] 	ModelConfig::model_warmup::inputs::value::dims
I1130 13:52:54.164163 84965 model_config_utils.cc:1523] 	ModelConfig::optimization::cuda::graph_spec::graph_lower_bound::input::value::dim
I1130 13:52:54.164169 84965 model_config_utils.cc:1523] 	ModelConfig::optimization::cuda::graph_spec::input::value::dim
I1130 13:52:54.164175 84965 model_config_utils.cc:1523] 	ModelConfig::output::dims
I1130 13:52:54.164181 84965 model_config_utils.cc:1523] 	ModelConfig::output::reshape::shape
I1130 13:52:54.164187 84965 model_config_utils.cc:1523] 	ModelConfig::sequence_batching::direct::max_queue_delay_microseconds
I1130 13:52:54.164192 84965 model_config_utils.cc:1523] 	ModelConfig::sequence_batching::max_sequence_idle_microseconds
I1130 13:52:54.164198 84965 model_config_utils.cc:1523] 	ModelConfig::sequence_batching::oldest::max_queue_delay_microseconds
I1130 13:52:54.164204 84965 model_config_utils.cc:1523] 	ModelConfig::version_policy::specific::versions
WARNING: Since openmp is enabled in this build, this API cannot be used to configure intra op num threads. Please use the openmp environment variables to control the number of threads.
I1130 13:52:54.164623 84965 onnxruntime.cc:1935] TRITONBACKEND_ModelInstanceInitialize: lm_english_0 (GPU device 0)
I1130 13:52:54.167597 84965 backend_model_instance.cc:110] Creating instance lm_english_0 on GPU 0 (7.5) using artifact 'model.onnx'
I1130 13:52:54.168972 84965 onnxruntime.cc:272] CUDA Execution Accelerator is set for 'lm_english' on device 0
2021-11-30 05:52:54.184032982 [I:onnxruntime:, inference_session.cc:230 operator()] Flush-to-zero and denormal-as-zero are off
2021-11-30 05:52:54.184066626 [I:onnxruntime:, inference_session.cc:237 ConstructorCommon] Creating and using per session threadpools since use_per_session_threads_ is true
I1130 13:52:54.266827 84965 model_repository_manager.cc:747] AsyncLoad() 'gector'
I1130 13:52:54.266902 84965 model_repository_manager.cc:986] TriggerNextAction() 'gector' version 1: 1
I1130 13:52:54.266914 84965 model_repository_manager.cc:1024] Load() 'gector' version 1
I1130 13:52:54.266907 84965 model_repository_manager.cc:1103] CreateInferenceBackend() 'gector_spanish' version 1
I1130 13:52:54.266922 84965 model_repository_manager.cc:1043] loading: gector:1
I1130 13:52:54.267826 84965 onnxruntime.cc:1891] TRITONBACKEND_ModelInitialize: gector_spanish (version 1)
WARNING: Since openmp is enabled in this build, this API cannot be used to configure intra op num threads. Please use the openmp environment variables to control the number of threads.
I1130 13:52:54.268693 84965 onnxruntime.cc:1935] TRITONBACKEND_ModelInstanceInitialize: gector_spanish_0 (GPU device 0)
I1130 13:52:54.269407 84965 backend_model_instance.cc:110] Creating instance gector_spanish_0 on GPU 0 (7.5) using artifact 'model.onnx'
I1130 13:52:54.269467 84965 onnxruntime.cc:272] CUDA Execution Accelerator is set for 'gector_spanish' on device 0
2021-11-30 05:52:54.269494069 [I:onnxruntime:, inference_session.cc:237 ConstructorCommon] Creating and using per session threadpools since use_per_session_threads_ is true
I1130 13:52:54.367395 84965 model_repository_manager.cc:1103] CreateInferenceBackend() 'gector' version 1
I1130 13:52:54.368207 84965 onnxruntime.cc:1891] TRITONBACKEND_ModelInitialize: gector (version 1)
WARNING: Since openmp is enabled in this build, this API cannot be used to configure intra op num threads. Please use the openmp environment variables to control the number of threads.
I1130 13:52:54.369070 84965 onnxruntime.cc:1935] TRITONBACKEND_ModelInstanceInitialize: gector_0 (GPU device 0)
I1130 13:52:54.369789 84965 backend_model_instance.cc:110] Creating instance gector_0 on GPU 0 (7.5) using artifact 'model.onnx'
I1130 13:52:54.369838 84965 onnxruntime.cc:272] CUDA Execution Accelerator is set for 'gector' on device 0
2021-11-30 05:52:54.369865425 [I:onnxruntime:, inference_session.cc:237 ConstructorCommon] Creating and using per session threadpools since use_per_session_threads_ is true
2021-11-30 05:52:54.965184403 [I:onnxruntime:log, bfc_arena.cc:25 BFCArena] Creating BFCArena for Cuda with following configs: initial_chunk_size_bytes: 1048576 max_dead_bytes_per_chunk: 134217728 initial_growth_chunk_size_bytes: 2097152 memory limit: 18446744073709551615 arena_extend_strategy: 0
2021-11-30 05:52:54.965184528 [I:onnxruntime:log, bfc_arena.cc:25 BFCArena] Creating BFCArena for Cuda with following configs: initial_chunk_size_bytes: 1048576 max_dead_bytes_per_chunk: 134217728 initial_growth_chunk_size_bytes: 2097152 memory limit: 18446744073709551615 arena_extend_strategy: 0
2021-11-30 05:52:54.965224829 [V:onnxruntime:log, bfc_arena.cc:61 BFCArena] Creating 21 bins of max chunk size 256 to 268435456
2021-11-30 05:52:54.965240464 [V:onnxruntime:log, bfc_arena.cc:61 BFCArena] Creating 21 bins of max chunk size 256 to 268435456
2021-11-30 05:52:54.965258467 [I:onnxruntime:log, bfc_arena.cc:25 BFCArena] Creating BFCArena for CudaPinned with following configs: initial_chunk_size_bytes: 1048576 max_dead_bytes_per_chunk: 134217728 initial_growth_chunk_size_bytes: 2097152 memory limit: 18446744073709551615 arena_extend_strategy: 0
2021-11-30 05:52:54.965258381 [I:onnxruntime:log, bfc_arena.cc:25 BFCArena] Creating BFCArena for CudaPinned with following configs: initial_chunk_size_bytes: 1048576 max_dead_bytes_per_chunk: 134217728 initial_growth_chunk_size_bytes: 2097152 memory limit: 18446744073709551615 arena_extend_strategy: 0
2021-11-30 05:52:54.965267424 [V:onnxruntime:log, bfc_arena.cc:61 BFCArena] Creating 21 bins of max chunk size 256 to 268435456
2021-11-30 05:52:54.965274506 [V:onnxruntime:log, bfc_arena.cc:61 BFCArena] Creating 21 bins of max chunk size 256 to 268435456
2021-11-30 05:52:54.965293284 [I:onnxruntime:log, bfc_arena.cc:25 BFCArena] Creating BFCArena for CUDA_CPU with following configs: initial_chunk_size_bytes: 1048576 max_dead_bytes_per_chunk: 134217728 initial_growth_chunk_size_bytes: 2097152 memory limit: 18446744073709551615 arena_extend_strategy: 0
2021-11-30 05:52:54.965301742 [I:onnxruntime:log, bfc_arena.cc:25 BFCArena] Creating BFCArena for CUDA_CPU with following configs: initial_chunk_size_bytes: 1048576 max_dead_bytes_per_chunk: 134217728 initial_growth_chunk_size_bytes: 2097152 memory limit: 18446744073709551615 arena_extend_strategy: 0
2021-11-30 05:52:54.965302497 [V:onnxruntime:log, bfc_arena.cc:61 BFCArena] Creating 21 bins of max chunk size 256 to 268435456
2021-11-30 05:52:54.965313614 [V:onnxruntime:log, bfc_arena.cc:61 BFCArena] Creating 21 bins of max chunk size 256 to 268435456
2021-11-30 05:52:54.965382174 [I:onnxruntime:, inference_session.cc:1141 Initialize] Initializing session.
2021-11-30 05:52:54.965381623 [I:onnxruntime:, inference_session.cc:1141 Initialize] Initializing session.
2021-11-30 05:52:54.965400328 [I:onnxruntime:, inference_session.cc:1178 Initialize] Adding default CPU execution provider.
2021-11-30 05:52:54.965404988 [I:onnxruntime:, inference_session.cc:1178 Initialize] Adding default CPU execution provider.
2021-11-30 05:52:54.965412858 [I:onnxruntime:log, bfc_arena.cc:25 BFCArena] Creating BFCArena for Cpu with following configs: initial_chunk_size_bytes: 1048576 max_dead_bytes_per_chunk: 134217728 initial_growth_chunk_size_bytes: 2097152 memory limit: 18446744073709551615 arena_extend_strategy: 0
2021-11-30 05:52:54.965417716 [I:onnxruntime:log, bfc_arena.cc:25 BFCArena] Creating BFCArena for Cpu with following configs: initial_chunk_size_bytes: 1048576 max_dead_bytes_per_chunk: 134217728 initial_growth_chunk_size_bytes: 2097152 memory limit: 18446744073709551615 arena_extend_strategy: 0
2021-11-30 05:52:54.965428715 [V:onnxruntime:log, bfc_arena.cc:61 BFCArena] Creating 21 bins of max chunk size 256 to 268435456
2021-11-30 05:52:54.965433650 [V:onnxruntime:log, bfc_arena.cc:61 BFCArena] Creating 21 bins of max chunk size 256 to 268435456

Expose ONNXRuntime memory arena shrinkage option as configurable parameter.

Is your feature request related to a problem? Please describe.
Due to the ONNX Runtime memory arena, if a model is run with a large batch size, the model will hold a huge amount of GPU RAM forever, which prevents other models from running.

Describe the solution you'd like
ONNX Runtime has a run config option "memory.enable_memory_arena_shrinkage" to address this issue. With it enabled, the memory arena can be shrunk after each run, with some compromise on performance. This option should be exposed as a configurable parameter.

I have implemented my own version as follows in onnxruntime.cc and it solved my problem. I think many other people would like the same feature. However, I hard-coded cpu:0 and cpu:0;gpu:0 because my production environment only has one GPU, so I didn't make it a pull request.

  int shrink_arena = 0;
  triton::common::TritonJson::Value params;
  if (model_state->ModelConfig().Find("parameters", &params)) {
    THROW_IF_BACKEND_MODEL_ERROR(TryParseModelStringParameter(
        params, "enable_arena_shrinkage", &shrink_arena, 0));
  }

  if (shrink_arena) {
#ifdef TRITON_ENABLE_GPU
    THROW_IF_BACKEND_INSTANCE_ORT_ERROR(ort_api->AddRunConfigEntry(
        runOptions_, "memory.enable_memory_arena_shrinkage", "cpu:0;gpu:0"));
#else
    THROW_IF_BACKEND_INSTANCE_ORT_ERROR(ort_api->AddRunConfigEntry(
        runOptions_, "memory.enable_memory_arena_shrinkage", "cpu:0"));
#endif
  }

Half of CPU threads not utilized when running GPU model

Description

I noticed a pattern in CPU utilization when I ran the same GPU model on two VM:

When I run the same model, the 16-core machine shows 9/16 cores used by the Triton server (the number of Triton processes is 10), but the 8-core machine shows 5/8 cores used. At the same time, the speed I get from the 8-core machine is half that of the 16-core machine.

This looks like the Triton server is programmatically only able to utilize half of the CPU threads. This points to a CPU bottleneck because both set-ups have exactly the same GPU.

[Screenshot: CPU utilization on the 16-core machine]

[Screenshot: CPU utilization on the 8-core machine]

The speed on 16-core machine

Running 1m test @ http://127.0.0.1:5001/score
  16 threads and 16 connections
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency    95.24ms   38.37ms 675.90ms   93.78%
    Req/Sec    11.16      3.76    20.00     81.36%
  10343 requests in 1.00m, 4.93MB read
Requests/sec:    172.12
Transfer/sec:     84.03KB

The speed on 8-core machine (notice the RPS reduced by half)

Running 1m test @ http://127.0.0.1:5001/score
  16 threads and 16 connections
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency   212.91ms   69.24ms 918.24ms   78.50%
    Req/Sec     5.35      2.35    20.00     79.50%
  4534 requests in 1.00m, 2.16MB read
Requests/sec:     75.45
Transfer/sec:     36.84KB

Triton Information

  1. Triton version: 21.12
  2. using the triton container directly, not building ourselves

To Reproduce

The pattern is deterministically reproduced when sending the same queries to the GPU Triton model with ORT backend. When directly running the model in ORT, there is no such pattern.

worker_count=4
docker run -d -v $(pwd):/var/azureml-app --gpus=all --shm-size=1g --ulimit memlock=-1 --ulimit stack=67108864 -p5001:5001 -p8002:8002 -e "OMP_WAIT_POLICY=PASSIVE" -e "AZUREML_MODEL_DIR=model_dir" -e "WORKER_COUNT=${worker_count}" -e "WORKER_PRELOAD=false" -e 'AZUREML_ENTRY_SCRIPT=fluency_score.py' -e "AZUREML_EXTRA_REQUIREMENTS_TXT=requirements.txt" shmaheacr.azurecr.io/tritonserver-inference:21.12-triton-flag

wrk -s ~/post_prepost_multi_input.lua -c 8 -t 8 -d 1m http://127.0.0.1:5001/score

Expected behavior
Regardless of how many hardware CPU cores there are, the proportion of utilized cores should not differ.

Bug for Built target onnxruntime_providers

Description
A clear and concise description of what the bug is.
Thanks for your great work!
When I build the Triton server with Docker with the ONNX backend, I get many warnings and the build breaks at [ 77%] Built target onnxruntime_providers.
Do you also see these warnings during the build process?
Why do these errors happen?

Triton Information
What version of Triton are you using?
main
Are you using the Triton container or did you build it yourself?
I built it myself.
To Reproduce
Steps to reproduce the behavior.
I have built the buildbase container.
a. In the process of building the builder container and installing the ONNX backend, I got some unexpected errors in Step 37/49 : RUN ./build.sh.
b. We see many warnings like the following.
warning: ignoring return value of 'write', declared with attribute warn_unused_result [-Wunused-result]
warning: enumeration value 'AttributeProto_AttributeType_TYPE_PROTO' not handled in switch [-Wswitch]
/workspace/onnxruntime/cmake/external/onnx/onnx/defs/parser.cc:197:10: note: 'dblval' was declared here
/workspace/onnxruntime/cmake/external/protobuf/src/google/protobuf/repeated_field.h:1374:5: warning: 'floatval' may be used uninitialized in this function [-Wmaybe-uninitialized]
/workspace/onnxruntime/onnxruntime/core/providers/cuda/tensor/pad.cc:136:35: warning: comparison of integer expressions of different signedness: 'int32_t' {aka 'int'} and 'std::vector<long int>::size_type' {aka 'long unsigned int'} [-Wsign-compare]
c. The physical machine used to build the Triton server does not have a graphics card (such as a V100) installed. We think it does not matter. Is that right?

partial logs as following:

Step 1/49 : ARG BASE_IMAGE=nvcr.io/nvidia/tritonserver:21.12-py3-min
Step 2/49 : ARG ONNXRUNTIME_VERSION=1.10.0
Step 3/49 : ARG ONNXRUNTIME_REPO=https://github.com/microsoft/onnxruntime
Step 4/49 : ARG ONNXRUNTIME_BUILD_CONFIG=Release
Step 5/49 : ARG ONNXRUNTIME_OPENVINO_VERSION=2021.2.200
Step 6/49 : FROM ${BASE_IMAGE}
Step 8/49 : ENV DEBIAN_FRONTEND=noninteractive
Step 9/49 : RUN sed -i 's/archive.ubuntu.com/mirrors.ustc.edu.cn/g' /etc/apt/sources.list
Step 10/49 : RUN sed -i 's/security.ubuntu.com/mirrors.ustc.edu.cn/g' /etc/apt/sources.list
Step 35/49 : RUN sed -i 's/set_target_properties(onnxruntime PROPERTIES VERSION ${ORT_VERSION})//'
Step 36/49 : ENV CUDACXX="/usr/local/cuda/bin/nvcc"
Step 37/49 : RUN ./build.sh ${COMMON_BUILD_ARGS} --update --build --use_cuda --cuda_home "/usr/local/cuda" --cudnn_home "/usr/local/cudnn-8.3/cuda" --use_tensorrt --tensorrt_home "/usr/src/tensorrt" --use_openvino CPU_FP32
...
/workspace/onnxruntime/cmake/external/pytorch_cpuinfo/deps/clog/src/clog.c: In function 'clog_vlog_fatal':
/workspace/onnxruntime/cmake/external/pytorch_cpuinfo/deps/clog/src/clog.c:112:4: warning: ignoring return value of 'write', declared with attribute warn_unused_result [-Wunused-result]
112 | write(STDERR_FILENO, out_buffer, prefix_chars + format_chars + CLOG_SUFFIX_LENGTH);
| ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
...
[ 77%] Built target onnxruntime_providers
make: *** [Makefile:166: all] Error 2
Traceback (most recent call last):
File "/workspace/onnxruntime/tools/ci_build/build.py", line 2362, in
sys.exit(main())
File "/workspace/onnxruntime/tools/ci_build/build.py", line 2282, in main
build_targets(args, cmake_path, build_dir, configs, num_parallel_jobs, args.target)
File "/workspace/onnxruntime/tools/ci_build/build.py", line 1174, in build_targets
run_subprocess(cmd_args, env=env)
File "/workspace/onnxruntime/tools/ci_build/build.py", line 639, in run_subprocess
return run(*args, cwd=cwd, capture_stdout=capture_stdout, shell=shell, env=my_env)
File "/workspace/onnxruntime/tools/python/util/run.py", line 42, in run
completed_process = subprocess.run(
File "/usr/lib/python3.8/subprocess.py", line 516, in run
raise CalledProcessError(retcode, process.args,
subprocess.CalledProcessError: Command '['/usr/bin/cmake', '--build', '/workspace/build/Release', '--config', 'Release', '--', '-j8']' returned non-zero exit status 2.
The command '/bin/sh -c ./build.sh ${COMMON_BUILD_ARGS} --update --build --use_cuda --cuda_home "/usr/local/cuda" --cudnn_home "/usr/local/cudnn-8.3/cuda" --use_tensorrt --tensorrt_home "/usr/src/tensorrt" --use_openvino CPU_FP32' returned a non-zero code: 1
make[2]: *** [CMakeFiles/ort_target.dir/build.make:74: onnxruntime/lib/libonnxruntime.so] Error 1
make[2]: Leaving directory '/tmp/tritonbuild/onnxruntime/build'
make[1]: *** [CMakeFiles/Makefile2:143: CMakeFiles/ort_target.dir/all] Error 2
make[1]: Leaving directory '/tmp/tritonbuild/onnxruntime/build'
make: *** [Makefile:136: all] Error 2
error: make install failed
platform linux
machine x86_64
version 2.19.0dev
default repo-tag: main
backend "ensemble" at tag/branch "main"
backend "identity" at tag/branch "main"
backend "repeat" at tag/branch "main"
backend "square" at tag/branch "main"
backend "onnxruntime" at tag/branch "main"
backend "pytorch" at tag/branch "main"
repoagent "checksum" at tag/branch "main"
Building Triton Inference Server
component "common" at tag/branch "main"
component "core" at tag/branch "main"
component "backend" at tag/branch "main"
component "thirdparty" at tag/branch "main"
mkdir: /tmp/tritonbuild/tritonserver/build
cmake ['-DCMAKE_BUILD_TYPE=Release', '-DCMAKE_INSTALL_PREFIX:PATH=/tmp/tritonbuild/tritonserver/install', '-DTRITON_COMMON_REPO_TAG:STRING=main', '-DTRITON_CORE_REPO_TAG:STRING=main', '-DTRITON_BACKEND_REPO_TAG:STRING=main', '-DTRITON_THIRD_PARTY_REPO_TAG:STRING=main', '-DTRITON_ENABLE_LOGGING:BOOL=ON', '-DTRITON_ENABLE_STATS:BOOL=ON', '-DTRITON_ENABLE_METRICS:BOOL=ON', '-DTRITON_ENABLE_METRICS_GPU:BOOL=ON', '-DTRITON_ENABLE_TRACING:BOOL=ON', '-DTRITON_ENABLE_NVTX:BOOL=OFF', '-DTRITON_ENABLE_GPU:BOOL=ON', '-DTRITON_MIN_COMPUTE_CAPABILITY=6.0', '-DTRITON_ENABLE_MALI_GPU:BOOL=OFF', '-DTRITON_ENABLE_GRPC:BOOL=ON', '-DTRITON_ENABLE_HTTP:BOOL=ON', '-DTRITON_ENABLE_SAGEMAKER:BOOL=OFF', '-DTRITON_ENABLE_VERTEX_AI:BOOL=OFF', '-DTRITON_ENABLE_GCS:BOOL=ON', '-DTRITON_ENABLE_S3:BOOL=ON', '-DTRITON_ENABLE_AZURE_STORAGE:BOOL=ON', '-DTRITON_ENABLE_TENSORFLOW:BOOL=OFF', '-DTRITON_ENABLE_ENSEMBLE:BOOL=ON', '-DTRITON_ENABLE_ONNXRUNTIME:BOOL=ON', '-DTRITON_ENABLE_PYTHON:BOOL=OFF', '-DTRITON_ENABLE_DALI:BOOL=OFF', '-DTRITON_ENABLE_PYTORCH:BOOL=ON', '-DTRITON_ENABLE_OPENVINO:BOOL=OFF', '-DTRITON_ENABLE_FIL:BOOL=OFF', '-DTRITON_ENABLE_FASTERTRANSFORMER:BOOL=OFF', '-DTRITON_ENABLE_TENSORRT:BOOL=OFF', '-DTRITON_ENABLE_NVTX:BOOL=OFF', '-DTRITON_ENABLE_ARMNN_TFLITE:BOOL=OFF', '-DTRT_VERSION=8.2.1.8+cuda11.4.2.006', '-DDALI_VERSION=1.8.0', '/workspace/build']
make server
mkdir: /tmp/tritonbuild/install
cpdir: /tmp/tritonbuild/tritonserver/install -> /tmp/tritonbuild/install
mkdir: /tmp/tritonbuild
git clone of repo "identity_backend" at tag "main"
mkdir: /tmp/tritonbuild/identity/build
we achieve here.
cmake ['-DCMAKE_BUILD_TYPE=Release', '-DCMAKE_INSTALL_PREFIX:PATH=/tmp/tritonbuild/identity/install', '-DTRITON_COMMON_REPO_TAG:STRING=main', '-DTRITON_CORE_REPO_TAG:STRING=main', '-DTRITON_BACKEND_REPO_TAG:STRING=main', '-DTRITON_ENABLE_GPU:BOOL=ON', '-DTRITON_ENABLE_MALI_GPU:BOOL=OFF', '-DTRITON_ENABLE_STATS:BOOL=ON', '-DTRT_VERSION=8.2.1.8+cuda11.4.2.006', '-DDALI_VERSION=1.8.0', '..']
make install
rmdir: /tmp/tritonbuild/install/backends/identity
mkdir: /tmp/tritonbuild/install/backends/identity
cpdir: /tmp/tritonbuild/identity/install/backends/identity -> /tmp/tritonbuild/install/backends/identity
mkdir: /tmp/tritonbuild
git clone of repo "repeat_backend" at tag "main"
mkdir: /tmp/tritonbuild/repeat/build
we achieve here.
cmake ['-DCMAKE_BUILD_TYPE=Release', '-DCMAKE_INSTALL_PREFIX:PATH=/tmp/tritonbuild/repeat/install', '-DTRITON_COMMON_REPO_TAG:STRING=main', '-DTRITON_CORE_REPO_TAG:STRING=main', '-DTRITON_BACKEND_REPO_TAG:STRING=main', '-DTRITON_ENABLE_GPU:BOOL=ON', '-DTRITON_ENABLE_MALI_GPU:BOOL=OFF', '-DTRITON_ENABLE_STATS:BOOL=ON', '-DTRT_VERSION=8.2.1.8+cuda11.4.2.006', '-DDALI_VERSION=1.8.0', '..']
make install
rmdir: /tmp/tritonbuild/install/backends/repeat
mkdir: /tmp/tritonbuild/install/backends/repeat
cpdir: /tmp/tritonbuild/repeat/install/backends/repeat -> /tmp/tritonbuild/install/backends/repeat
mkdir: /tmp/tritonbuild
git clone of repo "square_backend" at tag "main"
mkdir: /tmp/tritonbuild/square/build
cmake ['-DCMAKE_BUILD_TYPE=Release', '-DCMAKE_INSTALL_PREFIX:PATH=/tmp/tritonbuild/square/install', '-DTRITON_COMMON_REPO_TAG:STRING=main', '-DTRITON_CORE_REPO_TAG:STRING=main', '-DTRITON_BACKEND_REPO_TAG:STRING=main', '-DTRITON_ENABLE_GPU:BOOL=ON', '-DTRITON_ENABLE_MALI_GPU:BOOL=OFF', '-DTRITON_ENABLE_STATS:BOOL=ON', '-DTRT_VERSION=8.2.1.8+cuda11.4.2.006', '-DDALI_VERSION=1.8.0', '..']
make install
rmdir: /tmp/tritonbuild/install/backends/square
mkdir: /tmp/tritonbuild/install/backends/square
cpdir: /tmp/tritonbuild/square/install/backends/square -> /tmp/tritonbuild/install/backends/square
mkdir: /tmp/tritonbuild
git clone of repo "onnxruntime_backend" at tag "main"
mkdir: /tmp/tritonbuild/onnxruntime/build
we achieve here.
cmake ['-DTRITON_BUILD_ONNXRUNTIME_VERSION=1.10.0', '-DTRITON_ENABLE_ONNXRUNTIME_TENSORRT:BOOL=ON', '-DTRITON_BUILD_CONTAINER_VERSION=21.12', '-DTRITON_ENABLE_ONNXRUNTIME_OPENVINO:BOOL=ON', '-DTRITON_BUILD_ONNXRUNTIME_OPENVINO_VERSION=2021.2.200', '-DCMAKE_BUILD_TYPE=Release', '-DCMAKE_INSTALL_PREFIX:PATH=/tmp/tritonbuild/onnxruntime/install', '-DTRITON_COMMON_REPO_TAG:STRING=main', '-DTRITON_CORE_REPO_TAG:STRING=main', '-DTRITON_BACKEND_REPO_TAG:STRING=main', '-DTRITON_ENABLE_GPU:BOOL=ON', '-DTRITON_ENABLE_MALI_GPU:BOOL=OFF', '-DTRITON_ENABLE_STATS:BOOL=ON', '-DTRT_VERSION=8.2.1.8+cuda11.4.2.006', '-DDALI_VERSION=1.8.0', '..']
make install
error: docker run tritonserver_builder failed
platform linux
machine x86_64
version 2.19.0dev
default repo-tag: main
container version 22.02dev
upstream container version 21.12
backend "ensemble" at tag/branch "main"
backend "identity" at tag/branch "main"
backend "repeat" at tag/branch "main"
backend "square" at tag/branch "main"
backend "onnxruntime" at tag/branch "main"
backend "pytorch" at tag/branch "main"
repoagent "checksum" at tag/branch "main"
buildbase container ['docker', 'build', '--network', 'bridge', '-f', '/tmp/citritonbuild/Dockerfile.buildbase', '--pull', '--cache-from=tritonserver_buildbase', '--cache-from=tritonserver_buildbase_cache0', '--cache-from=tritonserver_buildbase_cache1']
mkdir: /tmp/citritonbuild
buildbase env ['docker', 'run', '--rm', 'tritonserver_buildbase', 'env']
['docker', 'run', '--name', 'tritonserver_builder', '-w', '/workspace', '-v', '/var/run/docker.sock:/var/run/docker.sock', '--env', 'TRITONBUILD_TRT_VERSION=8.2.1.8+cuda11.4.2.006', '--env', 'TRITONBUILD_DALI_VERSION=1.8.0', 'tritonserver_buildbase', 'python3', './build.py', '--build-dir=/tmp/citritonbuild', '--enable-logging', '--enable-stats', '--enable-tracing', '--enable-metrics', '--enable-gpu-metrics', '--enable-gpu', '--filesystem=gcs', '--filesystem=azure_storage', '--filesystem=s3', '--endpoint=http', '--endpoint=grpc', '--repo-tag=common:main', '--repo-tag=core:main', '--repo-tag=backend:main', '--repo-tag=thirdparty:main', '--backend=ensemble', '--backend=identity:main', '--backend=repeat:main', '--backend=square:main', '--backend=onnxruntime:main', '--backend=pytorch:main', '--repoagent=checksum:main', '-v', '--no-container-build', '--version', '2.19.0dev', '--container-version', '22.02dev', '--upstream-container-version', '21.12', '--cmake-dir', '/workspace/build', '--build-dir', '/tmp/tritonbuild', '--install-dir', '/tmp/tritonbuild/install']

Describe the models (framework, inputs, outputs), ideally include the model configuration file (if using an ensemble include the model configuration file for that as well).
./build.py --build-dir=/tmp/citritonbuild --enable-logging --enable-stats --enable-tracing --enable-metrics --enable-gpu-metrics --enable-gpu --filesystem=gcs --filesystem=azure_storage --filesystem=s3 --endpoint=http --endpoint=grpc --repo-tag=common:main --repo-tag=core:main --repo-tag=backend:main --repo-tag=thirdparty:main --backend=ensemble --backend=identity:main --backend=repeat:main --backend=square:main --backend=onnxruntime:main --backend=pytorch:main --repoagent=checksum:main -v
Expected behavior
A clear and concise description of what you expected to happen.
I wish to build the triton server successfully!

Update ORT to 1.8.1

The latest version of ORT is 1.8.1 and backend should be updated to use it.

Model loading failure: densenet_onnx fails to load due to "pthread_setaffinity_np" failure

Description

I am testing tritonserver on the example models fetched using this script:
https://github.com/triton-inference-server/server/blob/main/docs/examples/fetch_models.sh

triton server is run as follows:

export MODEL_PATH=/tmp/tensorrt-inference-server
/opt/tritonserver/bin/tritonserver  --strict-model-config=false --model-store=$MODEL_PATH/docs/examples/model_repository 2>&1 | tee $MODEL_PATH/svrStatus.txt

the server fails with:

I1130 21:40:16.147155 3120 server.cc:267] Timeout 29: Found 0 live models and 0 in-flight non-inference requests
error: creating server: Internal - failed to load all models

The densenet_onnx model fails to load with:

| densenet_onnx        | 1       | UNAVAILABLE: Internal: onnx runtime error 1: /workspace/onnxruntime/onnxruntime/core/platform/posix/env.cc:173 onnxruntime::{anonymous}::PosixThread::PosixThread(const char*, int, unsigned int (*)(int, Eigen::ThreadPoolInterface*), Eigen::ThreadPoolInterface*, const onnxruntime::ThreadOptions&) pthread_setaffinity_np failed, error code: 2 error msg: No such file or directory |

The container has a restricted cpuset, which likely contributes to the above failure:

cat /sys/fs/cgroup/cpuset/cpuset.cpus
9-12,49-52

The tritonserver works fine on another container whose cpuset looks like this:

cat /sys/fs/cgroup/cpuset/cpuset.cpus
0-255

Likely the onnxruntime threadoptions affinity setting has to match the cpuset.

Triton Information
What version of Triton are you using?
2.15.0

Are you using the Triton container or did you build it yourself?
using the nvidia ngc container tritonserver:21.10-py3

To Reproduce

Run the tritonserver container with a restricted cpuset.
Inside container:

MODEL_PATH=/tmp/tensorrt-inference-server


git clone https://github.com/NVIDIA/tensorrt-inference-server.git
cd ${MODEL_PATH}/docs/examples/
bash fetch_models.sh

/opt/tritonserver/bin/tritonserver  --strict-model-config=false --model-store=$MODEL_PATH/docs/examples/model_repository 2>&1 | tee $MODEL_PATH/svrStatus.txt

Expected behavior

there should be no failure to load the densenet_onnx model.

Batch Support Error Triton ONNX Backend

Description
Hello,

I have an ONNX model. I am sharing the input and output dimensions of this model below.
[Screenshot: model input and output dimensions]

I need to deploy this model with Triton Inference Server.

Below is my config file:

name: "segmentation_model"
platform: "onnxruntime_onnx"
max_batch_size: 8
input [
{
name: "input"
data_type: TYPE_FP32
dims: [3, -1, -1 ]
}
]
output [
{
name: "output"
data_type: TYPE_INT64
dims: [ -1,-1,-1]
}
]

When I try to deploy the model with this config file, I get the error below, which I have not been able to solve.

Invalid argument: model 'segmentation_model', tensor 'output': for the model to support batching the shape should have at least 1 dimension and the first dimension must be -1; but shape expected by the model is [1,-1,-1,-1]

I need batched inference (for example, 7 images in one request). When I try to start tritonserver without a config file with this command:

tritonserver --strict-model-config=false --model-repository=triton_model_repository/

I can only run inference on a single image, not with batch size > 1.

How can I solve this problem?

Thanks

Triton Information
What version of Triton are you using?
tritonserver:21.08-py3

Are you using the Triton container or did you build it yourself?
Docker container

Not able to load simple iris model: Getting error: `Unsupported ONNX Type 'ONNX_TYPE_SEQUENCE'`

Description
Getting an error "failed to load 'model_onnx' version 1: Unsupported: Unsupported ONNX Type 'ONNX_TYPE_SEQUENCE' for I/O 'output_probability', expected 'ONNX_TYPE_TENSOR'"

Triton Information
nvcr.io/nvidia/tritonserver:21.10-py3

Are you using the Triton container or did you build it yourself?
Triton container

To Reproduce
Steps to reproduce the behavior.

  1. Train an iris model or use model: https://github.com/guyroyse/redisai-iris/blob/main/iris.onnx
# Train a model.
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
iris = load_iris()
X, y = iris.data, iris.target
X_train, X_test, y_train, y_test = train_test_split(X, y)
clr = RandomForestClassifier()
clr.fit(X_train, y_train)

# Convert into ONNX format
from skl2onnx import convert_sklearn
from skl2onnx.common.data_types import FloatTensorType
initial_type = [('float_input', FloatTensorType([None, 4]))]
onx = convert_sklearn(clr, initial_types=initial_type)
with open("rf_iris.onnx", "wb") as f:
    f.write(onx.SerializeToString())

# Compute the prediction with ONNX Runtime
import onnxruntime as rt
import numpy
sess = rt.InferenceSession("rf_iris.onnx")
input_name = sess.get_inputs()[0].name
label_name = sess.get_outputs()[0].name
print(input_name)
print(label_name)
pred_onx = sess.run([label_name], {input_name: X_test.astype(numpy.float32)})[0]
  2. Try to run the iris model on Triton with ONNX as the backend.
tritonserver --strict-model-config=false  --model-repository=/models
{
    "error": "load failed for model 'model_onnx': version 1: Unsupported: Unsupported ONNX Type 'ONNX_TYPE_SEQUENCE' for I/O 'output_probability', expected 'ONNX_TYPE_TENSOR'.;\n"
}

Describe the models (framework, inputs, outputs), ideally include the model configuration file (if using an ensemble include the model configuration file for that as well).

  • model.onnx
  • no config.pbtxt is required as it should be auto-generated in the case of ONNX.

Expected behavior
model.onnx load should not fail.

Don't always calculate all outputs.

Determine what outputs are needed by the requests in the batch and only calculate those (TF backend contains a representative implementation).

CPU inference is much slower than with ONNX Runtime directly

Description
Our Electra-based model takes about 540 ms per inference on CPU with ONNX Runtime (via the mcr.microsoft.com/azureml/onnxruntime:v1.4.0 container). The same model run through Triton r21.02 takes 1000+ ms on average. We've also tried with Triton r20.09, same result.

Triton Information
21.02

Are you using the Triton container or did you build it yourself?
Container, nvcr.io/nvidia/tritonserver:21.02-py3 and nvcr.io/nvidia/tritonserver:20.09-py3.

To Reproduce

I cannot share the full model but it's a PyTorch Transformer-based model exported from HuggingFace to ONNX.

Expected behavior
The inference time on CPU in Triton should be about the same as in ONNX Runtime directly.

Segfault during L0_lifecycle testing of 21.08 onnxruntime_backend

Description

Running L0_lifecycle against the 21.08 candidate build causes a segfault. The backtrace is:

#0  0x00007f6b652b418b in raise () from /usr/lib/x86_64-linux-gnu/libc.so.6
#1  0x00007f6b65293859 in abort () from /usr/lib/x86_64-linux-gnu/libc.so.6
#2  0x00007f6b652fe3ee in ?? () from /usr/lib/x86_64-linux-gnu/libc.so.6
#3  0x00007f6b6530647c in ?? () from /usr/lib/x86_64-linux-gnu/libc.so.6
#4  0x00007f6b65307cbc in ?? () from /usr/lib/x86_64-linux-gnu/libc.so.6
#5  0x00007f695c1bc35a in ?? () from /usr/local/cuda/compat/lib.real/libnvidia-ptxjitcompiler.so.1
#6  0x00007f695c1b98b2 in ?? () from /usr/local/cuda/compat/lib.real/libnvidia-ptxjitcompiler.so.1
#7  0x00007f695c1ba647 in ?? () from /usr/local/cuda/compat/lib.real/libnvidia-ptxjitcompiler.so.1
#8  0x00007f695bbd65ab in ?? () from /usr/local/cuda/compat/lib.real/libnvidia-ptxjitcompiler.so.1
#9  0x00007f695bbcc532 in ?? () from /usr/local/cuda/compat/lib.real/libnvidia-ptxjitcompiler.so.1
#10 0x00007f695bbd6df7 in ?? () from /usr/local/cuda/compat/lib.real/libnvidia-ptxjitcompiler.so.1
#11 0x00007f695bbd6e6b in ?? () from /usr/local/cuda/compat/lib.real/libnvidia-ptxjitcompiler.so.1
#12 0x00007f695bb6cfc5 in ?? () from /usr/local/cuda/compat/lib.real/libnvidia-ptxjitcompiler.so.1
#13 0x00007f695bb65643 in ?? () from /usr/local/cuda/compat/lib.real/libnvidia-ptxjitcompiler.so.1
#14 0x00007f695bb697c2 in ?? () from /usr/local/cuda/compat/lib.real/libnvidia-ptxjitcompiler.so.1
#15 0x00007f695bb6ac2c in ?? () from /usr/local/cuda/compat/lib.real/libnvidia-ptxjitcompiler.so.1
#16 0x00007f695bb5e05c in __cuda_CallJitEntryPoint ()
   from /usr/local/cuda/compat/lib.real/libnvidia-ptxjitcompiler.so.1
#17 0x00007f6b2dae5942 in ?? () from /usr/local/cuda/compat/lib.real/libcuda.so
#18 0x00007f6b2db3410d in ?? () from /usr/local/cuda/compat/lib.real/libcuda.so
#19 0x00007f6b2d8c7d7a in ?? () from /usr/local/cuda/compat/lib.real/libcuda.so
#20 0x00007f6b2d98935a in ?? () from /usr/local/cuda/compat/lib.real/libcuda.so
#21 0x00007f6b2d989adb in ?? () from /usr/local/cuda/compat/lib.real/libcuda.so
#22 0x00007f6b21f843cc in ?? () from /usr/local/cuda/targets/x86_64-linux/lib/libcudart.so.11.0
#23 0x00007f6b21f739ee in ?? () from /usr/local/cuda/targets/x86_64-linux/lib/libcudart.so.11.0
#24 0x00007f6b21f8a934 in ?? () from /usr/local/cuda/targets/x86_64-linux/lib/libcudart.so.11.0
#25 0x00007f6b21f8c552 in ?? () from /usr/local/cuda/targets/x86_64-linux/lib/libcudart.so.11.0
#26 0x00007f6b21f823fe in ?? () from /usr/local/cuda/targets/x86_64-linux/lib/libcudart.so.11.0
#27 0x00007f6b21f6197a in ?? () from /usr/local/cuda/targets/x86_64-linux/lib/libcudart.so.11.0
#28 0x00007f6b21f94265 in cudaDeviceSynchronize ()
   from /usr/local/cuda/targets/x86_64-linux/lib/libcudart.so.11.0
#29 0x00007f69943ed530 in onnxruntime::CUDAExecutionProvider::CUDAExecutionProvider(onnxruntime::CUDAExecutionProviderInfo const&) ()
   from /opt/tritonserver/backends/onnxruntime/libonnxruntime_providers_cuda.so
#30 0x00007f69943fd8f7 in onnxruntime::CUDAProviderFactory::CreateProvider() ()
   from /opt/tritonserver/backends/onnxruntime/libonnxruntime_providers_cuda.so
#31 0x00007f6a069b531c in (anonymous namespace)::InitializeSession(OrtSessionOptions const*, std::unique_ptr<onnxruntime::InferenceSession, std::default_delete<onnxruntime::InferenceSession> >&, OrtPrepackedWeightsContainer*) () from /opt/tritonserver/backends/onnxruntime/libonnxruntime.so.1.8.1
#32 0x00007f6a069b604d in OrtApis::CreateSession(OrtEnv const*, char const*, OrtSessionOptions const*, OrtSession**) () from /opt/tritonserver/backends/onnxruntime/libonnxruntime.so.1.8.1
#33 0x00007f6b105b7e6d in triton::backend::onnxruntime::OnnxLoader::LoadSession(bool, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, OrtSessionOptions const*, OrtSession**) () from /opt/tritonserver/backends/onnxruntime/libtriton_onnxruntime.so
#34 0x00007f6b105a5958 in triton::backend::onnxruntime::ModelState::LoadModel(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, TRITONSERVER_instancegroupkind_enum, int, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >*, OrtSession**, OrtAllocator**, CUstream_st*) ()
   from /opt/tritonserver/backends/onnxruntime/libtriton_onnxruntime.so
#35 0x00007f6b105a7ab9 in triton::backend::onnxruntime::ModelInstanceState::ModelInstanceState(triton::backend::onnxruntime::ModelState*, TRITONBACKEND_ModelInstance*) ()
   from /opt/tritonserver/backends/onnxruntime/libtriton_onnxruntime.so
#36 0x00007f6b105a8022 in triton::backend::onnxruntime::ModelInstanceState::Create(triton::backend::onnxruntime::ModelState*, TRITONBACKEND_ModelInstance*, triton::backend::onnxruntime::ModelInstanceState**) () from /opt/tritonserver/backends/onnxruntime/libtriton_onnxruntime.so
#37 0x00007f6b105a8466 in TRITONBACKEND_ModelInstanceInitialize ()
   from /opt/tritonserver/backends/onnxruntime/libtriton_onnxruntime.so
#38 0x00007f6b65e61f35 in nvidia::inferenceserver::TritonModelInstance::CreateInstance(nvidia::inferenceserver::TritonModel*, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, unsigned long, TRITONSERVER_instancegroupkind_enum, int, std::vector<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::allocator<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > > > const&, bool, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, std::map<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::less<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::allocator<std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > > > > const&) () from /opt/tritonserver/bin/../lib/libtritonserver.so
#39 0x00007f6b65e63c34 in nvidia::inferenceserver::TritonModelInstance::CreateInstances(nvidia::inferenceserver::TritonModel*, std::unordered_map<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::map<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::less<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::allocator<std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > > > >, std::hash<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::equal_to<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::allocator<std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const, std::map
<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::less<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::allocator<std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > > > > > > > const&, inference::ModelConfig const&)
    () from /opt/tritonserver/bin/../lib/libtritonserver.so
#40 0x00007f6b65e5eb4e in nvidia::inferenceserver::TritonModel::Create(nvidia::inferenceserver::InferenceServer*, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, std::unordered_map<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::vector<std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::allocator<std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > > > >, std::hash<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::equal_to<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::allocator<std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const, std::vector<std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::allocator<std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > > > > > > > const&, std::unordered_map<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::map<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::less<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::allocator<std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > > > >, std::hash<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::equal_to<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::allocator<std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const, std::map<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::less<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::allocator<std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > > > > > > > const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, long, inference::ModelConfig const&, std::unique_ptr<nvidia::inferenceserver::TritonModel, std::default_delete<nvidia::inferenceserver::TritonModel> >*) ()
   from /opt/tritonserver/bin/../lib/libtritonserver.so
#41 0x00007f6b65ce35db in nvidia::inferenceserver::ModelRepositoryManager::BackendLifeCycle::CreateInferenceBackend(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, long, nvidia::inferenceserver::ModelRepositoryManager::BackendLifeCycle::BackendInfo*) ()
   from /opt/tritonserver/bin/../lib/libtritonserver.so
#42 0x00007f6b65cf1681 in std::thread::_State_impl<std::thread::_Invoker<std::tuple<nvidia::inferenceserver::Status (nvidia::inferenceserver::ModelRepositoryManager::BackendLifeCycle::*)(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, long, nvidia::inferenceserver::ModelRepositoryManager::BackendLifeCycle::BackendInfo*), nvidia::inferenceserver::ModelRepositoryManager::BackendLifeCycle*, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, long, nvidia::inferenceserver::ModelRepositoryManager::BackendLifeCycle::BackendInfo*> > >::_M_run() () from /opt/tritonserver/bin/../lib/libtritonserver.so
#43 0x00007f6b656a2de4 in ?? () from /usr/lib/x86_64-linux-gnu/libstdc++.so.6
#44 0x00007f6b65b20609 in start_thread (arg=<optimized out>) at pthread_create.c:477
#45 0x00007f6b65390293 in clone () from /usr/lib/x86_64-linux-gnu/libc.so.6

To Reproduce

Run L0_lifecycle using the following cut-down test.sh

REPO_VERSION=${NVIDIA_TRITON_SERVER_VERSION}
if [ "$#" -ge 1 ]; then
    REPO_VERSION=$1
fi
if [ -z "$REPO_VERSION" ]; then
    echo -e "Repository version must be specified"
    echo -e "\n***\n*** Test Failed\n***"
    exit 1
fi

export CUDA_VISIBLE_DEVICES=0

TEST_RESULT_FILE='test_results.txt'
CLIENT_LOG="./client.log"
LC_TEST=lifecycle_test.py

DATADIR=/data/inferenceserver/${REPO_VERSION}

SERVER=/opt/tritonserver/bin/tritonserver
source ../common/util.sh

RET=0
rm -fr *.log

LOG_IDX=0

# LifeCycleTest.test_model_control
rm -fr models config.pbtxt.*
mkdir models
for i in onnx ; do
    cp -r $DATADIR/qa_model_repository/${i}_float32_float32_float32 models/.
    cp -r $DATADIR/qa_ensemble_model_repository/qa_model_repository/simple_${i}_float32_float32_float32 models/.
    sed -i "s/max_batch_size:.*/max_batch_size: 1/" models/${i}_float32_float32_float32/config.pbtxt
    sed -i "s/max_batch_size:.*/max_batch_size: 1/" models/simple_${i}_float32_float32_float32/config.pbtxt
done

SERVER_ARGS="--model-repository=`pwd`/models --model-control-mode=explicit \
             --exit-timeout-secs=5 --strict-model-config=false
             --strict-readiness=false"
SERVER_LOG="./inference_server_$LOG_IDX.log"
run_server
if [ "$SERVER_PID" == "0" ]; then
    echo -e "\n***\n*** Failed to start $SERVER\n***"
    cat $SERVER_LOG
    exit 1
fi

set +e
python $LC_TEST LifeCycleTest.test_model_control >>$CLIENT_LOG 2>&1
if [ $? -ne 0 ]; then
    echo -e "\n***\n*** Test Failed\n***"
    RET=1
else
    check_test_results $TEST_RESULT_FILE 1
    if [ $? -ne 0 ]; then
        cat $CLIENT_LOG
        echo -e "\n***\n*** Test Result Verification Failed\n***"
        RET=1
    fi
fi
set -e

kill $SERVER_PID
wait $SERVER_PID


if [ $RET -eq 0 ]; then
  echo -e "\n***\n*** Test Passed\n***"
fi

exit $RET

ORT backend causes Triton to crash for a failed inference run

Description
When ort_api->RunWithBinding(session_, runOptions_, io_binding_) returns a non-zero status code, the backend correctly calls SendErrorForResponses to return the error to the client. However, it does not stop further processing of the request, which leads to a segmentation fault. This crashes the entire server, which is especially problematic in a production environment.

Triton Information
21.08 containers

Are you using the Triton container or did you build it yourself?
Triton container

To Reproduce

We have a BERT-mini model that we cannot share, but the issue can be reproduced with a similar model. When running inference with incorrect inputs, the client does report the encountered error:

(screenshot of the client-side error omitted)

But the server crashes:

(screenshot of the server crash omitted)

See the backtrace of the segmentation fault below:

Thread 16 "tritonserver" received signal SIGSEGV, Segmentation fault.
[Switching to Thread 0x7f3854b39000 (LWP 510)]
0x00007f3857a361e6 in OrtApis::CastTypeInfoToTensorInfo(OrtTypeInfo const*, OrtTensorTypeAndShapeInfo const**) () from /opt/tritonserver/backends/onnxruntime/libonnxruntime.so.1.8.1
(gdb) bt
#0  0x00007f3857a361e6 in OrtApis::CastTypeInfoToTensorInfo(OrtTypeInfo const*, OrtTensorTypeAndShapeInfo const**) () from /opt/tritonserver/backends/onnxruntime/libonnxruntime.so.1.8.1
#1  0x00007f386819a669 in triton::backend::onnxruntime::ModelInstanceState::ReadOutputTensors(unsigned long, std::vector<char const*, std::allocator<char const*> > const&, TRITONBACKEND_Request**, unsigned int, std::vector<TRITONBACKEND_Response*, std::allocator<TRITONBACKEND_Response*> >*) ()
   from /opt/tritonserver/backends/onnxruntime/libtriton_onnxruntime.so
#2  0x00007f386819cb9a in triton::backend::onnxruntime::ModelInstanceState::ProcessRequests(TRITONBACKEND_Request**, unsigned int) ()
   from /opt/tritonserver/backends/onnxruntime/libtriton_onnxruntime.so
#3  0x00007f386819ed16 in TRITONBACKEND_ModelInstanceExecute ()
   from /opt/tritonserver/backends/onnxruntime/libtriton_onnxruntime.so
#4  0x00007f38b5488859 in std::_Function_handler<void (unsigned int, std::vector<std::unique_ptr<nvidia::inferenceserver::InferenceRequest, std::default_delete<nvidia::inferenceserver::InferenceRequest> >, std::allocator<std::unique_ptr<nvidia::inferenceserver::InferenceRequest, std::default_delete<nvidia::inferenceserver::InferenceRequest> > > >&&), nvidia::inferenceserver::TritonModel::Create(nvidia::inferenceserver::InferenceServer*, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, std::unordered_map<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::vector<std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::allocator<std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > > > >, std::hash<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::equal_to<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::allocator<std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const, std::vector<std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::allocator<std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > > > > > > > const&, std::unordered_map<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::map<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::less<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::allocator<std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > > > >, std::hash<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::equal_to<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::allocator<std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const, std::map<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::less<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::allocator<std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > > > > > > > const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, long, inference::ModelConfig const&, std::unique_ptr<nvidia::inferenceserver::TritonModel, std::default_delete<nvidia::inferenceserver::TritonModel> >*)::{lambda(unsigned int, std::vector<std::unique_ptr<nvidia::inferenceserver::InferenceRequest, std::default_delete<nvidia::inferenceserver::InferenceRequest> >, std::allocator<std::unique_ptr<nvidia::inferenceserver::InferenceRequest, 
std::default_delete<nvidia::inferenceserver::InferenceRequest> > > >&&)#2}>::_M_invoke(std::_Any_data const&, unsigned int&&, std::vector<std::unique_ptr<nvidia::inferenceserver::InferenceRequest, std::default_delete<nvidia::inferenceserver::InferenceRequest> >, std::allocator<std::unique_ptr<nvidia::inferenceserver::InferenceRequest, std::default_delete<nvidia::inferenceserver::InferenceRequest> > > >&&) ()
   from /opt/tritonserver/bin/../lib/libtritonserver.so
#5  0x00007f38b5285ec0 in nvidia::inferenceserver::DynamicBatchScheduler::SchedulerThread(unsigned int, int, std::shared_ptr<std::atomic<bool> > const&, std::promise<bool>*) ()
   from /opt/tritonserver/bin/../lib/libtritonserver.so
#6  0x00007f38b4cd1de4 in ?? () from /lib/x86_64-linux-gnu/libstdc++.so.6
#7  0x00007f38b514f609 in start_thread (arg=<optimized out>) at pthread_create.c:477
#8  0x00007f38b49bf293 in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:95 

Describe the models (framework, inputs, outputs), ideally include the model configuration file (if using an ensemble include the model configuration file for that as well).

The fault was observed with a mini BERT model, but the issue can easily be reproduced with any model that hits the same error.

Expected behavior
The backend should not segfault when an inference run fails.

Expose all string key/value configs instead of doing it piecemeal.

Is your feature request related to a problem? Please describe.
ORT exposes a bunch of string key/value configs here https://github.com/microsoft/onnxruntime/blob/master/include/onnxruntime/core/session/onnxruntime_session_options_config_keys.h but none of them are exposed by this backend. It would be nice to have them exposed once and for all in a generic fashion as opposed to doing it piecemeal.

Describe the solution you'd like
See above.

Describe alternatives you've considered
There are none.

Additional context
None
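
For reference, this is what a generic pass-through would map to on the ORT side; the Python binding already exposes the same mechanism via SessionOptions.add_session_config_entry (the key below is just one example from the header linked above):

import onnxruntime as ort

so = ort.SessionOptions()
# Any string key/value pair from onnxruntime_session_options_config_keys.h can be set here.
so.add_session_config_entry("session.use_env_allocators", "1")
sess = ort.InferenceSession("model.onnx", sess_options=so)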

ORT backend always returns tensor on CPU

Description
The ORT backend always returns output tensors on CPU, even when the model instance is on GPU, when it is invoked via BLS through the Python backend.

Expected behavior
The output tensor should be on the GPU when the instance kind is GPU for the ONNX model.
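
A minimal sketch (model and tensor names are hypothetical) of how the placement can be checked from a BLS call inside a python_backend model; with the behavior described above, is_cpu() comes back True even when the ONNX model instance is KIND_GPU:

import triton_python_backend_utils as pb_utils

def check_placement(input_tensor):
    # Issue a BLS request to the ONNX model and inspect where the output lives.
    request = pb_utils.InferenceRequest(
        model_name="model_onnx",
        requested_output_names=["OUTPUT0"],
        inputs=[input_tensor])
    response = request.exec()
    out = pb_utils.get_output_tensor_by_name(response, "OUTPUT0")
    return out.is_cpu()  # expected False for a GPU instance, observed True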

YOLOv3 ONNX model does not load

Description
YOLOv3 ONNX model does not load

Triton Information
What version of Triton are you using?
2.20
Are you using the Triton container or did you build it yourself?
Yes, version 22.03

To Reproduce

  1. download model from https://github.com/onnx/models/tree/main/vision/object_detection_segmentation/yolov3/model

  2. config.pbtxt as follows
    name: "yolov3-10_onnx"
    platform: "onnxruntime_onnx"
    max_batch_size: 128
    input [
    {
    name: "input_1"
    data_type: TYPE_FP32
    format: FORMAT_NCHW
    dims: [3, -1, -1 ]
    },
    {
    name: "image_shape"
    data_type: TYPE_FP32
    dims: [2]
    }
    ]
    output [
    {
    name: "yolonms_layer_1/ExpandDims_1:0"
    data_type: TYPE_FP32
    dims: [-1, 4]
    },
    {
    name: "yolonms_layer_1/ExpandDims_3:0"
    data_type: TYPE_FP32
    dims: [-1,-1]
    },
    {
    name: "yolonms_layer_1/concat_2:0"
    data_type: TYPE_INT32
    dims: [-1]
    }
    ]
    instance_group [
    {
    kind: KIND_GPU
    count: 1
    gpus: 0
    }
    ]

  3. sudo docker run --gpus=1 --rm -p8000:8000 -p8001:8001 -p8002:8002 -v/home/josephw/server/docs/examples/model_repository:/models nvcr.io/nvidia/tritonserver:22.03-py3 tritonserver --model-repository=/models

And see the following error

0330 19:07:41.647411 1 model_repository_manager.cc:1186] failed to load 'yolov3-10_onnx' version 1: Invalid argument: model 'yolov3-10_onnx', tensor 'yolonms_layer_1/ExpandDims_1:0': for the model to support batching the shape should have at least 1 dimension and the first dimension must be -1; but shape expected by the model is [1,-1,4]

If the problem appears to be a bug in the execution of the model itself, first attempt to run the model directly in ONNX Runtime. What is the output from loading and running the model in ORT directly? If there is a problem running the model directly with ORT, please submit an issue in the microsoft/onnxruntime (github.com) project.

If the problem appears to be in Triton itself, provide detailed steps to reproduce the behavior in Triton.

Describe the models (framework, inputs, outputs), ideally include the model configuration file (if using an ensemble include the model configuration file for that as well).

Expected behavior
Expect to load successfully
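
The message comes from the declared output shape [1,-1,4]: its leading dimension is fixed at 1, so with max_batch_size > 0 Triton cannot interpret it as a batch dimension. A quick way to inspect the declared shapes before writing the config (a sketch, assuming the downloaded file is named yolov3-10.onnx):

import onnx

model = onnx.load("yolov3-10.onnx")
for value in list(model.graph.input) + list(model.graph.output):
    dims = [d.dim_param or d.dim_value for d in value.type.tensor_type.shape.dim]
    print(value.name, dims)

# If the first output dimension prints as 1 rather than a symbolic name, either set
# max_batch_size: 0 and list the full shapes in the config, or re-export the model
# with a dynamic leading dimension.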

ORT_DISABLE_ALL optimization level

Is your feature request related to a problem? Please describe.
Our model uses a dropout layer that is removed by the ORT optimizer. It's unusual, but we use dropout at inference time. The onnxruntime_backend doesn't support setting:

GraphOptimizationLevel optimization_level = GraphOptimizationLevel::ORT_DISABLE_ALL.

From doc:

level: Refers to the graph optimization level. By default all optimizations are enabled. Allowed values are -1 and 1. -1 refers to BASIC optimizations and 1 refers to basic plus extended optimizations like fusions. Please find the details [here](https://onnxruntime.ai/docs/performance/graph-optimizations.html)

optimization {
  graph : {
    level : 1
}}

From source code:

GraphOptimizationLevel optimization_level =
    GraphOptimizationLevel::ORT_ENABLE_ALL;
{
  triton::common::TritonJson::Value optimization;
  if (ModelConfig().Find("optimization", &optimization)) {
    triton::common::TritonJson::Value graph;
    if (optimization.Find("graph", &graph)) {
      int64_t graph_level = 0;
      THROW_IF_BACKEND_MODEL_ERROR(graph.MemberAsInt("level", &graph_level));
      if (graph_level == -1) {
        optimization_level = GraphOptimizationLevel::ORT_ENABLE_BASIC;
      } else if (graph_level == 1) {
        optimization_level = GraphOptimizationLevel::ORT_ENABLE_EXTENDED;
      }
    }
  }
}
THROW_IF_BACKEND_MODEL_ORT_ERROR(
    ort_api->SetSessionGraphOptimizationLevel(soptions, optimization_level));

Describe the solution you'd like
For example, add the constant 0 for ORT_DISABLE_ALL.

Describe alternatives you've considered
Unfortunately there are no alternatives, short of a custom backend rebuild.

Additional context
I can add a PR. What do you think?
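
For reference, this is the setting being requested as it looks in ORT's Python API; the ask is for the backend's level mapping to accept an additional value (for example 0) and forward ORT_DISABLE_ALL:

import onnxruntime as ort

so = ort.SessionOptions()
# Keeps the Dropout node (and everything else) exactly as exported.
so.graph_optimization_level = ort.GraphOptimizationLevel.ORT_DISABLE_ALL
sess = ort.InferenceSession("model.onnx", sess_options=so)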

Error in onnxruntime-openvino backend when run with Triton

Description
The OnnxRT-OpenVINO backend produces errors when run with Triton. The error shows up when running the BERT ONNX model from the zoo. However, when the same model is run from a Jupyter notebook outside of Triton with the OnnxRT-OpenVINO backend, it produces the desired outputs.

Triton Information
Triton server container v21.10

Are you using the Triton container or did you build it yourself? - using container v21.10

To Reproduce

  1. Download the BERT onnx model from the onnx zoo
  2. The following is the config.pbtxt which uses the Openvino accelerator
name: "bert_onnx_cpu_i0"
platform: "onnxruntime_onnx"
max_batch_size: 16
input {
  name: "unique_ids_raw_output___9:0"
  data_type: TYPE_INT64
  dims: 1
  reshape {
  }
}
input {
  name: "segment_ids:0"
  data_type: TYPE_INT64
  dims: 256
}
input {
  name: "input_mask:0"
  data_type: TYPE_INT64
  dims: 256
}
input {
  name: "input_ids:0"
  data_type: TYPE_INT64
  dims: 256
}
output {
  name: "unstack:1"
  data_type: TYPE_FP32
  dims: 256
}
output {
  name: "unstack:0"
  data_type: TYPE_FP32
  dims: 256
}
output {
  name: "unique_ids:0"
  data_type: TYPE_INT64
  dims: 1
  reshape {
  }
}
instance_group {
  count: 2
  kind: KIND_CPU
}
dynamic_batching {
  preferred_batch_size: 2
  max_queue_delay_microseconds: 300
}
optimization {
  execution_accelerators {
    cpu_execution_accelerator {
      name: "openvino"
    }
  }
}
  3. Run the perf_analyzer on the Triton-hosted model and get the following error
2021-12-06 20:30:49.669 INFO[perf_analyzer.py:258] Running perf_analyzer ['perf_analyzer', '-m', 'bert_onnx_cpu_i1', '-b', '1', '-u', 'localhost:8001', '-i', 'grpc', '--measurement-interval', '10000', '--concurrency-range', '1', '--measurement-mode', 'time_windows'] failed with exit status 1 : *** Measurement Settings ***
  Batch size: 1
  Using "time_windows" mode for stabilization
  Measurement window: 10000 msec
  Using synchronous calls for inference
  Stabilizing using average latency

Request concurrency: 1
Failed to maintain requested inference load. Worker thread(s) failed to generate concurrent requests.
Thread [0] had error: onnx runtime error 6: Non-zero status code returned while running OpenVINO-EP-subgraph_5 node. Name:'OpenVINOExecutionProvider_OpenVINO-EP-subgraph_5_1' Status Message: Cannot find blob with name: input_ids:0

Triton-OnnxRT-TRT performance issue

Description
I downloaded the YOLOv3 model weights from here. Then, using the TensorRT sample scripts, I was able to get the corresponding ONNX model file. The obtained ONNX model file is similar to the one downloaded from the ONNX model zoo (which uses the same weights but converted using keras2onnx).

Next, I ran the perf analyzer on this onnx model using different backends and got the following:

  1. Triton-ONNXRT-CUDA: Used the .onnx model file, ran with the onnxruntime backend, and got the following output
Inferences/Second vs. Client Average Batch Latency
Concurrency: 1, throughput: 0.6 infer/sec, latency 1498616 usec
Concurrency: 2, throughput: 0.8 infer/sec, latency 2237485 usec
Concurrency: 3, throughput: 0.6 infer/sec, latency 3406846 usec
Concurrency: 4, throughput: 0.6 infer/sec, latency 4570913 usec
  2. Triton-ONNXRT-TRT: Used the .onnx model file but added the GPU accelerator as tensorrt (still ran with the onnxruntime backend) and got the following output
Inferences/Second vs. Client Average Batch Latency
Concurrency: 1, throughput: 1.2 infer/sec, latency 854637 usec
Concurrency: 2, throughput: 2 infer/sec, latency 1011748 usec
Concurrency: 3, throughput: 1.8 infer/sec, latency 1516845 usec
Concurrency: 4, throughput: 1.8 infer/sec, latency 2023850 usec
  3. Triton-TRT: Converted the .onnx file to a .trt file, ran with the TensorRT backend, and got the following
    Inferences/Second vs. Client Average Batch Latency
Concurrency: 1, throughput: 34.4 infer/sec, latency 29134 usec
Concurrency: 2, throughput: 66 infer/sec, latency 30218 usec
Concurrency: 3, throughput: 64.6 infer/sec, latency 46344 usec
Concurrency: 4, throughput: 70.8 infer/sec, latency 56346 usec

Why is the performance of the Triton-OnnxRT-TRT backend slow compared to the Triton-TRT backend? I used a Quadro RTX 8000 (same Turing architecture as T4) for this experiment.

Triton Information
NGC container v20.12

[E:onnxruntime:, sequential_executor.cc:333 Execute]

Any Idea? Please help. Thank you so much.

[E:onnxruntime:, sequential_executor.cc:333 Execute] Non-zero status code returned while running Add node. Name:'Add_1103' Status Message: Add_1103: right operand cannot broadcast on dim 3 LeftShape: {1,3,20,20,2}, RightShape: {1,1,21,12,2}
ERROR: infer_trtis_server.cpp:258 Triton: TritonServer response error received., triton_err_str:Internal, err_msg:onnxruntime execute failure 1: Non-zero status code returned while running Add node. Name:'Add_1103' Status Message: Add_1103: right operand cannot broadcast on dim 3 LeftShape: {1,3,20,20,2}, RightShape: {1,1,21,12,2}
ERROR: infer_trtis_backend.cpp:586 TRTIS server failed to parse response with request-id:0 model:
ERROR: infer_trtis_backend.cpp:341 failed to specify dims after running inference failed on model:weapon, nvinfer error:NVDSINFER_TRTIS_ERROR


Cannot build `r22.01` onnxruntime_backend with OpenVino

Description
I was unable to build the onnxruntime_backend with OpenVino for Triton Inference Server r22.01 using compatible ONNXRuntime and OpenVino versions (from Triton Inference Server compatibility matrix).

Would you mind helping me out in building the onnxruntime_backend with OpenVino?

Triton Information
r22.01, building custom container for OpenVino.

To Reproduce

Steps (from Dockerfile):

# 1. Build OpenVino Runtime for ONNX
# NOTE: Based on https://github.com/openvinotoolkit/openvino/wiki/BuildingCode
RUN git clone --branch ${OPENVINO_VERSION} https://github.com/openvinotoolkit/openvino.git \
    && cd /workspace/openvino \
    && git submodule update --init --recursive \
    && cd /workspace/openvino \
    && chmod +x install_build_dependencies.sh \
    && ./install_build_dependencies.sh \
    && mkdir build \
    && cd /workspace/openvino/build \
    && cmake \
        -DCMAKE_BUILD_TYPE=Release \
        -DENABLE_LTO=ON \
        -DENABLE_MKL_DNN=ON \
        -DENABLE_CLDNN=OFF \
        -DENABLE_OPENCV=OFF \
        -DPYTHON_LIBRARY=/usr/lib/x86_64-linux-gnu/ \
        -DPYTHON_EXECUTABLE=`which python3.7` \
        -DPYTHON_INCLUDE_DIR=/usr/include/python3.7 \
        -DOpenCV_DIR=/opt/opencv \
        .. \
    && make --jobs=$(nproc --all) \
    && mkdir -p /opt/intel/openvino \
    && cmake --install . --prefix /opt/intel/openvino

# 2. Build onnxruntime
# NOTE: Based on https://onnxruntime.ai/docs/build/inferencing#cpu
# NOTE: ENV based on https://github.com/microsoft/onnxruntime/blob/0ae0f29f140c5d7b4077df024da75abd30367e58/tools/ci_build/github/linux/docker/Dockerfile.ubuntu_openvino#L15
ENV INTEL_OPENVINO_DIR /opt/intel/openvino
ENV LD_LIBRARY_PATH $INTEL_OPENVINO_DIR/deployment_tools/inference_engine/lib/intel64:$INTEL_OPENVINO_DIR/deployment_tools/ngraph/lib:$INTEL_OPENVINO_DIR/deployment_tools/inference_engine/external/tbb/lib:/usr/local/openblas/lib:$LD_LIBRARY_PATH
ENV InferenceEngine_DIR $INTEL_OPENVINO_DIR/deployment_tools/inference_engine/share
ENV ngraph_DIR $INTEL_OPENVINO_DIR/deployment_tools/ngraph/cmake
ENV PYTHONPATH $INTEL_OPENVINO_DIR/tools:$PYTHONPATH
ENV IE_PLUGINS_PATH $INTEL_OPENVINO_DIR/deployment_tools/inference_engine/lib/intel64
RUN git clone --recursive --branch v${ONNXRUNTIME_VERSION} https://github.com/Microsoft/onnxruntime \
    && cd /workspace/onnxruntime \
    && ./build.sh \
        --config Release \
        --build_shared_lib \
        --parallel \
        --use_openvino CPU_FP32

# 3. Build ONNX Backend for Nvidia Triton Inference Server
# NOTE: Based on https://github.com/triton-inference-server/onnxruntime_backend/tree/r22.01 
RUN git clone --branch r${TRITON_VERSION} https://github.com/triton-inference-server/onnxruntime_backend.git \
    && mkdir /workspace/onnxruntime_backend/build \
    && cd /workspace/onnxruntime_backend/build \
    && cmake -DTRITON_ONNXRUNTIME_DOCKER_BUILD=0 \
        -DTRITON_BACKEND_REPO_TAG=r${TRITON_VERSION} \
        -DTRITON_CORE_REPO_TAG=r${TRITON_VERSION} \
        -DTRITON_COMMON_REPO_TAG=r${TRITON_VERSION} \
        -DTRITON_ONNXRUNTIME_INCLUDE_PATHS=/workspace/onnxruntime/include/onnxruntime/core/session \
        -DTRITON_ONNXRUNTIME_LIB_PATHS=/workspace/onnxruntime/build/Linux/Release \
        -DCMAKE_INSTALL_PREFIX:PATH=`pwd`/install \
        -DTRITON_ENABLE_GPU=0 \
        -DTRITON_BUILD_ONNXRUNTIME_VERSION=${ONNXRUNTIME_VERSION} \
        -DTRITON_BUILD_CONTAINER_VERSION=${TRITON_CONTAINER_VERSION} \
        -DTRITON_ENABLE_ONNXRUNTIME_OPENVINO=ON \
        -DTRITON_BUILD_ONNXRUNTIME_OPENVINO_VERSION=${OPENVINO_VERSION} .. \
    && make install

The ONNX backend build fails due to a missing InferenceEngine.

Expected behavior
The ONNX backend should build and run without issues.

Guidance on building Onnx backend without docker

The current "build without Docker" documentation works on the latest release of Triton, but the guidance on building the ONNX backend is very vague:

"Some of the backends may use Docker as part of their build (for example ONNX Runtime and OpenVINO). If you don't want to use Docker in those cases you must consult the build process for those backends."

In looking at the CMake file and the README for the ONNX backend, there are still references to containers in the cmake step:
-DTRITON_BUILD_CONTAINER_VERSION

So it's not very clear for building without a container.

With the assumption that all dependencies for Triton are met (as we know they are, since we are able to build from source), and that the ONNX Runtime is built and available, is it possible to get more clarity on the actual process for building the backend without a container?

tritonclient.utils.InferenceServerException: [StatusCode.INTERNAL] onnx runtime error 2: not enough space: expected 270080, got 261760

Description
When I enabled max_queue_delay_microseconds to improve the response speed of the model, I found that there were occasional errors. I set max_queue_delay_microseconds to 70000 and then sent three tensors of different lengths to the service at the same time. The first request was successful and the other two requests failed. If I don't configure max_queue_delay_microseconds, it always succeeds.
config.pbtxt:


name: "encoder"
backend: "onnxruntime"
default_model_filename: "encoder.onnx"

max_batch_size: 32
input [
  {
    name: "speech"
    data_type: TYPE_FP32
    dims: [-1, 80]
  },
  {
    name: "speech_lengths"
    data_type: TYPE_INT32
    dims: [1]
    reshape: { shape: [ ] }
  }
]

output [
  {
    name: "encoder_out_lens"
    data_type: TYPE_INT32
    dims: [1]
    reshape: { shape: [ ] }
  },
  {
    name: "beam_log_probs"
    data_type: TYPE_FP32
    dims: [-1, 10]
  },
  {
    name: "beam_log_probs_idx"
    data_type: TYPE_INT64
    dims: [-1, 10]
  }
]

dynamic_batching {
    max_queue_delay_microseconds: 70000
}


instance_group [
    {
      count: 1
      kind: KIND_GPU
    }
]

triton log:


I0107 07:17:51.879632 48160 grpc_server.cc:3157] Process for ModelInferHandler, rpc_ok=1, 1 step START
I0107 07:17:51.879667 48160 grpc_server.cc:3150] New request handler for ModelInferHandler, 4
I0107 07:17:51.879680 48160 model_repository_manager.cc:615] GetInferenceBackend() 'encoder' version -1
I0107 07:17:51.879691 48160 model_repository_manager.cc:615] GetInferenceBackend() 'encoder' version -1
I0107 07:17:51.879739 48160 infer_request.cc:547] prepared: [0x0x7fa4f40036a0] request id: , model: encoder, requested version: -1, actual version: 1, flags: 0x0, correlation id: 0, batch size: 1, priority: 0, timeout (us): 0
original inputs:
[0x0x7fa4f4003aa8] input: speech_lengths, type: INT32, original shape: [1,1], batch + shape: [1], shape: []
[0x0x7fa4f4003968] input: speech, type: FP32, original shape: [1,500,80], batch + shape: [1,500,80], shape: [500,80]
override inputs:
inputs:
[0x0x7fa4f4003968] input: speech, type: FP32, original shape: [1,500,80], batch + shape: [1,500,80], shape: [500,80]
[0x0x7fa4f4003aa8] input: speech_lengths, type: INT32, original shape: [1,1], batch + shape: [1], shape: []
original requested outputs:
beam_log_probs
beam_log_probs_idx
encoder_out_lens
requested outputs:
beam_log_probs
beam_log_probs_idx
encoder_out_lens

I0107 07:17:51.881097 48160 grpc_server.cc:3157] Process for ModelInferHandler, rpc_ok=1, 4 step START
I0107 07:17:51.881111 48160 grpc_server.cc:3150] New request handler for ModelInferHandler, 5
I0107 07:17:51.881116 48160 model_repository_manager.cc:615] GetInferenceBackend() 'encoder' version -1
I0107 07:17:51.881121 48160 model_repository_manager.cc:615] GetInferenceBackend() 'encoder' version -1
I0107 07:17:51.881133 48160 infer_request.cc:547] prepared: [0x0x7fa4f4026ba0] request id: , model: encoder, requested version: -1, actual version: 1, flags: 0x0, correlation id: 0, batch size: 1, priority: 0, timeout (us): 0
original inputs:
[0x0x7fa4f4026f88] input: speech_lengths, type: INT32, original shape: [1,1], batch + shape: [1], shape: []
[0x0x7fa4f4026e68] input: speech, type: FP32, original shape: [1,422,80], batch + shape: [1,422,80], shape: [422,80]
override inputs:
inputs:
[0x0x7fa4f4026e68] input: speech, type: FP32, original shape: [1,422,80], batch + shape: [1,422,80], shape: [422,80]
[0x0x7fa4f4026f88] input: speech_lengths, type: INT32, original shape: [1,1], batch + shape: [1], shape: []
original requested outputs:
beam_log_probs
beam_log_probs_idx
encoder_out_lens
requested outputs:
beam_log_probs
beam_log_probs_idx
encoder_out_lens

I0107 07:17:51.881257 48160 onnxruntime.cc:2325] model encoder, instance encoder_0, executing 1 requests
I0107 07:17:51.881270 48160 onnxruntime.cc:1277] TRITONBACKEND_ModelExecute: Running encoder_0 with 1 requests
I0107 07:17:51.883291 48160 grpc_server.cc:3157] Process for ModelInferHandler, rpc_ok=1, 5 step START
I0107 07:17:51.883303 48160 grpc_server.cc:3150] New request handler for ModelInferHandler, 6
I0107 07:17:51.883308 48160 model_repository_manager.cc:615] GetInferenceBackend() 'encoder' version -1
I0107 07:17:51.883314 48160 model_repository_manager.cc:615] GetInferenceBackend() 'encoder' version -1
I0107 07:17:51.883327 48160 infer_request.cc:547] prepared: [0x0x7fa4f4028bc0] request id: , model: encoder, requested version: -1, actual version: 1, flags: 0x0, correlation id: 0, batch size: 1, priority: 0, timeout (us): 0
original inputs:
[0x0x7fa4f4028fa8] input: speech_lengths, type: INT32, original shape: [1,1], batch + shape: [1], shape: []
[0x0x7fa4f4028e88] input: speech, type: FP32, original shape: [1,396,80], batch + shape: [1,396,80], shape: [396,80]
override inputs:
inputs:
[0x0x7fa4f4028e88] input: speech, type: FP32, original shape: [1,396,80], batch + shape: [1,396,80], shape: [396,80]
[0x0x7fa4f4028fa8] input: speech_lengths, type: INT32, original shape: [1,1], batch + shape: [1], shape: []
original requested outputs:
beam_log_probs
beam_log_probs_idx
encoder_out_lens
requested outputs:
beam_log_probs
beam_log_probs_idx
encoder_out_lens

2022-01-07 07:17:51.913799887 [I:onnxruntime:log, bfc_arena.cc:26 BFCArena] Creating BFCArena for Cuda with following configs: initial_chunk_size_bytes: 1048576 max_dead_bytes_per_chunk: 134217728 initial_growth_chunk_size_bytes: 2097152 memory limit: 18446744073709551615 arena_extend_strategy: 0
2022-01-07 07:17:51.913824649 [V:onnxruntime:log, bfc_arena.cc:62 BFCArena] Creating 21 bins of max chunk size 256 to 268435456
2022-01-07 07:17:51.913839024 [I:onnxruntime:log, bfc_arena.cc:306 AllocateRawInternal] Extending BFCArena for Cuda. bin_num:9 (requested) num_bytes: 160000 (actual) rounded_bytes:160000
2022-01-07 07:17:51.914845468 [I:onnxruntime:log, bfc_arena.cc:186 Extend] Extended allocation by 1048576 bytes.
2022-01-07 07:17:51.914854356 [I:onnxruntime:log, bfc_arena.cc:189 Extend] Total allocated bytes: 1048576
2022-01-07 07:17:51.914859897 [I:onnxruntime:log, bfc_arena.cc:192 Extend] Allocated memory at 0x7fa4c6800000 to 0x7fa4c6900000
2022-01-07 07:17:51.915355558 [I:onnxruntime:, sequential_executor.cc:155 Execute] Begin execution
2022-01-07 07:17:51.915486867 [I:onnxruntime:log, bfc_arena.cc:306 AllocateRawInternal] Extending BFCArena for Cuda. bin_num:16 (requested) num_bytes: 19888128 (actual) rounded_bytes:19888128
2022-01-07 07:17:51.916460004 [I:onnxruntime:log, bfc_arena.cc:186 Extend] Extended allocation by 33554432 bytes.
2022-01-07 07:17:51.916473358 [I:onnxruntime:log, bfc_arena.cc:189 Extend] Total allocated bytes: 34603008
2022-01-07 07:17:51.916478883 [I:onnxruntime:log, bfc_arena.cc:192 Extend] Allocated memory at 0x7fa4c4000000 to 0x7fa4c6000000
2022-01-07 07:17:51.968223778 [I:onnxruntime:log, bfc_arena.cc:256 Reserve] Reserving memory in BFCArena for Cuda size: 33554432
2022-01-07 07:17:53.088785964 [I:onnxruntime:log, bfc_arena.cc:306 AllocateRawInternal] Extending BFCArena for Cuda. bin_num:14 (requested) num_bytes: 4825088 (actual) rounded_bytes:4825088
2022-01-07 07:17:53.089515990 [I:onnxruntime:log, bfc_arena.cc:186 Extend] Extended allocation by 33554432 bytes.
2022-01-07 07:17:53.089527911 [I:onnxruntime:log, bfc_arena.cc:189 Extend] Total allocated bytes: 68157440
2022-01-07 07:17:53.089534390 [I:onnxruntime:log, bfc_arena.cc:192 Extend] Allocated memory at 0x7fa492000000 to 0x7fa494000000
2022-01-07 07:17:53.089938002 [I:onnxruntime:log, bfc_arena.cc:256 Reserve] Reserving memory in BFCArena for Cuda size: 33554432
2022-01-07 07:17:53.162083719 [I:onnxruntime:log, bfc_arena.cc:306 AllocateRawInternal] Extending BFCArena for CUDA_CPU. bin_num:0 (requested) num_bytes: 32 (actual) rounded_bytes:256
2022-01-07 07:17:53.162112760 [I:onnxruntime:log, bfc_arena.cc:186 Extend] Extended allocation by 1048576 bytes.
2022-01-07 07:17:53.162119377 [I:onnxruntime:log, bfc_arena.cc:189 Extend] Total allocated bytes: 1048576
2022-01-07 07:17:53.162125287 [I:onnxruntime:log, bfc_arena.cc:192 Extend] Allocated memory at 0x7fa456649a80 to 0x7fa456749a80
2022-01-07 07:17:53.164154593 [I:onnxruntime:log, bfc_arena.cc:256 Reserve] Reserving memory in BFCArena for Cuda size: 33554432
2022-01-07 07:17:53.169864812 [I:onnxruntime:log, bfc_arena.cc:306 AllocateRawInternal] Extending BFCArena for CudaPinned. bin_num:0 (requested) num_bytes: 16 (actual) rounded_bytes:256
2022-01-07 07:17:53.169924722 [I:onnxruntime:log, bfc_arena.cc:186 Extend] Extended allocation by 1048576 bytes.
2022-01-07 07:17:53.169931112 [I:onnxruntime:log, bfc_arena.cc:189 Extend] Total allocated bytes: 1048576
2022-01-07 07:17:53.169936752 [I:onnxruntime:log, bfc_arena.cc:192 Extend] Allocated memory at 0x7fa536e00400 to 0x7fa536f00400
2022-01-07 07:17:53.170068013 [I:onnxruntime:log, bfc_arena.cc:256 Reserve] Reserving memory in BFCArena for Cuda size: 33554432
2022-01-07 07:17:53.409086735 [I:onnxruntime:log, bfc_arena.cc:256 Reserve] Reserving memory in BFCArena for Cuda size: 33554432
2022-01-07 07:17:53.412016989 [I:onnxruntime:log, bfc_arena.cc:256 Reserve] Reserving memory in BFCArena for Cuda size: 33554432
2022-01-07 07:17:53.414773028 [I:onnxruntime:log, bfc_arena.cc:256 Reserve] Reserving memory in BFCArena for Cuda size: 33554432
2022-01-07 07:17:53.613093211 [I:onnxruntime:log, bfc_arena.cc:256 Reserve] Reserving memory in BFCArena for Cuda size: 33554432
2022-01-07 07:17:53.624830871 [I:onnxruntime:log, bfc_arena.cc:256 Reserve] Reserving memory in BFCArena for Cuda size: 33554432
2022-01-07 07:17:53.636706915 [I:onnxruntime:log, bfc_arena.cc:256 Reserve] Reserving memory in BFCArena for Cuda size: 33554432
2022-01-07 07:17:53.760878275 [I:onnxruntime:log, bfc_arena.cc:256 Reserve] Reserving memory in BFCArena for Cuda size: 33554432
2022-01-07 07:17:53.768211342 [I:onnxruntime:log, bfc_arena.cc:256 Reserve] Reserving memory in BFCArena for Cuda size: 33554432
2022-01-07 07:17:53.774060870 [I:onnxruntime:log, bfc_arena.cc:256 Reserve] Reserving memory in BFCArena for Cuda size: 33554432
2022-01-07 07:17:53.892082624 [I:onnxruntime:log, bfc_arena.cc:256 Reserve] Reserving memory in BFCArena for Cuda size: 33554432
2022-01-07 07:17:53.902492334 [I:onnxruntime:log, bfc_arena.cc:256 Reserve] Reserving memory in BFCArena for Cuda size: 33554432
2022-01-07 07:17:53.924870691 [I:onnxruntime:log, bfc_arena.cc:256 Reserve] Reserving memory in BFCArena for Cuda size: 33554432
2022-01-07 07:17:54.050035890 [I:onnxruntime:log, bfc_arena.cc:256 Reserve] Reserving memory in BFCArena for Cuda size: 33554432
2022-01-07 07:17:54.053538891 [I:onnxruntime:log, bfc_arena.cc:256 Reserve] Reserving memory in BFCArena for Cuda size: 33554432
2022-01-07 07:17:54.056429109 [I:onnxruntime:log, bfc_arena.cc:256 Reserve] Reserving memory in BFCArena for Cuda size: 33554432
2022-01-07 07:17:54.182095031 [I:onnxruntime:log, bfc_arena.cc:256 Reserve] Reserving memory in BFCArena for Cuda size: 33554432
2022-01-07 07:17:54.184785148 [I:onnxruntime:log, bfc_arena.cc:256 Reserve] Reserving memory in BFCArena for Cuda size: 33554432
2022-01-07 07:17:54.186824910 [I:onnxruntime:log, bfc_arena.cc:256 Reserve] Reserving memory in BFCArena for Cuda size: 33554432
2022-01-07 07:17:54.330282315 [I:onnxruntime:log, bfc_arena.cc:256 Reserve] Reserving memory in BFCArena for Cuda size: 33554432
2022-01-07 07:17:54.350548533 [I:onnxruntime:log, bfc_arena.cc:256 Reserve] Reserving memory in BFCArena for Cuda size: 33554432
2022-01-07 07:17:54.353424064 [I:onnxruntime:log, bfc_arena.cc:256 Reserve] Reserving memory in BFCArena for Cuda size: 33554432
2022-01-07 07:17:54.482296545 [I:onnxruntime:log, bfc_arena.cc:256 Reserve] Reserving memory in BFCArena for Cuda size: 33554432
2022-01-07 07:17:54.484997936 [I:onnxruntime:log, bfc_arena.cc:256 Reserve] Reserving memory in BFCArena for Cuda size: 33554432
2022-01-07 07:17:54.487071625 [I:onnxruntime:log, bfc_arena.cc:256 Reserve] Reserving memory in BFCArena for Cuda size: 33554432
2022-01-07 07:17:54.607597040 [I:onnxruntime:log, bfc_arena.cc:256 Reserve] Reserving memory in BFCArena for Cuda size: 33554432
2022-01-07 07:17:54.610282649 [I:onnxruntime:log, bfc_arena.cc:256 Reserve] Reserving memory in BFCArena for Cuda size: 33554432
2022-01-07 07:17:54.612668352 [I:onnxruntime:log, bfc_arena.cc:256 Reserve] Reserving memory in BFCArena for Cuda size: 33554432
2022-01-07 07:17:54.733027130 [I:onnxruntime:log, bfc_arena.cc:256 Reserve] Reserving memory in BFCArena for Cuda size: 33554432
2022-01-07 07:17:54.736369952 [I:onnxruntime:log, bfc_arena.cc:256 Reserve] Reserving memory in BFCArena for Cuda size: 33554432
2022-01-07 07:17:54.739264251 [I:onnxruntime:log, bfc_arena.cc:256 Reserve] Reserving memory in BFCArena for Cuda size: 33554432
2022-01-07 07:17:54.857939191 [I:onnxruntime:log, bfc_arena.cc:256 Reserve] Reserving memory in BFCArena for Cuda size: 33554432
2022-01-07 07:17:54.861231945 [I:onnxruntime:log, bfc_arena.cc:256 Reserve] Reserving memory in BFCArena for Cuda size: 33554432
2022-01-07 07:17:54.912998291 [I:onnxruntime:log, bfc_arena.cc:256 Reserve] Reserving memory in BFCArena for Cuda size: 33554432
2022-01-07 07:17:55.031050395 [I:onnxruntime:log, bfc_arena.cc:256 Reserve] Reserving memory in BFCArena for Cuda size: 33554432
I0107 07:17:55.133362 48160 infer_response.cc:165] add response output: output: beam_log_probs, type: FP32, shape: [1,124,10]
I0107 07:17:55.133413 48160 grpc_server.cc:2286] GRPC: using buffer for 'beam_log_probs', size: 4960, addr: 0x7fa455abc130
I0107 07:17:55.133427 48160 infer_response.cc:165] add response output: output: beam_log_probs_idx, type: INT64, shape: [1,124,10]
I0107 07:17:55.133435 48160 grpc_server.cc:2286] GRPC: using buffer for 'beam_log_probs_idx', size: 9920, addr: 0x7fa455abd4a0
I0107 07:17:55.133444 48160 infer_response.cc:165] add response output: output: encoder_out_lens, type: INT32, shape: [1]
I0107 07:17:55.133450 48160 grpc_server.cc:2286] GRPC: using buffer for 'encoder_out_lens', size: 4, addr: 0x7fa455ab8200
I0107 07:17:55.133460 48160 grpc_server.cc:3310] ModelInferHandler::InferResponseComplete, 1 step ISSUED
I0107 07:17:55.133482 48160 grpc_server.cc:2321] GRPC free: size 4960, addr 0x7fa455abc130
I0107 07:17:55.133487 48160 grpc_server.cc:2321] GRPC free: size 9920, addr 0x7fa455abd4a0
I0107 07:17:55.133491 48160 grpc_server.cc:2321] GRPC free: size 4, addr 0x7fa455ab8200
I0107 07:17:55.133617 48160 grpc_server.cc:2879] ModelInferHandler::InferRequestComplete
I0107 07:17:55.133633 48160 grpc_server.cc:3157] Process for ModelInferHandler, rpc_ok=1, 1 step COMPLETEI0107 07:17:55.133680 48160 onnxruntime.cc:2325] model encoder, instance encoder_0, executing 2 requests
I0107 07:17:55.133687 48160 onnxruntime.cc:1277] TRITONBACKEND_ModelExecute: Running encoder_0 with 2 requests

I0107 07:17:55.133739 48160 pinned_memory_manager.cc:161] pinned memory allocation: size 261760, addr 0x7fa6f8000090
I0107 07:17:55.133733 48160 grpc_server.cc:2195] Done for ModelInferHandler, 1
I0107 07:17:55.133953 48160 grpc_server.cc:3310] ModelInferHandler::InferResponseComplete, 4 step ISSUED
I0107 07:17:55.134012 48160 grpc_server.cc:2879] ModelInferHandler::InferRequestComplete
I0107 07:17:55.134021 48160 grpc_server.cc:3310] ModelInferHandler::InferResponseComplete, 5 step ISSUED
I0107 07:17:55.134026 48160 grpc_server.cc:3157] Process for ModelInferHandler, rpc_ok=1, 4 step COMPLETE
I0107 07:17:55.134036 48160 grpc_server.cc:2195] Done for ModelInferHandler, 4
I0107 07:17:55.134048 48160 grpc_server.cc:2879] ModelInferHandler::InferRequestComplete
I0107 07:17:55.134070 48160 pinned_memory_manager.cc:190] pinned memory deallocation: addr 0x7fa6f8000090
I0107 07:17:55.134077 48160 grpc_server.cc:3157] Process for ModelInferHandler, rpc_ok=1, 5 step COMPLETE
I0107 07:17:55.134084 48160 grpc_server.cc:2195] Done for ModelInferHandler, 5
0107 07:21:34.168616 48276 pb_stub.cc:777] Non-graceful termination detected. 
0107 07:21:34.168644 48173 pb_stub.cc:777] Non-graceful termination detected. 

client error log:
tritonclient.utils.InferenceServerException: [StatusCode.INTERNAL] onnx runtime error 2: not enough space: expected 270080, got 261760
Triton Information
triton server 21.11-py

After I turn off max_queue_delay_microseconds, Triton executes these three requests in sequence and everything is normal. However, once max_queue_delay_microseconds is configured, it seems that Triton eventually forces ONNX Runtime to handle two requests together and pins the wrong memory size. As can be seen from the client log, my second request has shape [1,422,80] and the expected size is 270080, but ONNX Runtime can't get enough space because Triton pinned only 261760 bytes. This is very confusing.
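
Until the batching behavior is fixed, one possible client-side workaround (an untested sketch, not from the original report) is to pad every speech input to a common length so that batched requests share a single shape, while still sending the true length in speech_lengths:

import numpy as np

def pad_speech(speech, target_len=500):
    # speech: float32 array of shape [1, T, 80]; zero-pad the time axis up to target_len.
    pad = target_len - speech.shape[1]
    if pad > 0:
        speech = np.pad(speech, ((0, 0), (0, pad), (0, 0)))
    return speech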

Re-use generated TensorRT plan when instance groups or multi-gpu

Is your feature request related to a problem? Please describe.
When using the TensorRT accelerator for an ONNX model, if the model is duplicated either through instance groups or across GPUs, then the ONNX -> TensorRT conversion process is repeated each time. This takes 4 minutes per conversion for my model, so it quickly becomes excessive.

Describe the solution you'd like
When using instance groups then the TensorRT model is only generated once, and the resulting TensorRT plan is duplicated for the other instances.

I understand this may be a little harder to do across GPUs, but as long as they have identical specs it should be possible.

Describe alternatives you've considered

  1. Just waiting (which for me is then about 16 minutes for Triton to start).
  2. Converting the model to TensorRT myself beforehand, but we deploy this setup to different models of GPU so this can't be a build step for us.

How to make the ONNX model support batching

Description
How do I configure the ONNX model so that it can use batching?

Triton Information
E1207 07:13:40.323463 1639 model_repository_manager.cc:1890] Poll failed for model directory 'cat_dog_onnx_batch': model input has different size for dims and reshape for cat_dog_onnx_batch

Are you using the Triton container or did you build it yourself?
the container

To Reproduce
My ONNX model is converted from a Caffe model with the input setting below.

input_param{
    shape: {dim: 2 dim: 3 dim: 32 dim: 32}
}

My config.pbtxt:

name: "cat_dog_onnx_batch"
platform: "onnxruntime_onnx"
max_batch_size : 2
input [
  {
    name: "data_input"
    data_type: TYPE_FP32
    format: FORMAT_NCHW
    dims: [ 3, 32, 32 ]
    reshape { shape: [ 2, 3, 32, 32 ] }
  }
]
output [
  {
    name: "prob_Y"
    data_type: TYPE_FP32
    dims: [ 2 ]
    reshape { shape: [ 2, 2 ] }
    label_filename: "cat_dog_labels.txt"
  }
]

instance_group [
{
    count: 2
    kind: KIND_CPU
}]
dynamic_batching {
}

Expected behavior
dynamic_batching can be used in a model.onnx
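
Part of the problem is that the Caffe-converted model hard-codes a batch dimension of 2, while the reshape property in config.pbtxt is applied per request (without the batch dimension) and must preserve the element count of dims, hence the "different size for dims and reshape" error. One possible fix (a sketch, assuming the file is named model.onnx) is to make the model's leading dimension dynamic and then drop the reshape blocks, keeping dims: [ 3, 32, 32 ] with max_batch_size: 2:

import onnx

model = onnx.load("model.onnx")
# Replace the fixed batch dimension with a symbolic one on every declared input/output.
for value in list(model.graph.input) + list(model.graph.output):
    value.type.tensor_type.shape.dim[0].dim_param = "batch"
onnx.save(model, "model_dynamic_batch.onnx")

# Note: if internal nodes (e.g. Reshape constants) also hard-code the batch size,
# the model needs to be re-exported with a dynamic batch dimension instead.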

Model hangs when warming up for tensorrt optimization

Description
ONNX with tensorrt optimization hangs/times out when launched with warmup.

Triton Information
What version of Triton are you using? r21.12

Are you using the Triton container or did you build it yourself? Using Triton container nvcr.io/nvidia/tritonserver:21.12-py3

To Reproduce
We have an image embedding network that we compiled from CLIP with ViT-b32 model.

We have a model config:

name: "clip_embedding_gpu"
platform: "onnxruntime_onnx"
max_batch_size: 128
input [{
  name: "IMAGE_PREPROCESSED"
  data_type: TYPE_FP32
  dims: [3, 224, 224]
}]
output [{
  name: "IMAGE_EMBEDDING"
  data_type: TYPE_FP16
  dims: [512]
}]

instance_group {
  kind: KIND_GPU
}
dynamic_batching {
  max_queue_delay_microseconds: 25
}
optimization {
  execution_accelerators {
    gpu_execution_accelerator: [
      {
        name: "tensorrt"
        parameters [
          {
            key: "precision_mode"
            value: "FP16"
          }
        ]
      }
    ]
  }
}

When starting up, we noticed that the model was timing out for about 6-7 runs of perf_analyzer before it started to produce results. We saw about a 25% speedup.

Now, when we enable warm up:

model_warmup [{
  name: "warmup"
  batch_size: 128
  inputs: [{
    key: "IMAGE_PREPROCESSED"
    value: {
      data_type: TYPE_FP32
      dims: [3, 224, 224]
      zero_data: true
    }
  }]
}]

We saw a little delay in Triton server startup, which is expected. However, the perf_analyzer run now times out without ever recovering.

Expected behavior
Once tritonserver started, perf_analyzer should show performance at an optimal level.

Model with Triton Inference Server is 3x slower than the model in ORT directly (using GPU in both)

Description
I run the model on Triton Inference Server and also on ORT directly. The inference time on Triton Inference Server is 3 ms, but it is 1 ms on ORT. In addition, there isn't any communication overhead while running the model on Triton Inference Server.

Triton Information
The Triton version I used is 22.01 and the ORT-GPU version is 1.9.0.

I also used the docker image.

Expected behavior
The inference time in both scenarios should be about the same.
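
One thing to check (not from the original report) is what exactly is being timed: a client-measured latency includes gRPC/HTTP serialization and the backend's input/output copies, so comparing it against a bare sess.run() loop overstates the gap, and perf_analyzer's queue/compute breakdown is a fairer reference. A minimal client-side timing sketch (model name, input name and shape are placeholders):

import time
import numpy as np
import tritonclient.grpc as grpcclient

client = grpcclient.InferenceServerClient("localhost:8001")
inp = grpcclient.InferInput("INPUT0", [1, 3, 224, 224], "FP32")
inp.set_data_from_numpy(np.zeros((1, 3, 224, 224), dtype=np.float32))

start = time.perf_counter()
for _ in range(100):
    client.infer("my_model", inputs=[inp])
print("avg client-side ms:", (time.perf_counter() - start) / 100 * 1000)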

cudnn_home not valid during build

Description
I am not able to build the ONNX Backend. I am following the build instructions in the README but the build fails at Step 17.

Triton Information
Main branch for Triton version 21.02

To Reproduce

I am running DGX OS 5 (Ubuntu 20.04).

cmake -DCMAKE_INSTALL_PREFIX:PATH=`pwd`/install -DTRITON_BUILD_ONNXRUNTIME_VERSION=1.6.0 -DTRITON_BUILD_CONTAINER_VERSION=21.02 ..
make install

Output:

Step 17/24 : RUN ./build.sh ${COMMON_BUILD_ARGS} --update --build --use_cuda --cuda_home "/usr/local/cuda"
 ---> Running in 3360f12bb769
2021-03-18 11:01:00,463 build [ERROR] - cuda_home and cudnn_home paths must be specified and valid. cuda_home='/usr/local/cuda' valid=True. cudnn_home='None' valid=False
The command '/bin/sh -c ./build.sh ${COMMON_BUILD_ARGS} --update --build --use_cuda --cuda_home "/usr/local/cuda"' returned a non-zero code: 1
make[2]: *** [CMakeFiles/ort_target.dir/build.make:81: onnxruntime/lib/libonnxruntime.so.1.6.0] Fehler 1
make[1]: *** [CMakeFiles/Makefile2:158: CMakeFiles/ort_target.dir/all] Fehler 2
make: *** [Makefile:149: all] Fehler 2

Expected behavior
I expect the build to succeed.

Build without docker

How can I build onnxruntime_backend without using a Docker container?

I have built onnxruntime from the source.

Cannot build r22.03 onnxruntime_backend with tensorrt

Description
I was unable to build the onnxruntime_backend with TensorRT for Triton Inference Server r22.03 using compatible ONNXRuntime and TensorRT versions (from the Triton Inference Server compatibility matrix).

Triton Information
r22.03

To Reproduce
follow the readme in onnxruntime_backend

cmake \
-DCMAKE_INSTALL_PREFIX:PATH=`pwd`/install \
-DTRITON_BUILD_CUDNN_HOME='/usr/lib/x86_64-linux-gnu/' \
-DTRITON_BUILD_ONNXRUNTIME_VERSION=1.10.0 -DTRITON_BUILD_CONTAINER_VERSION=22.03 -DTRITON_ENABLE_ONNXRUNTIME_TENSORRT=ON \
-DTRITON_BACKEND_REPO_TAG=r22.03 -DTRITON_CORE_REPO_TAG=r22.03 -DTRITON_COMMON_REPO_TAG=r22.03 ..

You will then see the error:
/usr/include/x86_64-linux-gnu/NvInferRuntimeCommon.h:56:10: fatal error: cuda_runtime_api.h: No such file or directory

Expected behavior
The ONNX Runtime backend should build and run without issues.

Improve autocomplete to make it more robust against partial model configuration

Is your feature request related to a problem? Please describe.
Currently, the auto-complete function does nothing if the model config provides even a single input and output.
See here: https://github.com/triton-inference-server/onnxruntime_backend/blob/main/src/onnxruntime.cc#L652-L680.
The specification of the input/output may be incomplete; it may be missing dimension or data-type fields.
However, auto-complete will still skip completing these I/O configs.

Describe the solution you'd like
Instead of skipping auto-complete logic entirely when there are inputs/outputs present, auto-complete should
still ensure that all the I/O have complete configuration and not fail.

Describe alternatives you've considered
The alternative is for the user to ensure they provide either complete config information or none at all. This is a little restrictive on the user.
For example, they cannot specify only a partial config to handle special cases and still rely on auto-complete to fill in the missing information for them.
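
For illustration (hypothetical model and tensor names), this is the kind of partial I/O specification that currently disables auto-complete entirely, even though only the dims field is missing:

input [{
  name: "INPUT0"
  data_type: TYPE_FP32
  # dims omitted on purpose; today the presence of this input causes
  # auto-complete to skip filling in the I/O configuration
}]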

perf_analyzer failing with inputs -1 dimension

Perf Analyzer fails to create a concurrency manager with the following error:

[Model Analyzer] Running perf_analyzer failed with exit status 1 : error: failed to create concurrency manager: input attention_mask contains dynamic shape, provide shapes to send along with the request

Using the following inputs which seem to be causing the issue:

    {
        name: "attention_mask"
        data_type: TYPE_BOOL
        dims: [
            -1
        ]
    },
    {
        name: "input_ids"
        data_type: TYPE_INT64
        dims: [
            -1
        ]
    }

Using r22.04 version of model analyzer.

Please suggest how we can analyze such a model.
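
Since both inputs carry a -1 (dynamic) dimension, perf_analyzer needs concrete shapes for them on the command line. A minimal sketch, using an arbitrary length of 128 and a hypothetical model name:

perf_analyzer -m my_text_model --shape attention_mask:128 --shape input_ids:128

Model Analyzer can forward such perf_analyzer flags through its own configuration; consult its documentation for the exact field name.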

failed to load onnx model with tensorrt optimization

Description
When I tried to optimize the performance of my ONNX model with TensorRT,
the Triton Inference Server failed to start up with the error message below:

2020-12-16 10:33:41.984811584 [E:onnxruntime:, inference_session.cc:1186 operator()] Exception during initialization:
/workspace/onnxruntime/onnxruntime/core/providers/tensorrt/tensorrt_execution_provider.cc:504 SubGraphCollection_t onnxruntime::TensorrtExecutionProvider::GetSupportedList(SubGraphCollection_t, int, int, const onnxruntime::Provider_GraphViewer&, bool*) const [ONNXRuntimeError] : 1 : FAIL : TensorRT input: 504 has no shape specified. Please run shape inference on the onnx model first. Details can be found in https://github.com/microsoft/onnxruntime/blob/master/docs/execution_providers/TensorRT-ExecutionProvider.md#shape-inference-for-tensorrt-subgraphs

Triton Information
What version of Triton are you using? : 20.11

Are you using the Triton container or did you build it yourself?
Triton container: tritonserver:20.11-py3

To Reproduce

model: onnx model
config.pbtxt

platform: "onnxruntime_onnx"
max_batch_size: 16
input [
    {
      name: "input_ids"
      data_type: TYPE_INT64
      dims: [ 128 ]
    },
    {
      name: "attention_mask"
      data_type: TYPE_INT64
      dims: [ 128 ]
    },
    {
      name: "token_type_ids"
      data_type: TYPE_INT64
      dims: [ 128 ]
    }
]
output [
    {
      name: "output_0"
      data_type: TYPE_FP32
      dims: [ 2 ]
    }
]
dynamic_batching {
    preferred_batch_size: [ 8, 16 ]
    max_queue_delay_microseconds: 1000
}
instance_group [
    {
        kind: KIND_GPU
        count: 2
    }
]

Triton failed to start up after adding the TensorRT optimization below to config.pbtxt:

optimization { execution_accelerators {
  gpu_execution_accelerator : [ { name : "tensorrt" } ]
}}

Expected behavior
Triton should load the model successfully.
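
As the error message suggests, running shape inference over the ONNX model before placing it in the model repository may resolve this. A minimal sketch using the onnx Python package (the model path is hypothetical; for transformer models the ORT docs linked above also describe a symbolic shape inference script):

import onnx
from onnx import shape_inference

# Load the exported model, run ONNX shape inference, and write it back in place.
model = onnx.load("model_repository/bert_model/1/model.onnx")
onnx.save(shape_inference.infer_shapes(model), "model_repository/bert_model/1/model.onnx")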

Memory leak in ONNX runtime backend

Description
Memory leak in the ONNX runtime backend.

Triton Information
What version of Triton are you using?
main branch

Are you using the Triton container or did you build it yourself?
container

To Reproduce
The L0_infer_valgrind test can be used to reproduce this. When running the test, the TEST_VALGRIND environment variable should be set to 1.
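
For reference, a sketch of the invocation, assuming the usual qa/<test>/test.sh layout of the server repository:

cd qa/L0_infer_valgrind
TEST_VALGRIND=1 bash test.sh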

Expected behavior

There should be no memory leaks.

==18067== 20,736 bytes in 1,296 blocks are definitely lost in loss record 184,459 of 184,803
==18067==    at 0x483E0F0: memalign (in /usr/lib/x86_64-linux-gnu/valgrind/vgpreload_memcheck-amd64-linux.so)
==18067==    by 0x483E212: posix_memalign (in /usr/lib/x86_64-linux-gnu/valgrind/vgpreload_memcheck-amd64-linux.so)
==18067==    by 0x14EA987A4: onnxruntime::utils::DefaultAlloc(unsigned long) (in /opt/tritonserver/backends/onnxruntime/libonnxruntime.so.1.7.1)
==18067==    by 0x14E096CDC: OrtApis::GetBoundOutputValues(OrtIoBinding const*, OrtAllocator*, OrtValue***, unsigned long*) (in /opt/tritonserver/backends/onnxruntime/libonnxruntime.so.1.7.1)
==18067==    by 0x14DD88D6A: triton::backend::onnxruntime::ModelInstanceState::ReadOutputTensors(unsigned long, std::vector<char const*, std::allocator<char const*> > const&, TRITONBACKEND_Request**, unsigned int, std::vector<TRITONBACKEND_Response*, std::allocator<TRITONBACKEND_Response*> >*) (in /opt/tritonserver/backends/onnxruntime/libtriton_onnxruntime.so)
==18067==    by 0x14DD8BA27: triton::backend::onnxruntime::ModelInstanceState::ProcessRequests(TRITONBACKEND_Request**, unsigned int) (in /opt/tritonserver/backends/onnxruntime/libtriton_onnxruntime.so)
==18067==    by 0x14DD8DCA2: TRITONBACKEND_ModelInstanceExecute (in /opt/tritonserver/backends/onnxruntime/libtriton_onnxruntime.so)
==18067==    by 0x4DE8386: std::_Function_handler<void (unsigned int, std::vector<std::unique_ptr<nvidia::inferenceserver::InferenceRequest, std::default_delete<nvidia::inferenceserver::InferenceRequest> >, std::allocator<std::unique_ptr<nvidia::inferenceserver::InferenceRequest, std::default_delete<nvidia::inferenceserver::InferenceRequest> > > >&&), nvidia::inferenceserver::TritonModel::Create(nvidia::inferenceserver::InferenceServer*, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, std::unordered_map<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::vector<std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::allocator<std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > > > >, std::hash<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::equal_to<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::allocator<std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const, std::vector<std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::allocator<std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > > > > > > > const&, std::unordered_map<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::map<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::less<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::allocator<std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > > > >, std::hash<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::equal_to<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::allocator<std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const, std::map<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::less<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::allocator<std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > > > > > > > const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, long, inference::ModelConfig const&, std::unique_ptr<nvidia::inferenceserver::TritonModel, std::default_delete<nvidia::inferenceserver::TritonModel> >*)::{lambda(unsigned int, std::vector<std::unique_ptr<nvidia::inferenceserver::InferenceRequest, std::default_delete<nvidia::inferenceserver::InferenceRequest> >, std::allocator<std::unique_ptr<nvidia::inferenceserver::InferenceRequest, 
std::default_delete<nvidia::inferenceserver::InferenceRequest> > > >&&)#2}>::_M_invoke(std::_Any_data const&, unsigned int&&, std::vector<std::unique_ptr<nvidia::inferenceserver::InferenceRequest, std::default_delete<nvidia::inferenceserver::InferenceRequest> >, std::allocator<std::unique_ptr<nvidia::inferenceserver::InferenceRequest, std::default_delete<nvidia::inferenceserver::InferenceRequest> > > >&&) (in /opt/tritonserver/lib/libtritonserver.so)
==18067==    by 0x4BEEB8F: nvidia::inferenceserver::DynamicBatchScheduler::SchedulerThread(unsigned int, int, std::shared_ptr<std::atomic<bool> > const&, std::promise<bool>*) (in /opt/tritonserver/lib/libtritonserver.so)
==18067==    by 0x5B9AD83: ??? (in /usr/lib/x86_64-linux-gnu/libstdc++.so.6.0.28)
==18067==    by 0x5741608: start_thread (pthread_create.c:477)
==18067==    by 0x5F33292: clone (clone.S:95)

Model Loading failure: Invalid argument: model output cannot have empty reshape for non-batching model for test_model

Description
When trying to load an ONNX model with an auto-generated config file, the following error was thrown:

E1006 22:22:40.180598 23016 model_repository_manager.cc:1186] failed to load 'ads_model' version 1: Invalid argument: model output cannot have empty reshape for non-batching model for test_model

Triton Information
What version of Triton are you using?
r21.09
Are you using the Triton container or did you build it yourself?
using nvcr.io/nvidia/tritonserver:21.09-py3
To Reproduce
Steps to reproduce the behavior.

docker run --rm -p8000:8000 -p8001:8001 -p8002:8002 -v$(pwd)/server/docs/examples/model_repository:/models nvcr.io/nvidia/tritonserver:21.09-py3 tritonserver --model-repository=/models --strict-model-config=false

Describe the models (framework, inputs, outputs), ideally include the model configuration file (if using an ensemble include the model configuration file for that as well).
output:

{'name': 'test_1_shape', 'type': {'tensorType': {'elemType': 7, 'shape': {'dim': [{'dimValue': '1'}]}}}}
{'name': 'test_1_values', 'type': {'tensorType': {'elemType': 1, 'shape': {'dim': [{}]}}}}
{'name': 'test_1_indices', 'type': {'tensorType': {'elemType': 7, 'shape': {'dim': [{}]}}}}
{'name': 'test_2_shape', 'type': {'tensorType': {'elemType': 7, 'shape': {'dim': [{'dimValue': '2'}]}}}}
{'name': 'test_2_values', 'type': {'tensorType': {'elemType': 1, 'shape': {'dim': [{}]}}}}
{'name': 'test_2_indices', 'type': {'tensorType': {'elemType': 7, 'shape': {'dim': [{}]}}}}
{'name': 'test_result_squeezed', 'type': {'tensorType': {'elemType': 1, 'shape': {}}}}
{'name': 'test_3s_shape', 'type': {'tensorType': {'elemType': 7, 'shape': {'dim': [{'dimValue': '1'}]}}}}
{'name': 'test_3s_values', 'type': {'tensorType': {'elemType': 1, 'shape': {'dim': [{}]}}}}
{'name': 'test_3s_indices', 'type': {'tensorType': {'elemType': 7, 'shape': {'dim': [{}]}}}}
{'name': 'test_3_shape', 'type': {'tensorType': {'elemType': 7, 'shape': {'dim': [{'dimValue': '1'}]}}}}
{'name': 'test_3_values', 'type': {'tensorType': {'elemType': 1, 'shape': {'dim': [{}]}}}}
{'name': 'test_3_indices', 'type': {'tensorType': {'elemType': 7, 'shape': {'dim': [{}]}}}}

input:

{'name': 'test1_shape', 'type': {'tensorType': {'elemType': 7, 'shape': {'dim': [{'dimValue': '1'}]}}}}
{'name': 'test1_values', 'type': {'tensorType': {'elemType': 1, 'shape': {'dim': [{}]}}}}
{'name': 'test1_indices', 'type': {'tensorType': {'elemType': 7, 'shape': {'dim': [{}]}}}}
{'name': 'test2_id', 'type': {'tensorType': {'elemType': 7, 'shape': {'dim': [{'dimValue': '1'}]}}}}
{'name': 'test3_shape', 'type': {'tensorType': {'elemType': 7, 'shape': {'dim': [{'dimValue': '1'}]}}}}
{'name': 'test3_values', 'type': {'tensorType': {'elemType': 1, 'shape': {'dim': [{}]}}}}
{'name': 'test3_indices', 'type': {'tensorType': {'elemType': 7, 'shape': {'dim': [{}]}}}}
{'name': 'test3a_shape', 'type': {'tensorType': {'elemType': 7, 'shape': {'dim': [{'dimValue': '1'}]}}}}
{'name': 'test3a_values', 'type': {'tensorType': {'elemType': 1, 'shape': {'dim': [{}]}}}}
{'name': 'test3a_indices', 'type': {'tensorType': {'elemType': 7, 'shape': {'dim': [{}]}}}}
{'name': 'test3_y_industry_shape', 'type': {'tensorType': {'elemType': 7, 'shape': {'dim': [{'dimValue': '1'}]}}}}
{'name': 'test3_y_industry_values', 'type': {'tensorType': {'elemType': 1, 'shape': {'dim': [{}]}}}}
{'name': 'test3_y_industry_indices', 'type': {'tensorType': {'elemType': 7, 'shape': {'dim': [{}]}}}}
{'name': 'test3ry_shape', 'type': {'tensorType': {'elemType': 7, 'shape': {'dim': [{'dimValue': '1'}]}}}}
{'name': 'test3ry_values', 'type': {'tensorType': {'elemType': 1, 'shape': {'dim': [{}]}}}}
{'name': 'test3ry_indices', 'type': {'tensorType': {'elemType': 7, 'shape': {'dim': [{}]}}}}
{'name': 'test3y_shape', 'type': {'tensorType': {'elemType': 7, 'shape': {'dim': [{'dimValue': '1'}]}}}}
{'name': 'test3y_values', 'type': {'tensorType': {'elemType': 1, 'shape': {'dim': [{}]}}}}
{'name': 'test3y_indices', 'type': {'tensorType': {'elemType': 7, 'shape': {'dim': [{}]}}}}
{'name': 'test3d_shape', 'type': {'tensorType': {'elemType': 7, 'shape': {'dim': [{'dimValue': '1'}]}}}}
{'name': 'test3d_values', 'type': {'tensorType': {'elemType': 1, 'shape': {'dim': [{}]}}}}
{'name': 'test3d_indices', 'type': {'tensorType': {'elemType': 7, 'shape': {'dim': [{}]}}}}

Expected behavior
The model should load with the auto-generated config.
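
For reference, I/O listings like the ones above can be reproduced with a short script over the ONNX graph (the model path is hypothetical):

import onnx
from google.protobuf.json_format import MessageToDict

# Dump each graph output and input as a dict, matching the listings above.
model = onnx.load("model_repository/test_model/1/model.onnx")
print("output:")
for out in model.graph.output:
    print(MessageToDict(out))
print("input:")
for inp in model.graph.input:
    print(MessageToDict(inp))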
