Comments (62)

hongym7 avatar hongym7 commented on August 15, 2024 8

I have the same issue too.
There is no big change.

huaifeng1993 avatar huaifeng1993 commented on August 15, 2024 4

I solved the problem by using nvidia-docker with the TensorFlow 19.04 container. Referring to the TF-TRT user guide, I found that there are only two ways to install TF-TRT: using the container or compiling TensorFlow with TensorRT integration from source.

EsmeYi avatar EsmeYi commented on August 15, 2024 2

@BertrandD

saved_model_cli convert \
--dir "/home/yilrr/tf-serving/faster-rcnn/saved_model/versions/1" \
--output_dir "/home/yilrr/tf-serving/trt-frcnn" \
--tag_set serve \
tensorrt --precision_mode FP32 --max_batch_size 32 --is_dynamic_op True

The saved_model_cli convert tool will call tensorflow.contrib.tensorrt.create_inference_graph().
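For reference, a roughly equivalent Python call with the TF 1.13 contrib API is sketched below (paths and options copied from the command above; treat this as a sketch rather than the exact CLI internals):

    import tensorflow.contrib.tensorrt as trt

    # create_inference_graph can read and write SavedModel directories
    # directly, which is what the CLI drives.
    trt.create_inference_graph(
        input_graph_def=None,
        outputs=None,
        input_saved_model_dir='/home/yilrr/tf-serving/faster-rcnn/saved_model/versions/1',
        input_saved_model_tags=['serve'],
        output_saved_model_dir='/home/yilrr/tf-serving/trt-frcnn',
        max_batch_size=32,
        precision_mode='FP32',
        is_dynamic_op=True)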

Here are the logs during model conversion:

2019-06-27 22:43:27.553644: I tensorflow/core/grappler/optimizers/meta_optimizer.cc:581] Optimization results for grappler item: tf_graph
2019-06-27 22:43:27.553729: I tensorflow/core/grappler/optimizers/meta_optimizer.cc:583]   constant folding: Graph size after: 6441 nodes (-490), 10465 edges (-509), time = 805.309ms.
2019-06-27 22:43:27.553742: I tensorflow/core/grappler/optimizers/meta_optimizer.cc:583]   layout: Graph size after: 6468 nodes (27), 10492 edges (27), time = 253.081ms.
2019-06-27 22:43:27.553755: I tensorflow/core/grappler/optimizers/meta_optimizer.cc:583]   constant folding: Graph size after: 6456 nodes (-12), 10485 edges (-7), time = 516.809ms.

hongym7 avatar hongym7 commented on August 15, 2024 1

@Eloring
TF : 1.14.0
TRT : 5.0.2.6

Thank you :D

I used your source code.
Here is the log:

2019-07-01 18:08:03.789649: I tensorflow/core/grappler/optimizers/meta_optimizer.cc:716] Optimization results for grappler item: tf_graph
2019-07-01 18:08:03.789675: I tensorflow/core/grappler/optimizers/meta_optimizer.cc:718] constant folding: Graph size after: 2407 nodes (-648), 3139 edges (-660), time = 594.514ms.
2019-07-01 18:08:03.789710: I tensorflow/core/grappler/optimizers/meta_optimizer.cc:718] layout: Graph size after: 2426 nodes (19), 3161 edges (22), time = 131.374ms.
2019-07-01 18:08:03.789714: I tensorflow/core/grappler/optimizers/meta_optimizer.cc:718] constant folding: Graph size after: 2422 nodes (-4), 3159 edges (-2), time = 318.721ms.

Is it right?

ZhuoranLyu avatar ZhuoranLyu commented on August 15, 2024 1

@PetreanuAndi, actually, using TF-TRT with FP32 may not accelerate at all (it may instead slow things down). However, it should accelerate with FP16 precision, especially on a new GPU like the 2080 Ti with tensor cores. I see a 3x speedup using FP16 on a 2080 Ti.

pooyadavoodi avatar pooyadavoodi commented on August 15, 2024 1

For NMS, if you can use combined_non_max_suppression in your graph, then you get a much better speedup, especially because TF-TRT optimizes that op.

If you use the object detection API, you can use the submodule of tensorflow/models to get combined_nms as follows:

  • The config file that you need to change for NMS is pipeline.config.
  • In the post_processing section of the config file, there is batch_non_max_suppression, which specifies the NMS configuration. Add this new field to the NMS config: combined_nms: true (see the sketch after this list).
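A sketch of where that field lands in pipeline.config (the threshold values here are illustrative placeholders, not recommendations):

    post_processing {
      batch_non_max_suppression {
        score_threshold: 0.3          # illustrative
        iou_threshold: 0.6            # illustrative
        max_detections_per_class: 100
        max_total_detections: 100
        combined_nms: true            # the new field described above
      }
    }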

EsmeYi avatar EsmeYi commented on August 15, 2024 1

@Programmerwyl
I met the same problem, i.e. tf.import_graph_def is too slow. This is because importing a *.pb model calls ParseFromString(), which is provided by protobuf.
I solved it by compiling a cpp-implemented protobuf from source, as I have recorded here.
Hope this is helpful.
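A quick way to check which protobuf implementation is active (prints 'cpp' or 'python'):

    from google.protobuf.internal import api_implementation
    print(api_implementation.Type())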

BertrandD avatar BertrandD commented on August 15, 2024

Do you have the code you used to generate the TF-TRT version of your model? In your optimized graph, do you have any TRTEngineOp node?

len([1 for n in frozen_graph.node if str(n.op)=='TRTEngineOp'])

EsmeYi avatar EsmeYi commented on August 15, 2024

@BertrandD

  1. My TF version is 1.13.1, therefore I import tensorflow.contrib.tensorrt instead of tensorflow.python.compiler.tensorrt and use trt.create_inference_graph() instead of trt.TrtGraphConverter() in my code.
  2. In order to deploy models on TensorFlow Serving, I created the TF-TRT inference graph from a SavedModel. I don't know how to count TRTEngineOp nodes in a SavedModel.

I followed TF-TRT Workflow With A SavedModel and also tried saved_model_cli convert tool.

Any help would be appreciated!
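For reference, a minimal sketch of counting TRTEngineOp nodes in a SavedModel with the TF 1.x loader (saved_model_dir is a placeholder for the export path):

    import tensorflow as tf

    with tf.Session(graph=tf.Graph()) as sess:
        # loader.load returns the MetaGraphDef; its graph_def holds the nodes
        meta_graph = tf.saved_model.loader.load(sess, ['serve'], saved_model_dir)
        n_trt = len([n for n in meta_graph.graph_def.node if n.op == 'TRTEngineOp'])
        print('TRTEngineOp nodes:', n_trt)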

BertrandD avatar BertrandD commented on August 15, 2024

@Eloring Do you have any logs? I had problems creating an optimized graph (from an RFCN model), but by playing with and understanding the parameters I finally (with luck?) got something working... Reading the 2 links you gave, I cannot figure out the problem; maybe with logs I will see something...

EsmeYi avatar EsmeYi commented on August 15, 2024

@huaifeng1993
I have tried to use the container before:
docker pull nvcr.io/nvidia/tensorflow:19.05-py2
However, the container does not support ppc64le (Power CPUs):
standard_init_linux.go:178: exec user process caused "exec format error"

I had supposed TF-TRT was built in by default with tensorflow-gpu... (maybe I was wrong...)

Thanks for your guide; I'll try to compile from source.

EsmeYi avatar EsmeYi commented on August 15, 2024

@BertrandD

It seems the model was converted successfully with TF-TRT after I updated tensorflow-gpu from v1.13 to v1.14 and rebuilt the TF-TRT env, which uses tensorflow.python.compiler.tensorrt instead of tensorflow.contrib.tensorrt.
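For anyone on TF 1.14, the SavedModel conversion looks roughly like this (a sketch; the directory variables are placeholders):

    from tensorflow.python.compiler.tensorrt import trt_convert as trt

    converter = trt.TrtGraphConverter(
        input_saved_model_dir=input_saved_model_dir,  # placeholder path
        precision_mode='FP32',
        is_dynamic_op=True)
    converter.convert()
    converter.save(output_saved_model_dir)  # placeholder path, ready for Serving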

2019-06-28 11:16:03.857404: I tensorflow/compiler/tf2tensorrt/segment/segment.cc:460] There are 2722 ops of 57 different types in the graph that are not converted to TensorRT: Sum, TopKV2, Select, CropAndResize, Fill, Split, Transpose, Where, Size, GatherV2, Greater, Equal, NonMaxSuppressionV3, Reshape, Add, ResizeBilinear, Assert, LoopCond, Merge, Squeeze, Enter, DataFormatVecPermute, ZerosLike, Less, Range, Placeholder, TensorArrayV3, TensorArraySizeV3, TensorArrayScatterV3, Cast, Maximum, StridedSlice, Shape, Minimum, Switch, TensorArrayReadV3, Prod, Identity, ExpandDims, ConcatV2, Unpack, RealDiv, Pad, Slice, LogicalAnd, Mul, Round, TensorArrayWriteV3, GreaterEqual, NoOp, Pack, Exit, NextIteration, TensorArrayGatherV3, Sub, Const, Tile, (For more information see https://docs.nvidia.com/deeplearning/dgx/tf-trt-user-guide/index.html#supported-ops).
2019-06-28 11:16:04.378135: I tensorflow/compiler/tf2tensorrt/convert/convert_graph.cc:733] Number of TensorRT candidate segments: 18
2019-06-28 11:16:04.684423: I tensorflow/stream_executor/platform/default/dso_loader.cc:42] Successfully opened dynamic library libcudart.so.10.1
2019-06-28 11:16:04.684771: I tensorflow/compiler/tf2tensorrt/convert/convert_graph.cc:835] TensorRT node ClipToWindow/TRTEngineOp_0 added for segment 0 consisting of 8 nodes succeeded.
2019-06-28 11:16:04.684937: I tensorflow/compiler/tf2tensorrt/convert/convert_graph.cc:835] TensorRT node TRTEngineOp_1 added for segment 1 consisting of 4 nodes succeeded.
2019-06-28 11:16:04.685106: I tensorflow/compiler/tf2tensorrt/convert/convert_graph.cc:835] TensorRT node TRTEngineOp_2 added for segment 2 consisting of 18 nodes succeeded.
2019-06-28 11:16:04.685303: I tensorflow/compiler/tf2tensorrt/convert/convert_graph.cc:835] TensorRT node TRTEngineOp_3 added for segment 3 consisting of 18 nodes succeeded.
2019-06-28 11:16:04.685498: I tensorflow/compiler/tf2tensorrt/convert/convert_graph.cc:835] TensorRT node TRTEngineOp_4 added for segment 4 consisting of 18 nodes succeeded.
2019-06-28 11:16:04.685696: I tensorflow/compiler/tf2tensorrt/convert/convert_graph.cc:835] TensorRT node TRTEngineOp_5 added for segment 5 consisting of 18 nodes succeeded.
2019-06-28 11:16:04.705593: I tensorflow/compiler/tf2tensorrt/convert/convert_graph.cc:835] TensorRT node TRTEngineOp_6 added for segment 6 consisting of 442 nodes succeeded.
2019-06-28 11:16:04.708003: I tensorflow/compiler/tf2tensorrt/convert/convert_graph.cc:835] TensorRT node TRTEngineOp_7 added for segment 7 consisting of 4 nodes succeeded.
2019-06-28 11:16:04.708203: I tensorflow/compiler/tf2tensorrt/convert/convert_graph.cc:835] TensorRT node TRTEngineOp_8 added for segment 8 consisting of 3 nodes succeeded.
2019-06-28 11:16:04.708369: I tensorflow/compiler/tf2tensorrt/convert/convert_graph.cc:835] TensorRT node TRTEngineOp_9 added for segment 9 consisting of 3 nodes succeeded.
2019-06-28 11:16:04.708506: I tensorflow/compiler/tf2tensorrt/convert/convert_graph.cc:835] TensorRT node GridAnchorGenerator/TRTEngineOp_10 added for segment 10 consisting of 8 nodes succeeded.
2019-06-28 11:16:04.708626: I tensorflow/compiler/tf2tensorrt/convert/convert_graph.cc:835] TensorRT node GridAnchorGenerator/TRTEngineOp_11 added for segment 11 consisting of 3 nodes succeeded.
2019-06-28 11:16:04.708736: I tensorflow/compiler/tf2tensorrt/convert/convert_graph.cc:835] TensorRT node GridAnchorGenerator/TRTEngineOp_12 added for segment 12 consisting of 3 nodes succeeded.
2019-06-28 11:16:04.725830: I tensorflow/compiler/tf2tensorrt/convert/convert_graph.cc:835] TensorRT node TRTEngineOp_13 added for segment 13 consisting of 169 nodes succeeded.
2019-06-28 11:16:04.727548: I tensorflow/compiler/tf2tensorrt/convert/convert_graph.cc:835] TensorRT node TRTEngineOp_14 added for segment 14 consisting of 7 nodes succeeded.
2019-06-28 11:16:04.728181: I tensorflow/compiler/tf2tensorrt/convert/convert_graph.cc:835] TensorRT node TRTEngineOp_15 added for segment 15 consisting of 7 nodes succeeded.
2019-06-28 11:16:04.728442: I tensorflow/compiler/tf2tensorrt/convert/convert_graph.cc:835] TensorRT node SecondStagePostprocessor/TRTEngineOp_16 added for segment 16 consisting of 8 nodes succeeded.
2019-06-28 11:16:04.728586: I tensorflow/compiler/tf2tensorrt/convert/convert_graph.cc:835] TensorRT node SecondStagePostprocessor/TRTEngineOp_17 added for segment 17 consisting of 7 nodes succeeded.
2019-06-28 11:16:04.945385: I tensorflow/core/grappler/optimizers/meta_optimizer.cc:716] Optimization results for grappler item: tf_graph
2019-06-28 11:16:04.945483: I tensorflow/core/grappler/optimizers/meta_optimizer.cc:718]   constant folding: Graph size after: 6456 nodes (-475), 10488 edges (-486), time = 764.6ms.
2019-06-28 11:16:04.945501: I tensorflow/core/grappler/optimizers/meta_optimizer.cc:718]   layout: Graph size after: 6483 nodes (27), 10515 edges (27), time = 245.293ms.
2019-06-28 11:16:04.945517: I tensorflow/core/grappler/optimizers/meta_optimizer.cc:718]   constant folding: Graph size after: 6471 nodes (-12), 10508 edges (-7), time = 489.997ms.
2019-06-28 11:16:04.945540: I tensorflow/core/grappler/optimizers/meta_optimizer.cc:718]   TensorRTOptimizer: Graph size after: 5741 nodes (-730), 9719 edges (-789), time = 1155.79297ms.

hongym7 avatar hongym7 commented on August 15, 2024

I ran a fine-tuned object detection model, got the .pb file, and then ran TensorRT.
My log is below:

graph_size(MB)(native_tf): 181.3
graph_size(MB)(trt): 182.1
num_nodes(native_tf): 2564
num_nodes(tftrt_total): 1594
num_nodes(trt_only): 0
time(s) (trt_conversion): 2.6404

Is it right?

EsmeYi avatar EsmeYi commented on August 15, 2024

@hongym7
num_nodes(trt_only): 0 means that your converted model doesn't have any TensorRT nodes (i.e. TRTEngineOp).

hongym7 avatar hongym7 commented on August 15, 2024

@Eloring
Um... Thank you.
I need to do more research.
I'll let you know what I find.

hongym7 avatar hongym7 commented on August 15, 2024

@Eloring
My source is:

from tftrt.examples.object_detection import optimize_model
import tensorflow.contrib.tensorrt as trt
import tensorflow as tf

config_path = '/home/hong/PycharmProjects/tensorflow_models_drone/research/faster_rcnn_resnet101_drone_27.config'
checkpoint_path = '/home/hong/PycharmProjects/tensorflow_models_drone/research/train_result_drone_27/model.ckpt-18000'

frozen_graph = optimize_model(
    config_path=config_path,
    checkpoint_path=checkpoint_path,
    use_trt=True,
    precision_mode='FP16'
)

...

Is it right?

ref : https://github.com/tensorflow/tensorrt/tree/master/tftrt/examples/object_detection#od_optimize

EsmeYi avatar EsmeYi commented on August 15, 2024

@hongym7
What are your TensorFlow and TensorRT versions?
Can you show me the logs?

Here is my core code:

import os

import tensorflow as tf
from tensorflow.python.compiler.tensorrt import trt_convert as trt

# Assumption: the model follows the TF Object Detection API export convention,
# so the output node names below are the standard ones.
BOXES_NAME = 'detection_boxes'
CLASSES_NAME = 'detection_classes'
SCORES_NAME = 'detection_scores'
NUM_DETECTIONS_NAME = 'num_detections'

def frozen_graph_trt(
    input_frozen_graph_path,
    output_dir,
    max_batch_size,
    precision_mode,
    is_dynamic_op):
    '''
    create a TensorRT inference graph from a frozen graph
    '''
    output_node_names = [BOXES_NAME, CLASSES_NAME, SCORES_NAME, NUM_DETECTIONS_NAME]
    if not os.path.exists(output_dir):
        os.makedirs(output_dir)
    output_frozen_graph_path = os.path.join(output_dir, 'trt_frozen_graph.pb')
    with tf.io.gfile.GFile(input_frozen_graph_path, 'rb') as f:
        graph_def = tf.compat.v1.GraphDef()
        graph_def.ParseFromString(f.read())

    trt_graph = trt.create_inference_graph(
        input_graph_def=graph_def,
        outputs=output_node_names,
        max_batch_size=max_batch_size,
        max_workspace_size_bytes=trt.DEFAULT_TRT_MAX_WORKSPACE_SIZE_BYTES,
        precision_mode=precision_mode,
        is_dynamic_op=is_dynamic_op)  # was hard-coded to False, ignoring the argument

    with open(output_frozen_graph_path, 'wb') as f:
        f.write(trt_graph.SerializeToString())


def ckpt_trt():
    '''
    create a TensorRT inference graph from MetaGraph and checkpoint files
    '''
    # use tf.graph_util.convert_variables_to_constants to freeze the ckpt into
    # a frozen graph, and then use frozen_graph_trt()
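For completeness, a minimal sketch of that freezing step (a hypothetical helper; it assumes the .meta file sits next to the checkpoint and that you know the output node names):

    def freeze_checkpoint(checkpoint_path, output_node_names):
        '''freeze the variables in a checkpoint into a constant GraphDef'''
        saver = tf.compat.v1.train.import_meta_graph(checkpoint_path + '.meta')
        with tf.compat.v1.Session() as sess:
            saver.restore(sess, checkpoint_path)
            return tf.compat.v1.graph_util.convert_variables_to_constants(
                sess, sess.graph.as_graph_def(), output_node_names)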

BertrandD avatar BertrandD commented on August 15, 2024

@Eloring Good! But be careful: you probably won't be able to run your 1.14 model in a 1.13 environment. You will need a TF 1.14 build to execute your model, and the current 1.14.0 release does not dynamically load the TRTEngineOp.

In 1.13.1 you need to add import tensorflow.contrib.tensorrt as trt to your code to load the TRTEngineOp; in 1.14.0, TensorRT support is no longer in contrib and the dynamic load is not enabled. You need the master branch of tensorflow (or wait for the next 1.14.1 release). Dynamic loading of TensorRT was added in this commit after the 1.14.0 release: tensorflow/tensorflow@408949d

EsmeYi avatar EsmeYi commented on August 15, 2024

@hongym7
Well, I guess your TF-TRT wasn't installed successfully. It's recommended to use the TensorFlow docker container provided by NVIDIA, where TF-TRT is already compiled:

docker pull nvcr.io/nvidia/tensorflow:19.06-py2

TensorFlow Release 19.06

hongym7 avatar hongym7 commented on August 15, 2024

@Eloring
Thank you for your comment.
But... is there no solution without Docker?

EsmeYi avatar EsmeYi commented on August 15, 2024

@hongym7
IBM Watson Machine Learning Community Edition 1.6.1 (also known as PowerAI) provides software packages for several deep learning frameworks, supporting libraries, and tools.
https://www.ibm.com/support/knowledgecenter/SS5SF7_1.6.1/navigation/wmlce_software_pkgs.html
https://www.ibm.com/support/knowledgecenter/SS5SF7_1.6.1/navigation/wmlce_download.html

The easiest way to get WML CE is using anaconda:

$ conda config --prepend channels https://public.dhe.ibm.com/ibmdl/export/pub/software/server/ibm-ai/conda/
$ export IBM_POWERAI_LICENSE_ACCEPT=yes

$ conda install powerai 

VincentChong123 avatar VincentChong123 commented on August 15, 2024

Hi @Eloring, @BertrandD,

Did you try ssd_mobilenet_v1 with TF-TRT? I got 25 ms for INT8 compared to 31 ms for FP32 (30 ms for FP16), with input resolution 300x300, batch size 1, and synthetic data.

Is this 1.24x speedup acceptable? I cannot find speed references beyond the links below:
http://on-demand.gputechconf.com/gtc/2017/presentation/s7310-8-bit-inference-with-tensorrt.pdf <- page 41, batch-1 INT8 speedup is ~1.3x depending on the algo
https://devblogs.nvidia.com/int8-inference-autonomous-vehicles-tensorrt/ <- Caffe + TRT + INT8, 4x speedup
https://github.com/NVIDIA-AI-IOT/tf_to_trt_image_classification <- no INT8 TensorRT info

The number of trt_only operations is small compared to tftrt_total; is that acceptable?
num_nodes(tftrt_total): 2885
int8: num_nodes(trt_only): 3
fp32/16: num_nodes(trt_only): 8
docker: nvcr.io/nvidia/tensorflow:19.05-py3 (NVIDIA-SMI 418.56, Python 3.5.2, TF 1.13.1)
system: Ubuntu 18.04, 2080 Ti <- supports INT8

precision_mode=fp32

meta_optimizer.cc:621] Optimization results for grappler item: tf_graph
meta_optimizer.cc:623]   constant folding: Graph size after: 3379 nodes (-2748), 4233 edges (-3168), time = 447.135ms.
meta_optimizer.cc:623]   layout: Graph size after: 3394 nodes (15), 4259 edges (26), time = 118.566ms.
meta_optimizer.cc:623]   constant folding: Graph size after: 3394 nodes (0), 4259 edges (0), time = 140.997ms.
meta_optimizer.cc:623]   TensorRTOptimizer: Graph size after: 2885 nodes (-509), 3656 edges (-603), time = 416.305ms.

W tensorflow/contrib/tensorrt/kernels/trt_engine_op.cc:290] Engine retrieval for batch size 9000 failed. Running native segment for Postprocessor/BatchMultiClassNonMaxSuppression/MultiClassNonMaxSuppression/Area/TRTEngineOp_2
	graph_size(MB)(native_tf): 27.4
	graph_size(MB)(trt): 53.3
	num_nodes(native_tf): 6127
	num_nodes(tftrt_total): 2885
	num_nodes(trt_only): 8    <- refer (1)
	time(s) (trt_conversion): 3.3262
	---------------------------------------------------------------------------
	finish frozen_graph 
		step 100/4096, iter_time(ms)=31.7493
		step 200/4096, iter_time(ms)=31.6466

(note 1) num_nodes(trt_only): 8
	TRTEngineOp_0
	Postprocessor/BatchMultiClassNonMaxSuppression/TRTEngineOp_5
	Postprocessor/TRTEngineOp_6
	Postprocessor/BatchMultiClassNonMaxSuppression/TRTEngineOp_4
	TRTEngineOp_1
	Postprocessor/BatchMultiClassNonMaxSuppression/MultiClassNonMaxSuppression/ClipToWindow/Area/TRTEngineOp_3
	Postprocessor/BatchMultiClassNonMaxSuppression/MultiClassNonMaxSuppression/Area/TRTEngineOp_2
	TRTEngineOp_7

precision_mode=int8

log
meta_optimizer.cc:621] Optimization results for grappler item: tf_graph
meta_optimizer.cc:623]   constant folding: Graph size after: 3379 nodes (-2748), 4233 edges (-3168), time = 441.048ms.
meta_optimizer.cc:623]   layout: Graph size after: 3394 nodes (15), 4259 edges (26), time = 118.754ms.
meta_optimizer.cc:623]   constant folding: Graph size after: 3394 nodes (0), 4259 edges (0), time = 142.857ms.
meta_optimizer.cc:623]   TensorRTOptimizer: Graph size after: 2894 nodes (-500), 3665 edges (-594), time = 19535.6191ms.

2019-07-09 03:23:28.745222: I tensorflow/contrib/tensorrt/kernels/trt_engine_op.cc:496] Building a new TensorRT engine for TRTEngineOp_1 with batch size 9000
graph_size(MB)(native_tf): 27.4
graph_size(MB)(trt): 53.2
num_nodes(native_tf): 6127
num_nodes(tftrt_total): 2894
num_nodes(trt_only): 3          <- refer(note2)
time(s) (trt_conversion): 23.2403
    step 100/73, iter_time(ms)=2908.3175

results:
finish frozen_graph 
    step 100/4096, iter_time(ms)=25.8242
    step 200/4096, iter_time(ms)=25.3153

(note 2) num_nodes(trt_only): 3
	TRTEngineOp_0
	Postprocessor/TRTEngineOp_6
	TRTEngineOp_1
code:

precision_mode = 'INT8'

frozen_graph = optimize_model(
    config_path=config_path,
    checkpoint_path=checkpoint_path,
    use_trt=True,
    force_nms_cpu=False,  # default true
    precision_mode=precision_mode,
    max_workspace_size_bytes=1 << 32,
    maximum_cached_engines=100,
    calib_images_dir='/N/data-sata/fast-ai-coco/coco-2014/val2014',
    num_calib_images=100,
    calib_image_shape=(300, 300),
    output_path="{}.output_path.{}.graph".format(config_path, precision_mode)
)

from tftrt.examples.object_detection import benchmark_model
statistics = benchmark_model(
    frozen_graph=frozen_graph,
    images_dir=images_dir,
    annotation_path=annotation_path,
    use_synthetic=True,
    image_shape=(300, 300)
)

EsmeYi avatar EsmeYi commented on August 15, 2024

Hi @weishengchong
Well, in my opinion, using TF-TRT to accelerate TF models can't reach the same speedup as using the TRT UFF parser to build an engine.

Using TF-TRT:

TensorRT optimizes the largest subgraphs possible in the TensorFlow graph. The more compute in the subgraph, the greater benefit obtained from TensorRT. You want most of the graph optimized and replaced with the fewest number of TensorRT nodes for best performance. Based on the operations in your graph, it’s possible that the final graph might have more than one TensorRT node.

This means each TRTEngineOp contains a serialized subgraph GraphDef, where a subgraph comprises several TF nodes of the original graph, so it is expected that the TRT node count is smaller than the TF total.

TF-TRT produces an optimized model that runs in TensorFlow for inference; if executing a TRT engine fails, the TRT op falls back to calling the corresponding TF function.

Using TensorFlow/UFF parser:

  1. Converts TF graph to UFF file format (need add custom plugins for unsupported layers)
  2. Loads the UFF model and creates the UFF parser 
  3. Builds an optimized engine 
  4. Uses the engine to perform inference in TensorRT

For me, the most challenging step is adding custom layers (coding plugins in C++)...
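For reference, a minimal sketch of steps 2-4 with the TensorRT 5.x Python API (tensor names and shapes are placeholders for your model's real inputs/outputs):

    import tensorrt as trt

    TRT_LOGGER = trt.Logger(trt.Logger.WARNING)

    builder = trt.Builder(TRT_LOGGER)
    network = builder.create_network()
    parser = trt.UffParser()

    parser.register_input('Input', (3, 300, 300))  # placeholder name/shape
    parser.register_output('MarkOutput_0')         # placeholder name
    parser.parse('model.uff', network)

    builder.max_batch_size = 1
    builder.max_workspace_size = 1 << 30
    engine = builder.build_cuda_engine(network)    # optimized engine for step 4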

TensorRT provides a UFF-SSD sample.
I have evaluated the sample, and the result shows TRT gives more than a 6.7x speedup (both in FP32 mode). But for SSD with TF-TRT there is little improvement, while when I tested Faster-RCNN with TF-TRT I got a 2.1x speedup.

VincentChong123 avatar VincentChong123 commented on August 15, 2024

Hi @Eloring,

TensorRT provides a UFF-SSD sample.
TRT gives more than a 6.7x speedup (both in FP32 mode).

Did you try INT8 UFF-SSD?

Only a 1.3x speedup is reported for the GTX 1080 Ti:

FP32 inference time: ~9 ms
INT8 inference time: ~6 ms

Thanks again.

EsmeYi avatar EsmeYi commented on August 15, 2024

Hi @weishengchong
From the original TensorFlow frozen graph to the TensorRT engine, TRT gave a 6.7x speedup, with no quantization like FP16 or INT8 used.
That's what I meant :)
And I haven't tried INT8 UFF-SSD.

ZhuoranLyu avatar ZhuoranLyu commented on August 15, 2024

Hi @weishengchong @Eloring Did you guys figure out how to use TF-TRT without docker? Using TF-TRT in docker indeed works for me. However, I'd like to use TF-TRT with the C/C++ API in a native (Windows) env. Do you know how to work this out? Thanks.

VincentChong123 avatar VincentChong123 commented on August 15, 2024

Hi @ZhuoranLyu,

I only succeeded using docker.

I have no idea about running TRT on Windows. FYI: https://devtalk.nvidia.com/default/topic/1055484/tensorrt/deepstream_reference_apps-trt-yolo-app-windows-build/

Hi @Eloring thanks for your advice.

ZhuoranLyu avatar ZhuoranLyu commented on August 15, 2024

@weishengchong I ran TensorRT successfully on Windows. However, I was wondering how to run TensorFlow-TensorRT (TF-TRT) on Windows.

PetreanuAndi avatar PetreanuAndi commented on August 15, 2024

Hello guys. @Eloring, I am especially interested in this thread. I am trying to convert an SSD_Resnet50_FPN model to TF-TRT. Everything works just fine: I converted both the saved_model and the inference graph, FP16 & FP32, and tried all the options (fixed input size etc.), but the output I get is:


2019-08-07 13:28:26.380795: I tensorflow/core/grappler/optimizers/meta_optimizer.cc:716] Optimization results for grappler item: tf_graph
2019-08-07 13:28:26.381027: I tensorflow/core/grappler/optimizers/meta_optimizer.cc:718] constant folding: Graph size after: 2836 nodes (-1660), 4183 edges (-1854), time = 593.697ms.
2019-08-07 13:28:26.381058: I tensorflow/core/grappler/optimizers/meta_optimizer.cc:718] layout: Graph size after: 2880 nodes (44), 4255 edges (72), time = 152.605ms.
2019-08-07 13:28:26.381160: I tensorflow/core/grappler/optimizers/meta_optimizer.cc:718] constant folding: Graph size after: 2880 nodes (0), 4255 edges (0), time = 195.618ms.
graph_size(MB)(native_tf): 123.3
graph_size(MB)(trt): 123.2
num_nodes(native_tf): 4496
num_nodes(tftrt_total): 2880
num_nodes(trt_only): 0
time(s) (trt_conversion): 2.9199
number of TRT ops in the converted graph : 0


There are no trt_only nodes, and no TRT ops.
My original TF frozen graph had 0.0248 s inference time (1080 Ti).
My TF-TRT frozen graph has 0.0251 s inference time (so slightly slower, averaged over 1000 random images).

Is FPN or ResNet (skip connections) the cause of this failed optimization? (I mean it compiles and works, but does so slower than before optimization.)
I also extract features from 4 different feature maps in the encoder, corresponding to the FPN heads (I specified all output nodes in the conversion procedure). Maybe that's why it does not optimize well? I need these 4 outputs for a fused encoding volume that is passed to an LSTM, so that's important.

Guys, anything would help at this point. Thank you very much in advance!

EsmeYi avatar EsmeYi commented on August 15, 2024

Hi @PetreanuAndi, these are my logs from converting an ssd_resnet_50_fpn_coco model (from the tensorflow model zoo).

2019-08-08 11:10:38.312574: I tensorflow/core/grappler/optimizers/meta_optimizer.cc:716] Optimization results for grappler item: tf_graph
2019-08-08 11:10:38.312798: I tensorflow/core/grappler/optimizers/meta_optimizer.cc:718] constant folding: Graph size after: 23991 nodes (-14028), 30371 edges (-16346), time = 3906.88501ms.
2019-08-08 11:10:38.312813: I tensorflow/core/grappler/optimizers/meta_optimizer.cc:718] layout: Graph size after: 24008 nodes (17), 30401 edges (30), time = 996.894ms.
2019-08-08 11:10:38.312825: I tensorflow/core/grappler/optimizers/meta_optimizer.cc:718] constant folding: Graph size after: 24008 nodes (0), 30401 edges (0), time = 1110.90796ms.
2019-08-08 11:10:38.312837: I tensorflow/core/grappler/optimizers/meta_optimizer.cc:718] TensorRTOptimizer: Graph size after: 19015 nodes (-4993), 25064 edges (-5337), time = 11150.5664ms.
graph_size(MB)(native_tf): 135.2
graph_size(MB)(trt): 268.9
num_nodes(native_tf): 38019
num_nodes(tftrt_total): 19015
num_nodes(trt_only): 49
time(s) (trt_conversion): 53.7628

I noticed that there is no TensorRTOptimizer output in your logs. Did you test such a common model to validate that your conversion code and TF-TRT setup are good?
Besides, what's the inference time of the FP16 model? Did quantization improve the inference performance in your results?

ZhuoranLyu avatar ZhuoranLyu commented on August 15, 2024

@PetreanuAndi Did you use nv-docker or a native env with tensorflow?

PetreanuAndi avatar PetreanuAndi commented on August 15, 2024

Hello @Eloring @ZhuoranLyu

I have pip-installed tensorflow-gpu 1.14 into a conda env.
I've also tried 1.15 and the result is the same.

Should I build from source?
Given that you have trt_only nodes (49), do you actually observe a speedup? How much? Can you give details on that?

FP16 and FP32 both did not improve performance. Moreover, they actually seem to hurt performance (on a 1080 Ti):
---> original graph: avg 0.0248 s
---> optimized graph: avg 0.0251 s

ZhuoranLyu avatar ZhuoranLyu commented on August 15, 2024

@PetreanuAndi build from source or use docker

PetreanuAndi avatar PetreanuAndi commented on August 15, 2024

@ZhuoranLyu I'm building from source now, but can you confirm that you actually have a speedup measurement (with or without trt_only nodes)? Have you also tried SSD + FPN? (Maybe the FPN aggregation has problems in the optimization process.)

I have read on another online forum that only C++ TRT gives a speedup. Thoughts on that? I will come back with prints and benchmarking after the source build finishes.

ZhuoranLyu avatar ZhuoranLyu commented on August 15, 2024

@PetreanuAndi First, you can try inference with Python under the docker environment to see whether it speeds up the model. From my perspective, SSD benefits a lot from TF-TRT, especially using FP16.

EsmeYi avatar EsmeYi commented on August 15, 2024

@PetreanuAndi
Same opinion as @ZhuoranLyu: your TF-TRT environment was evidently not installed successfully. You can try Docker or Anaconda to pull/install an integrated image/package; otherwise, compile it yourself.

PetreanuAndi avatar PetreanuAndi commented on August 15, 2024

Hello guys. @Eloring , @ZhuoranLyu

I have installed tensorflow from source: v1.14, CUDA 10.02, cuDNN 7.4.
The installation went well, and the graph conversion succeeded with the following output:

2019-08-16 12:25:39.216209: I tensorflow/compiler/tf2tensorrt/convert/convert_graph.cc:835] TensorRT node Postprocessor/TRTEngineOp_90 added for segment 90 consisting of 3 nodes succeeded.
2019-08-16 12:25:39.217472: I tensorflow/compiler/tf2tensorrt/convert/convert_graph.cc:835] TensorRT node TRTEngineOp_91 added for segment 91 consisting of 27 nodes succeeded.
2019-08-16 12:25:39.217606: I tensorflow/compiler/tf2tensorrt/convert/convert_graph.cc:835] TensorRT node TRTEngineOp_92 added for segment 92 consisting of 5 nodes succeeded.
2019-08-16 12:25:39.217705: I tensorflow/compiler/tf2tensorrt/convert/convert_graph.cc:835] TensorRT node TRTEngineOp_93 added for segment 93 consisting of 3 nodes succeeded.
2019-08-16 12:25:39.310289: I tensorflow/core/grappler/optimizers/meta_optimizer.cc:716] Optimization results for grappler item: tf_graph
2019-08-16 12:25:39.310333: I tensorflow/core/grappler/optimizers/meta_optimizer.cc:718] constant folding: Graph size after: 2836 nodes (-1660), 4183 edges (-1854), time = 728.073ms.
2019-08-16 12:25:39.310337: I tensorflow/core/grappler/optimizers/meta_optimizer.cc:718] layout: Graph size after: 2880 nodes (44), 4255 edges (72), time = 143.931ms.
2019-08-16 12:25:39.310341: I tensorflow/core/grappler/optimizers/meta_optimizer.cc:718] constant folding: Graph size after: 2880 nodes (0), 4255 edges (0), time = 167.55ms.
2019-08-16 12:25:39.310345: I tensorflow/core/grappler/optimizers/meta_optimizer.cc:718] TensorRTOptimizer: Graph size after: 1907 nodes (-973), 3218 edges (-1037), time = 7242.89307ms.
graph_size(MB)(native_tf): 123.3
graph_size(MB)(trt): 336.9
num_nodes(native_tf): 4496
num_nodes(tftrt_total): 1907
num_nodes(trt_only): 94
time(s) (trt_conversion): 10.6979
number of TRT ops in the converted graph : 94

However, inference time is the same :)
I know plain SSD will benefit from TF-TRT, but I am using SSD + FPN.
Can you please confirm that you get an inference speedup with TF-TRT (when you have trt_only nodes)? Now I have those nodes, but there is no speedup :)

Another forum says TF-TRT driven from Python will not offer a speedup, but the C++ version will... Do you guys run yours in C++? I don't think that's a reasonable argument (the TensorFlow people would probably not ship such a shady implementation).

PetreanuAndi avatar PetreanuAndi commented on August 15, 2024

Hey guys. I have built tf-nightly-gpu 1.15.0.dev20190816 with TensorRT 5 (directly from the TF package), CUDA 10.0 and cuDNN 7.6.0.

It builds/optimizes without errors, and actually outputs more trt_only nodes (102 instead of 94).
Testing with FP32 yielded a poorer inference time (0.031 s on average per image), but testing with FP16 did give a slight improvement over the original model (0.021 s versus 0.028 s).

However, this improvement is still very small compared to what other people report on forums (3x etc.).

You still did not say: is your model compiled from the NVIDIA source of TRT, the C++ one? Or is it installed along with tensorflow (either from source or pip)?

Any other suggestions for SSD + FPN speed improvement? (This is listed as the FIRST example on the GitHub page of tensorflow/tensorrt, so I expected a substantial improvement, but my efforts have mostly been in vain.)

thank you!

ZhuoranLyu avatar ZhuoranLyu commented on August 15, 2024

The model is built with tensorflow in Python and optimized with TF-TRT, just as the example shows. There is no need to implement the model with NVIDIA TensorRT in C++, and I am not familiar with C++ anyway.

zhenpalapala avatar zhenpalapala commented on August 15, 2024

@PetreanuAndi
Same opinion as @ZhuoranLyu: your TF-TRT environment was evidently not installed successfully. You can try Docker or Anaconda to pull/install an integrated image/package; otherwise, compile it yourself.

@hongym7
Well, I guess your TF-TRT wasn't installed successfully. It's recommended to use the TensorFlow docker container provided by NVIDIA, where TF-TRT is already compiled:

docker pull nvcr.io/nvidia/tensorflow:19.06-py2

TensorFlow Release 19.06

Hi, I just met the same issue as you. My TF version is 1.14, my TF Serving version is 1.13, the OS is Linux, and TensorRT is 5.1. I want to use TensorRT to speed up my model, but the output looks like this:

2019-08-19 09:16:25.414934: I tensorflow/core/grappler/optimizers/meta_optimizer.cc:716] Optimization results for grappler item: tf_graph
2019-08-19 09:16:25.414982: I tensorflow/core/grappler/optimizers/meta_optimizer.cc:718] constant folding: Graph size after: 554 nodes (-256), 616 edges (-258), time = 544.394ms.
2019-08-19 09:16:25.414990: I tensorflow/core/grappler/optimizers/meta_optimizer.cc:718] layout: Graph size after: 561 nodes (7), 618 edges (2), time = 118.45ms.
2019-08-19 09:16:25.414998: I tensorflow/core/grappler/optimizers/meta_optimizer.cc:718] constant folding: Graph size after: 556 nodes (-5), 618 edges (0), time = 376.114ms.

It seems TensorRT doesn't work.
I tried to docker pull nvcr.io/nvidia/tensorflow:19.06-py3, but met this error:
'unauthorized: authentication required'
Could you please give me some advice?
Thanks a lot

PetreanuAndi avatar PetreanuAndi commented on August 15, 2024

Hello @zhenpalapala

It seems that tensorflow 1.14 has some issues; I found that out on another forum. Try this instead:

sudo apt-get install --no-install-recommends cuda-10-0 libcudnn7=7.6.0.64-1+cuda10.0 libcudnn7-dev=7.6.0.64-1+cuda10.0

sudo pip install tf-nightly-gpu==1.15.0.dev20190816

Using this nightly TF 1.15, try converting your model to FP16.
This is the only setup that actually improved the speed of my SSD FPN model.
Nothing else worked.
Hope this helps.

zhenpalapala avatar zhenpalapala commented on August 15, 2024

Thanks a lot, @PetreanuAndi.
I found I got the 'unauthorized: authentication required' error when pulling nvcr.io/nvidia/tensorflow:19.06-py3 just because of a bad internet connection. I found that using this container can really speed up the provided ResNet model, but it didn't work with my own model; it even slowed it down.

My original models are saved as CKPT files; I want to optimize TensorFlow Serving performance with NVIDIA TensorRT.

I turned the CKPT files into a saved_model using the code below:

synth.load(args.checkpoint, modified_hp)
sess = synth.session
output_graph_def = tf.graph_util.convert_variables_to_constants(
    sess=sess,
    input_graph_def=sess.graph_def,
    output_node_names=output_node_names.split(","))
tf.saved_model.simple_save(
    session=sess,
    export_dir=args.export_dir,
    inputs={"input_lengths": tf.get_default_graph().get_tensor_by_name('input_lengths:0'),
            "split_infos": tf.get_default_graph().get_tensor_by_name('split_infos:0'),
            "inputs": tf.get_default_graph().get_tensor_by_name("inputs:0")},
    outputs={"linear_wav_outputs": audio.inv_spectrogram_tensorflow(
        tf.get_default_graph().get_tensor_by_name(
            "Tacotron_model/inference/cbhg_linear_specs_projection/projection_cbhg_linear_specs_projection/BiasAdd:0")[0],
        hparams)},
    legacy_init_op=None)
Then I used a docker command to run the TensorRT conversion:

docker run --rm --gpus all -it \
    -v /tmp:/tmp nvcr.io/nvidia/tensorflow:19.06-py3 \
    /usr/local/bin/saved_model_cli convert \
    --dir 'my_saved_model' \
    --output_dir 'my_saved_model_trt' \
    --tag_set serve \
    tensorrt --precision_mode FP16 --max_batch_size 1 --is_dynamic_op True

Finally, I used a docker command to put the final model on Serving. But this model doesn't work; on the contrary, inference slows down and there are warnings:

E external/org_tensorflow/tensorflow/core/grappler/optimizers/meta_optimizer.cc:502] layout failed: Invalid argument: The graph is already optimized by layout optimizer.
…
2019-08-21 08:01:58.396573: W external/org_tensorflow/tensorflow/compiler/tf2tensorrt/kernels/trt_engine_op.cc:647] Engine creation for TRTEngineOp_21 failed. The native segment will be used instead. Reason: Invalid argument: Node Tacotron_model/inference/encoder_LSTM/bidirectional_rnn/bw/bw/while/encoder_bw_LSTM/BiasAdd should have an input named 'Tacotron_model/inference/encoder_LSTM/bidirectional_rnn/bw/bw/while/encoder_bw_LSTM/MatMul' but it is not available

Is this because of a wrong process of changing the CKPT to a saved_model? I can't figure it out.

If anyone has met this trouble before, please give me some advice. Thanks a lot!

austingg avatar austingg commented on August 15, 2024

Recently I have been working on TF-TRT on a Tesla T4. I have found that SSD-like models speed up little with TF-TRT FP32 and about 2x with FP16 (which uses Tensor Cores). Besides, I found that the NMS op runs on the CPU, so memcpyHtoD costs a lot of time.

pooyadavoodi avatar pooyadavoodi commented on August 15, 2024

TF-TRT has got a lot of improvements in 1.14. Please use that one.

The NVIDIA container that has TF1.14 is 19.07: https://docs.nvidia.com/deeplearning/frameworks/tf-trt-user-guide/index.html#matrix

taorui-plus avatar taorui-plus commented on August 15, 2024

Do you have the code you used to generate the TF-TRT version of your model? In your optimized graph, do you have any TRTEngineOp node?

len([1 for n in frozen_graph.node if str(n.op)=='TRTEngineOp'])

I also encountered the same problem. After conversion, the number of TRTEngineOps is 0:

time:{'loading_frozen_graph': 0.7235217094421387, 'trt_conversion': 11.104097127914429}
num_nodes:{'tftrt_total': 789, 'loaded_frozen_graph': 985, 'trt_only': 0}
graph_sizes:{'loaded_frozen_graph': 233293316, 'trt': 425277533}

taorui-plus avatar taorui-plus commented on August 15, 2024

For NMS, if you can use combined_non_max_suppression in your graph, then you get much better speedup, esp because TF-TRT optimizes that.

If you use the object detection API, you can use the submodule of tensorflow/models to get combined_nms as follows:

  • The config file that you need to change for NMS is pipeline.config.
  • In the post_processing section of the config file, there is batch_non_max_suppression that specifies NMS configurations. Add this new field to the NMS config: combined_nms: true

Hello pooyadavoodi:
I recently wanted to try deploying the model with TensorRT, see: https://docs.nvidia.com/deeplearning/frameworks/tf-trt-user-guide/index.html#matrix, using:

    converter = trt.TrtGraphConverter(input_graph_def=frozen_graph,
                                      nodes_blacklist=['time_distributed_1/Reshape_1'],
                                      max_batch_size=1,
                                      precision_mode='FP16',
                                      is_dynamic_op=False,
                                      max_workspace_size_bytes=1 << 32)
    trt_graph = converter.convert()  # produces the stats below

But the converted graph is bigger:

time:{'loading_frozen_graph': 0.7235217094421387, 'trt_conversion': 11.104097127914429}
num_nodes:{'tftrt_total': 789, 'loaded_frozen_graph': 985, 'trt_only': 0}
graph_sizes:{'loaded_frozen_graph': 233293316, 'trt': 425277533}

I tried adjusting the above parameters, but the size of the graph and the number of nodes did not change. What should I do next, and what documents do I need to read? What is wrong with my current usage?

pooyadavoodi avatar pooyadavoodi commented on August 15, 2024

trt_only: 0 suggests no TensorRT node is created. It's impossible to tell why without looking at the log.

Could you rerun the conversion with verbose logging and post the log? See https://docs.nvidia.com/deeplearning/frameworks/tf-trt-user-guide/index.html#verbose

I suppose 'time_distributed_1/Reshape_1' is the output tensor of your model?
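If it helps, verbose TF-TRT logging is usually enabled through TensorFlow's VLOG environment variables; a sketch (the module names match the source files visible in the logs above, but check the linked guide for the exact set):

    export TF_CPP_VMODULE=segment=2,convert_graph=2,convert_nodes=2,trt_engine_op=2
    python convert_script.py   # placeholder for your conversion script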

pooyadavoodi avatar pooyadavoodi commented on August 15, 2024

Closing. Please reopen in case you still see the issue.

anuar12 avatar anuar12 commented on August 15, 2024

I didn't get any inference boost with FP16 conversion on a 2080 Ti either. The inference speed is the same.
I used TF 1.14 and converted a Keras RetinaNet into a SavedModel.
Here are some of the code and logs, in case they're helpful:

minimum_segment_size = 2   # trailing commas removed: they would turn these into tuples
maximum_cached_engines = 100
precision_mode = "FP16"
converter = trt.TrtGraphConverter(
        input_saved_model_dir=saved_model_dir,
        precision_mode=precision_mode,
        minimum_segment_size=minimum_segment_size,
        is_dynamic_op=True,
        max_batch_size=32,
        max_workspace_size_bytes=7000000000,
        maximum_cached_engines=maximum_cached_engines)
frozen_graph = converter.convert()
pciBusID: 0000:09:00.0
2019-10-17 11:10:26.445500: I tensorflow/stream_executor/platform/default/dso_loader.cc:42] Successfully opened dynamic library libcudart.so.10.0
2019-10-17 11:10:26.445513: I tensorflow/stream_executor/platform/default/dso_loader.cc:42] Successfully opened dynamic library libcublas.so.10.0
2019-10-17 11:10:26.445524: I tensorflow/stream_executor/platform/default/dso_loader.cc:42] Successfully opened dynamic library libcufft.so.10.0
2019-10-17 11:10:26.445534: I tensorflow/stream_executor/platform/default/dso_loader.cc:42] Successfully opened dynamic library libcurand.so.10.0
2019-10-17 11:10:26.445545: I tensorflow/stream_executor/platform/default/dso_loader.cc:42] Successfully opened dynamic library libcusolver.so.10.0
2019-10-17 11:10:26.445555: I tensorflow/stream_executor/platform/default/dso_loader.cc:42] Successfully opened dynamic library libcusparse.so.10.0
2019-10-17 11:10:26.445566: I tensorflow/stream_executor/platform/default/dso_loader.cc:42] Successfully opened dynamic library libcudnn.so.7
2019-10-17 11:10:26.445626: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:1005] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2019-10-17 11:10:26.446358: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:1005] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2019-10-17 11:10:26.447040: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1763] Adding visible gpu devices: 0
2019-10-17 11:10:26.447065: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1181] Device interconnect StreamExecutor with strength 1 edge matrix:
2019-10-17 11:10:26.447072: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1187]      0 
2019-10-17 11:10:26.447078: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1200] 0:   N 
2019-10-17 11:10:26.447212: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:1005] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2019-10-17 11:10:26.447948: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:1005] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2019-10-17 11:10:26.449164: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1326] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 8961 MB memory) -> physical GPU (device: 0, name: GeForce RTX 2080 Ti, pci bus id: 0000:09:00.0, compute capability: 7.5)
2019-10-17 11:10:27.924940: I tensorflow/core/grappler/optimizers/meta_optimizer.cc:716] Optimization results for grappler item: tf_graph
2019-10-17 11:10:27.924977: I tensorflow/core/grappler/optimizers/meta_optimizer.cc:718]   constant folding: Graph size after: 2433 nodes (-214), 3289 edges (-273), time = 735.86ms.
2019-10-17 11:10:27.924983: I tensorflow/core/grappler/optimizers/meta_optimizer.cc:718]   layout: Graph size after: 2490 nodes (57), 3343 edges (54), time = 141.214ms.
2019-10-17 11:10:27.924988: I tensorflow/core/grappler/optimizers/meta_optimizer.cc:718]   constant folding: Graph size after: 2474 nodes (-16), 3343 edges (0), time = 290.99ms.
graph_size(MB)(trt): 221.7
num_nodes(tftrt_total): 2474
num_nodes(trt_only): 0
time(s) (trt_conversion): 5.4649

It would be great if there were better documentation with a simple example (especially if the API has changed) so that we can debug on our own. :)

Programmerwyl avatar Programmerwyl commented on August 15, 2024

I found that the reason there was no TRTEngineOp was not the code but the hardware platform. I ran the same code with mobilenet_v2 on a PC: after optimization, TRTEngineOp was 0 with 426 total nodes. But when I ran it on a TX2, the graph was reduced to only 3 nodes, and it was much faster.

Programmerwyl avatar Programmerwyl commented on August 15, 2024

On TX2, the log info is:
2019-10-18 10:21:59.917225: I tensorflow/core/grappler/optimizers/meta_optimizer.cc:741] constant folding: Graph size after: 427 nodes (-262), 436 edges (-262), time = 339.122ms.
2019-10-18 10:21:59.917289: I tensorflow/core/grappler/optimizers/meta_optimizer.cc:741] layout: Graph size after: 435 nodes (8), 438 edges (2), time = 68.517ms.
2019-10-18 10:21:59.917336: I tensorflow/core/grappler/optimizers/meta_optimizer.cc:741] constant folding: Graph size after: 429 nodes (-6), 438 edges (0), time = 118.121ms.
2019-10-18 10:21:59.917401: I tensorflow/core/grappler/optimizers/meta_optimizer.cc:741] TensorRTOptimizer: Graph size after: 3 nodes (-426), 2 edges (-436), time = 57619.9141ms.

Programmerwyl avatar Programmerwyl commented on August 15, 2024

Generating the optimized graph on the PC and running it on the TX2:
trt new
mobilenet_v2 trt 13.124145984649658
mobilenet_v2 trt 0.02857375144958496
mobilenet_v2 trt 0.025561094284057617
mobilenet_v2 trt 0.024762630462646484
mobilenet_v2 trt 0.024967432022094727
mobilenet_v2 trt 0.025426864624023438
mobilenet_v2 trt 0.027881860733032227
mobilenet_v2 trt 0.022449254989624023
mobilenet_v2 trt 0.02154541015625
mobilenet_v2 trt 0.021519184112548828
average(sec):0.034324301613701716,fps:29.1338775440901

Generating the optimized graph on the TX2 and running it on the TX2:

mobilenet_v2 trt 4.066771030426025
mobilenet_v2 trt 0.01324772834777832
mobilenet_v2 trt 0.010189056396484375
mobilenet_v2 trt 0.011507987976074219
mobilenet_v2 trt 0.012037277221679688
mobilenet_v2 trt 0.009507417678833008
mobilenet_v2 trt 0.01143336296081543
mobilenet_v2 trt 0.010309219360351562
mobilenet_v2 trt 0.01093292236328125
mobilenet_v2 trt 0.012867927551269531
average(sec):0.01874608463711209,fps:53.344472691661444

Programmerwyl avatar Programmerwyl commented on August 15, 2024

But there is still a problem
when I generate the optimized graph on the TX2 and run it on the TX2:

graph_def = tf.GraphDef()  # the snippet needs an empty GraphDef to parse into
with tf.gfile.FastGFile(graph_path, "rb") as f:
    graph_def.ParseFromString(f.read())
    tf.import_graph_def(graph_def, name='')

It is too slow: it costs 290.40938997268677 seconds, especially the line tf.import_graph_def(graph_def, name='').

Mythos-Rudy avatar Mythos-Rudy commented on August 15, 2024

@Eloring Hello, I tried your method with tensorflow 1.14. It's useful, thank you very much!
But I am really confused about why the TRT model is bigger than the original model, especially when I convert the model from FP32 to FP16 or INT8:
graph_size(MB)(native_tf): 27.4
graph_size(MB)(trt): 53.2
num_nodes(native_tf): 6127
num_nodes(tftrt_total): 2894
num_nodes(trt_only): 3

Programmerwyl avatar Programmerwyl commented on August 15, 2024

@Mythos-Rudy
Hi, how did you install tensorflow 1.14?
Where can I get the installation package? Can you give me a link?
Thanks

EsmeYi avatar EsmeYi commented on August 15, 2024

@anuar12
I am not sure whether the 2080 Ti supports FP16: https://docs.nvidia.com/deeplearning/sdk/tensorrt-support-matrix/index.html#hardware-precision-matrix

Mythos-Rudy avatar Mythos-Rudy commented on August 15, 2024

@Programmerwyl
Sorry, I don't know what you mean.
I installed TensorFlow in a Linux env, so I just ran pip install tensorflow==1.14.0.

Programmerwyl avatar Programmerwyl commented on August 15, 2024

@Eloring
Thank you very much for your reply. I have solved the problem of slow graph loading.
Thanks again for your solution.

Programmerwyl avatar Programmerwyl commented on August 15, 2024

@Mythos-Rudy
I installed TensorRT through conda and then ran sudo pip3 install tensorflow-gpu==1.14.0, but it does not work.
The development environment is Ubuntu 18.04, and the computer's graphics card is a 1060.
The tftrt_total is zero.

anuar12 avatar anuar12 commented on August 15, 2024

@Eloring Yes, the 2080 Ti has to support FP16 because it has compute capability 7.5 with tensor cores.

hudengjunai avatar hudengjunai commented on August 15, 2024

I solved the problem by using nvidia-docker with the TensorFlow 19.04 container. Referring to the TF-TRT user guide, I found that there are only two ways to install TF-TRT: using the container or compiling TensorFlow with TensorRT integration from source.

I have compiled tensorflow with TensorRT, but there is still no speedup. Did I compile it wrong? How did you compile TensorFlow with TensorRT, and with which TF and TRT versions? Could you please give me some tips?

Ekta246 avatar Ekta246 commented on August 15, 2024

I solved the problem by using nvidia-docker with the TensorFlow 19.04 container. Referring to the TF-TRT user guide, I found that there are only two ways to install TF-TRT: using the container or compiling TensorFlow with TensorRT integration from source.

Are you sure there is no binary support for the TF-TRT integration?
I believe the TF-TRT GitHub describes a binary installation too. How about using the tensorflow.python.compiler library (the binary installation route) if you want to avoid the bulky Bazel build configuration of building TensorFlow from source?
