Profiling results
-------------------------> Profiling Report <-------------------------
Place: All
Time unit: ms
Sorted by total time in descending order in the same thread
Event Calls Total CPU Time (Ratio) GPU Time (Ratio) Min. Max. Ave. Ratio.
thread0::mul 370666 32332.5 12646.616698 (0.391142) 19685.930916 (0.608858) 0.026226 95.3523 0.0872283 0.341415
thread0::elementwise_add 561008 16561.7 14064.876774 (0.849241) 2496.820884 (0.150759) 0.017456 9.10075 0.0295213 0.174883
thread0::softmax 60108 12311.9 4285.399399 (0.348069) 8026.520592 (0.651931) 0.18358 6.97239 0.20483 0.130007
thread0::matmul 125225 7621.9 4999.121478 (0.655889) 2622.776986 (0.344111) 0.026823 6.93527 0.0608656 0.0804832
thread0::layer_norm 125225 6890.62 3494.377954 (0.507121) 3396.237224 (0.492879) 0.046834 6.54613 0.0550259 0.0727612
thread0::transpose2 240432 6706.17 4940.736832 (0.736745) 1765.436248 (0.263255) 0.022342 6.87583 0.0278922 0.0708136
thread0::dropout 190342 4844.09 4380.131806 (0.904222) 463.956297 (0.095778) 0.019757 1.91613 0.0254494 0.051151
thread0::reshape2 240432 1970.44 1970.427168 (0.999993) 0.013691 (0.000007) 0.005761 0.758181 0.00819542 0.0208068
thread0::scale 70126 1559.56 1413.281218 (0.906202) 146.283474 (0.093798) 0.017739 6.40694 0.0222395 0.0164682
thread0::relu 60108 1533.83 1155.052049 (0.753053) 378.773663 (0.246947) 0.020695 1.91584 0.0255178 0.0161964
thread0::stack 5009 670.06 620.412429 (0.925906) 49.647450 (0.074094) 0.104312 3.43668 0.133771 0.00707548
thread0::lookup_table 15027 499.303 409.325315 (0.819793) 89.978163 (0.180207) 0.025125 0.087498 0.0332271 0.00527238
thread0::fetch 5009 260.127 260.068816 (0.999777) 0.057970 (0.000223) 0.044476 1.01272 0.0519319 0.0027468
thread0::TensorCopy:GPU->CPU 5009 210.123 210.076980 (0.999782) 0.045869 (0.000218) 0.036519 0.996315 0.0419491 0.00221879
thread0::GpuMemcpyAsync:CPU->GPU 20036 204.476 173.512282 (0.848571) 30.963505 (0.151429) 0.007 1.77937 0.0102054 0.00215916
thread0::GpuMemcpySync:GPU->CPU 5009 203.349 194.875273 (0.958327) 8.474119 (0.041673) 0.035868 0.330009 0.0405968 0.00214726
thread0::slice 5009 153.525 136.187490 (0.887070) 17.337539 (0.112930) 0.025069 1.98001 0.0306498 0.00162114
thread0::tanh 5009 110.729 100.944808 (0.911638) 9.784189 (0.088362) 0.017612 1.92101 0.022106 0.00116924
thread0::feed 20036 57.2692 57.263221 (0.999896) 0.005970 (0.000104) 0.00156 0.032812 0.00285831 0.000604732
from benchmark.
Timeline analysis
- Many reshape and transpose operations
- Does softmax use cuDNN?
Optimization plan
- Fuse multiple lookup_table ops, @Xreki, due 2019-08-30
- Determine whether the reshape ops in the model are necessary and whether they can be removed; many reshape ops appear to have identical input and output shapes. @Xreki, see #165 (comment)
  - They cannot be removed directly
- At inference time, dropout can be converted into scale, @Xreki, due 2019-08-22, see PaddlePaddle/Paddle#19297
- Merge the multiple branches of multi-head attention, @zhaoyuchen2018, due 2019-08-30
- elementwise + activation computation: a pass and a fused op already exist, but they may need to be enhanced @Xreki
  - #165 (comment)
  - Fused into fc_op, #165 (comment), 2.1% speedup (2019-09-10)
    - FC GPU kernel: PaddlePaddle/Paddle#19687
    - FC fusion pass: PaddlePaddle/Paddle#19733
- elementwise + softmax fusion, @zhaoyuchen2018
- elementwise + layer_norm fusion, @Xreki
  - fc + elementwise_add + layer_norm fusion, PaddlePaddle/Paddle#19776, #165 (comment) (2019-09-11)
- LN + FC can be fused; after fusion, LN's gamma and beta can both be removed
  - FC(LN(X)) => W2*(W1X+B1)+B2 => W2W1X+W2B1+B2
  - W2W1 and W2B1+B2 can both be computed in advance, so this part ultimately becomes a single FC
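The folding above can be checked numerically. A minimal numpy sketch, treating LN's per-channel gamma as a diagonal matrix and using the row-vector convention X @ W (all names illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((128, 768))
W1 = np.diag(rng.standard_normal(768))  # LN's gamma as a diagonal matrix
B1 = rng.standard_normal(768)           # LN's beta
W2 = rng.standard_normal((768, 768))    # FC weight
B2 = rng.standard_normal(768)           # FC bias

# Unfused: FC(LN_affine(X)) = (X @ W1 + B1) @ W2 + B2
y_ref = (X @ W1 + B1) @ W2 + B2

# Fused: precompute W2W1 and W2B1 + B2, leaving a single FC
W_fused = W1 @ W2          # "W2W1" in the row-vector convention
B_fused = B1 @ W2 + B2     # "W2B1 + B2"
y_fused = X @ W_fused + B_fused

assert np.allclose(y_ref, y_fused)
```

Note that this only folds LN's affine part into the following FC; the normalization itself (mean/variance) still has to run first.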
Intel-related work
- Remove the reshape and transpose ops in the attention module, PaddlePaddle/Paddle#16342
- Extend matmul to support multi-head, PaddlePaddle/Paddle#18570
Determine whether the reshape ops in the model are necessary and whether they can be removed; many reshape ops appear to have identical input and output shapes. @Xreki
I0814 10:13:53.083331 11085 operator.cc:170] CPUPlace Op(reshape2), inputs:{Shape[], ShapeTensor[], X[fc_0.tmp_1:float[1, 128, 768]({})]}, outputs:{Out[fc_0.tmp_1:float[1, 128, 768]({})], XShape[reshape2_0.tmp_0:[-1]({{}})]}.
I0814 10:13:53.083364 11085 operator.cc:993] expected_kernel_key:data_type[float]:data_layout[ANY_LAYOUT]:place[CPUPlace]:library_type[PLAIN]
I0814 10:13:53.083400 11085 tensor_util.cu:28] TensorCopy 1, 128, 12, 64 from CPUPlace to CPUPlace
I0814 10:13:53.083407 11085 tensor_util.cu:43] Skip copy the same data async from CPUPlace to CPUPlace
I0814 10:13:53.083420 11085 operator.cc:190] CPUPlace Op(reshape2), inputs:{Shape[], ShapeTensor[], X[fc_0.tmp_1:float[1, 128, 12, 64]({})]}, outputs:{Out[fc_0.tmp_1:float[1, 128, 12, 64]({})], XShape[reshape2_0.tmp_0:[-1]({{}})]}.
- From the detailed log, reshape2's input dims are [1, 128, 768] and its output dims are [1, 128, 12, 64], so it cannot be removed directly.
- Because reshape2 executes in-place, the input dims shown in the graph are incorrect.
Conclusion: it cannot be removed directly
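A quick numpy illustration of why this reshape2 is not removable even though it moves no data: it splits the 768 hidden dims into 12 heads of 64 (shapes taken from the log above):

```python
import numpy as np

x = np.arange(1 * 128 * 768, dtype=np.float32).reshape(1, 128, 768)
y = x.reshape(1, 128, 12, 64)  # the same split that reshape2 performs here

assert y.shape == (1, 128, 12, 64)
assert np.shares_memory(x, y)  # no copy: only the dims metadata changes
```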
Summary of optimization results
- Unit: ms/sample
Version | P40 time (Total) | Time (excluding first run) | Test date | PR | Description | Speedup
---|---|---|---|---|---|---
0 | 8.36 | - | 2019-08-14 | - | baseline | -
1 (merged) | 8.22 | 8.14 | 2019-08-22 | PaddlePaddle/Paddle#19297 | Remove dropout_op | 1.7%
2 (merged) | 7.37 | 6.96 | 2019-08-22 | PaddlePaddle/Paddle#19346 | Use cuDNN for softmax | 10%
3 (merged) | 7.20 | 6.82 | 2019-09-10 | PaddlePaddle/Paddle#19687 PaddlePaddle/Paddle#19733 | Fuse the FC computation, see #165 (comment) | 2.1%
4 (merged) | 6.93 | 6.55 | 2019-09-11 | PaddlePaddle/Paddle#19776 | Fuse FC + elementwise_add + layer_norm, see #165 (comment) | 4%
5 (merged) | 6.13 | 5.73 | 2019-09-19 | PaddlePaddle/Paddle#20167 PaddlePaddle/Paddle#20151 | Fuse multi-head attention, see #165 (comment) | 14%
3 | 7.26 | 6.88 | 2019-09-03 | PaddlePaddle/Paddle#19610 | Optimize stack_op, see #165 (comment) | 1.2%
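Version 1 (PaddlePaddle/Paddle#19297) rests on the observation that, at inference time, dropout in Paddle's "downgrade_in_infer" mode is just a multiplication by (1 - dropout_prob), so the op can be replaced by scale. A minimal sketch (the helper name is illustrative, not Paddle's API):

```python
import numpy as np

def dropout_infer(x, dropout_prob):
    # Inference-time dropout ("downgrade_in_infer"): no mask, only scaling.
    return x * (1.0 - dropout_prob)

x = np.random.randn(4, 768)
p = 0.1

# Equivalent to a plain scale op with scale = 1 - dropout_prob.
assert np.allclose(dropout_infer(x, p), x * 0.9)
```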
On top of version 1, we enabled fuse_elewise_add_act_pass at inference time, which fuses the elementwise_add + relu computation. It successfully matched 14 subgraphs, but overall performance dropped.
+ /paddle/build_paddle/build_docker_manylinux_cuda90/paddle/fluid/inference/tests/api/samples/ernie_tester --logtostderr --model_dir=/data/ernie/model --data=/data/ernie/seq128_data/test_ds --repeat=1 --warmup_steps=1 --use_gpu=true --use_analysis=true --print_outputs=false --profile=false
[==========] Running 1 test from 1 test case.
[----------] Global test environment set-up.
[----------] 1 test from ernie
[ RUN ] ernie.profile
--- Running analysis [ir_graph_build_pass]
--- Running analysis [ir_graph_clean_pass]
--- Running analysis [ir_analysis_pass]
--- Running IR pass [is_test_pass]
--- Running IR pass [simplify_with_basic_ops_pass]
--- Running IR pass [fuse_elewise_add_act_pass]
--- detected 14 subgraphs
--- Running IR pass [conv_affine_channel_fuse_pass]
--- Running IR pass [conv_eltwiseadd_affine_channel_fuse_pass]
--- Running IR pass [conv_bn_fuse_pass]
--- Running IR pass [conv_eltwiseadd_bn_fuse_pass]
--- Running IR pass [conv_elementwise_add_act_fuse_pass]
--- Running IR pass [conv_elementwise_add2_act_fuse_pass]
--- Running IR pass [conv_elementwise_add_fuse_pass]
--- Running IR pass [transpose_flatten_concat_fuse_pass]
--- Running IR pass [runtime_context_cache_pass]
--- Running analysis [ir_params_sync_among_devices_pass]
WARNING: Logging before InitGoogleLogging() is written to STDERR
I0828 08:49:37.436837 7631 ir_params_sync_among_devices_pass.cc:41] Sync params from CPU to GPU
--- Running analysis [adjust_cudnn_workspace_size_pass]
--- Running analysis [inference_op_replace_pass]
--- Running analysis [ir_graph_to_program_pass]
I0828 08:49:37.726634 7631 analysis_predictor.cc:475] == optimize end ==
I0828 08:49:38.710613 7631 ernie_tester.cc:210] Load 5010 samples from /data/ernie/seq128_data/test_ds
W0828 08:49:39.109542 7631 device_context.cc:261] Please NOTE: device: 0, CUDA Capability: 61, Driver API Version: 10.1, Runtime API Version: 9.0
W0828 08:49:39.115780 7631 device_context.cc:269] device: 0, cuDNN Version: 7.3.
W0828 08:49:39.116070 7631 naive_executor.cc:43] The NaiveExecutor can not work properly if the cmake flag ON_INFER is not set.
W0828 08:49:39.116091 7631 naive_executor.cc:45] Unlike the training phase, all the scopes and variables will be reused to save the allocation overhead.
W0828 08:49:39.116104 7631 naive_executor.cc:48] Please re-compile the inference library by setting the cmake flag ON_INFER=ON if you are running Paddle Inference
W0828 08:49:39.127769 7631 naive_executor.cc:43] The NaiveExecutor can not work properly if the cmake flag ON_INFER is not set.
W0828 08:49:39.127822 7631 naive_executor.cc:45] Unlike the training phase, all the scopes and variables will be reused to save the allocation overhead.
W0828 08:49:39.127832 7631 naive_executor.cc:48] Please re-compile the inference library by setting the cmake flag ON_INFER=ON if you are running Paddle Inference
W0828 08:49:39.137923 7631 naive_executor.cc:43] The NaiveExecutor can not work properly if the cmake flag ON_INFER is not set.
W0828 08:49:39.137972 7631 naive_executor.cc:45] Unlike the training phase, all the scopes and variables will be reused to save the allocation overhead.
W0828 08:49:39.137981 7631 naive_executor.cc:48] Please re-compile the inference library by setting the cmake flag ON_INFER=ON if you are running Paddle Inference
W0828 08:49:39.147920 7631 naive_executor.cc:43] The NaiveExecutor can not work properly if the cmake flag ON_INFER is not set.
W0828 08:49:39.147967 7631 naive_executor.cc:45] Unlike the training phase, all the scopes and variables will be reused to save the allocation overhead.
W0828 08:49:39.147977 7631 naive_executor.cc:48] Please re-compile the inference library by setting the cmake flag ON_INFER=ON if you are running Paddle Inference
W0828 08:49:39.157951 7631 naive_executor.cc:43] The NaiveExecutor can not work properly if the cmake flag ON_INFER is not set.
W0828 08:49:39.157996 7631 naive_executor.cc:45] Unlike the training phase, all the scopes and variables will be reused to save the allocation overhead.
W0828 08:49:39.158005 7631 naive_executor.cc:48] Please re-compile the inference library by setting the cmake flag ON_INFER=ON if you are running Paddle Inference
I0828 08:50:21.810214 7631 ernie_tester.cc:345] Run 5010 samples, average latency: 8.60096 ms per sample.
I0828 08:50:21.810283 7631 ernie_tester.cc:350] Run 5009 samples, average latency [exclude 1 warmup steps]: 8.51942 ms per sample.
[ OK ] ernie.profile (46355 ms)
[----------] 1 test from ernie (46355 ms total)
[----------] Global test environment tear-down
[==========] 1 test from 1 test case ran. (46356 ms total)
[ PASSED ] 1 test.
Fuse the fc, reshape, transpose, and scale ops of the 3 paths into a single path. Test results on P40:
Profiling results
Event Calls Total CPU Time (Ratio) GPU Time (Ratio) Min. Max. Ave. Ratio.
thread0::mul 250450 30086.2 10700.311709 (0.355655) 19385.885160 (0.644345) 0.027581 88.3573 0.120129 0.39598
thread0::elementwise_add 440792 13893.1 11593.975031 (0.834514) 2299.111468 (0.165486) 0.019193 12.8221 0.0315185 0.182854
thread0::matmul 125225 7716.83 5090.761228 (0.659696) 2626.072493 (0.340304) 0.02993 7.32385 0.0616237 0.101565
thread0::layer_norm 125225 7091.46 3754.750949 (0.529475) 3336.704465 (0.470525) 0.048728 7.12371 0.0566297 0.0933344
thread0::softmax 60108 4259.71 1991.152624 (0.467438) 2268.559610 (0.532562) 0.063841 9.24836 0.0708676 0.0560643
thread0::transpose2 120216 4126.7 2756.571446 (0.667985) 1370.123630 (0.332015) 0.025956 9.50509 0.0343273 0.0543136
thread0::split 60108 2998.5 1830.882725 (0.610599) 1167.620505 (0.389401) 0.043009 19.4484 0.0498853 0.0394649
thread0::relu 60108 1620.75 1233.311364 (0.760953) 387.433865 (0.239047) 0.021955 9.30877 0.0269639 0.0213315
thread0::reshape2 120216 1279.38 1279.361341 (0.999987) 0.016020 (0.000013) 0.008744 1.83394 0.0106423 0.0168386
thread0::stack 5009 673.79 610.591735 (0.906205) 63.198156 (0.093795) 0.110177 11.1131 0.134516 0.0088681
thread0::lookup_table 15027 517.377 424.928699 (0.821314) 92.448017 (0.178686) 0.025679 6.76539 0.0344298 0.00680947
thread0::fetch 5009 342.348 342.270593 (0.999773) 0.077819 (0.000227) 0.057001 1.4229 0.0683467 0.00450583
thread0::TensorCopy:GPU->CPU 5009 261.036 260.979443 (0.999784) 0.056311 (0.000216) 0.04429 0.284401 0.0521133 0.00343563
thread0::scale 10018 253.319 237.712034 (0.938391) 15.606845 (0.061609) 0.020368 0.161032 0.0252864 0.00333406
thread0::GpuMemcpyAsync:CPU->GPU 20036 248.089 208.310791 (0.839661) 39.778466 (0.160339) 0.008811 1.84374 0.0123822 0.00326523
thread0::GpuMemcpySync:GPU->CPU 5009 241.886 232.381808 (0.960708) 9.504255 (0.039292) 0.040716 0.390521 0.0482903 0.00318359
thread0::slice 5009 160.703 143.249188 (0.891391) 17.453703 (0.108609) 0.02768 0.07693 0.0320828 0.0021151
thread0::tanh 5009 117.589 107.550767 (0.914632) 10.038396 (0.085368) 0.01952 0.884425 0.0234756 0.00154765
thread0::feed 20036 90.2748 90.263520 (0.999875) 0.011311 (0.000125) 0.002949 0.964233 0.00450563 0.00118815
stack is similar in functionality to concat. Attempts:
Method 1: replace stack with concat. Running it produced errors; stack and concat output different dims, so this is not feasible.
Dims output by stack:
I0820 09:17:21.934635 17390 operator.cc:168] CUDAPlace(0) Op(stack), inputs:{X[scale_0.tmp_0:float[1, 128, 128]({}), scale_0.tmp_0:float[1, 128, 128]({}), scale_0.tmp_0:float[1, 128, 128]({}), scale_0.tmp_0:float[1, 128, 128]({}), scale_0.tmp_0:float[1, 128, 128]({}), scale_0.tmp_0:float[1, 128, 128]({}), scale_0.tmp_0:float[1, 128, 128]({}), scale_0.tmp_0:float[1, 128, 128]({}), scale_0.tmp_0:float[1, 128, 128]({}), scale_0.tmp_0:float[1, 128, 128]({}), scale_0.tmp_0:float[1, 128, 128]({}), scale_0.tmp_0:float[1, 128, 128]({})]}, outputs:{Y[stack_0.tmp_0:float[1, 12, 128, 128]({})]}.
I0820 09:17:21.934787 17390 operator.cc:188] CUDAPlace(0) Op(stack), inputs:{X[scale_0.tmp_0:float[1, 128, 128]({}), scale_0.tmp_0:float[1, 128, 128]({}), scale_0.tmp_0:float[1, 128, 128]({}), scale_0.tmp_0:float[1, 128, 128]({}), scale_0.tmp_0:float[1, 128, 128]({}), scale_0.tmp_0:float[1, 128, 128]({}), scale_0.tmp_0:float[1, 128, 128]({}), scale_0.tmp_0:float[1, 128, 128]({}), scale_0.tmp_0:float[1, 128, 128]({}), scale_0.tmp_0:float[1, 128, 128]({}), scale_0.tmp_0:float[1, 128, 128]({}), scale_0.tmp_0:float[1, 128, 128]({})]}, outputs:{Y[stack_0.tmp_0:float[1, 12, 128, 128]({})]}.
.
.
.
I0820 09:17:21.935680 17390 operator.cc:168] CUDAPlace(0) Op(elementwise_add), inputs:{X[matmul_1.tmp_0:float[1, 12, 128, 128]({})], Y[stack_0.tmp_0:float[1, 12, 128, 128]({})]}, outputs:{Out[tmp_2:float[1, 12, 128, 128]({})]}.
I0820 09:17:21.935708 17390 operator.cc:188] CUDAPlace(0) Op(elementwise_add), inputs:{X[matmul_1.tmp_0:float[1, 12, 128, 128]({})], Y[stack_0.tmp_0:float[1, 12, 128, 128]({})]}, outputs:{Out[tmp_2:float[1, 12, 128, 128]({})]}.
Dims output by concat:
I0903 05:49:33.774837 17211 operator.cc:165] CUDAPlace(0) Op(concat), inputs:{X[scale_0.tmp_0:float[1, 128, 128]({}), scale_0.tmp_0:float[1, 128, 128]({}), scale_0.tmp_0:float[1, 128, 128]({}), scale_0.tmp_0:float[1, 128, 128]({}), scale_0.tmp_0:float[1, 128, 128]({}), scale_0.tmp_0:float[1, 128, 128]({}), scale_0.tmp_0:float[1, 128, 128]({}), scale_0.tmp_0:float[1, 128, 128]({}), scale_0.tmp_0:float[1, 128, 128]({}), scale_0.tmp_0:float[1, 128, 128]({}), scale_0.tmp_0:float[1, 128, 128]({}), scale_0.tmp_0:float[1, 128, 128]({})]}, outputs:{Out[stack_0.tmp_0:[0]({{}})]}.
I0903 05:49:33.774863 17211 operator.cc:988] expected_kernel_key:data_type[float]:data_layout[ANY_LAYOUT]:place[CUDAPlace(0)]:library_type[PLAIN]
I0903 05:49:33.774976 17211 operator.cc:185] CUDAPlace(0) Op(concat), inputs:{X[scale_0.tmp_0:float[1, 128, 128]({}), scale_0.tmp_0:float[1, 128, 128]({}), scale_0.tmp_0:float[1, 128, 128]({}), scale_0.tmp_0:float[1, 128, 128]({}), scale_0.tmp_0:float[1, 128, 128]({}), scale_0.tmp_0:float[1, 128, 128]({}), scale_0.tmp_0:float[1, 128, 128]({}), scale_0.tmp_0:float[1, 128, 128]({}), scale_0.tmp_0:float[1, 128, 128]({}), scale_0.tmp_0:float[1, 128, 128]({}), scale_0.tmp_0:float[1, 128, 128]({}), scale_0.tmp_0:float[1, 128, 128]({})]}, outputs:{Out[stack_0.tmp_0:float[1, 1536, 128]({})]}.
.
.
.
I0903 05:49:33.775012 17211 operator.cc:165] CUDAPlace(0) Op(elementwise_add), inputs:{X[matmul_1.tmp_0:float[1, 12, 128, 128]({})], Y[stack_0.tmp_0:float[1, 1536, 128]({})]}, outputs:{Out[tmp_2:[0]({{}})]}.
Method 2: re-implement stack by following concat's implementation.
I0903 09:57:36.458448 29283 ernie_tester.cc:345] Run 5010 samples, average latency: 7.26373 ms per sample.
I0903 09:57:36.458526 29283 ernie_tester.cc:350] Run 5009 samples, average latency [exclude 1 warmup steps]: 6.87974 ms per sample.
The above test results are based on version 2.
I0920 09:15:41.505820 25415 ernie_tester.cc:345] Run 5010 samples, average latency: 6.92085 ms per sample.
I0920 09:15:41.505903 25415 ernie_tester.cc:350] Run 5009 samples, average latency [exclude 1 warmup steps]: 6.51521 ms per sample.
The above test results are based on version 4.
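The mismatch in the logs above is exactly the difference between the two ops: stack inserts a new axis, while concat extends an existing one. A numpy sketch with the shapes from the logs:

```python
import numpy as np

xs = [np.zeros((1, 128, 128)) for _ in range(12)]

stacked = np.stack(xs, axis=1)         # inserts a new axis of size 12
concated = np.concatenate(xs, axis=1)  # grows axis 1 from 128 to 1536

assert stacked.shape == (1, 12, 128, 128)  # what elementwise_add expects
assert concated.shape == (1, 1536, 128)    # shape mismatch, hence the error
```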
Implement an FC GPU kernel: PaddlePaddle/Paddle#19687
FC fusion: PaddlePaddle/Paddle#19733
- fc = mul + elementwise_add + relu; the main goal is to fuse the elementwise_add and relu computations into fc.
- Test result, speedup: 2.1%
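What the fused fc computes, sketched in numpy (the helper mirrors the mul + elementwise_add + relu chain; it is illustrative, not Paddle's API):

```python
import numpy as np

def fc(x, w, b, with_relu=True):
    # One op replacing mul (x @ w), elementwise_add (+ b) and relu.
    out = x @ w + b
    return np.maximum(out, 0.0) if with_relu else out

x = np.random.randn(128, 768)
w = np.random.randn(768, 768)
b = np.random.randn(768)

ref = np.maximum(x @ w + b, 0.0)  # the unfused three-op chain
assert np.allclose(fc(x, w, b), ref)
```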
--- Running IR pass [fc_fuse_pass]
--- detected 12 subgraphs
WARNING: Logging before InitGoogleLogging() is written to STDERR
I0909 06:52:40.161475 7826 fc_fuse_pass.cc:122] with_relu: 1, found_fc_count: 12
--- detected 62 subgraphs
I0909 06:52:40.214682 7826 fc_fuse_pass.cc:122] with_relu: 0, found_fc_count: 74
.
.
.
I0909 06:51:05.903086 7821 ernie_tester.cc:345] Run 5010 samples, average latency: 7.19529 ms per sample.
I0909 06:51:05.903168 7821 ernie_tester.cc:350] Run 5009 samples, average latency [exclude 1 warmup steps]: 6.81763 ms per sample.
- Profile
-------------------------> Profiling Report <-------------------------
Place: All
Time unit: ms
Sorted by total time in descending order in the same thread
Event Calls Total CPU Time (Ratio) GPU Time (Ratio) Min. Max. Ave. Ratio.
thread0::fc 370666 37927.8 16796.211849 (0.442847) 21131.552513 (0.557153) 0.038417 81.8839 0.102323 0.511083
thread0::matmul 125225 7542.57 4853.428136 (0.643472) 2689.138129 (0.356528) 0.028322 6.91776 0.0602321 0.101637
thread0::layer_norm 125225 6785.64 3458.361240 (0.509659) 3327.277901 (0.490341) 0.046506 6.78461 0.0541876 0.0914376
thread0::transpose2 240432 6648.82 4993.952105 (0.751103) 1654.869939 (0.248897) 0.022683 6.97236 0.0276536 0.089594
thread0::elementwise_add 190342 4846.03 3903.958820 (0.805599) 942.072546 (0.194401) 0.018709 7.06821 0.0254596 0.0653011
thread0::softmax 60108 4155.86 1873.948821 (0.450917) 2281.914897 (0.549083) 0.062806 7.22235 0.0691399 0.056001
thread0::reshape2 240432 2117.24 2117.212819 (0.999986) 0.030675 (0.000014) 0.006925 2.54413 0.008806 0.0285302
thread0::scale 70126 1639.13 1402.350356 (0.855547) 236.777127 (0.144453) 0.017914 3.11279 0.023374 0.0220875
thread0::stack 5009 651.86 601.077905 (0.922097) 50.782113 (0.077903) 0.106259 4.12809 0.130138 0.00878393
thread0::lookup_table 15027 479.71 388.084297 (0.808999) 91.625230 (0.191001) 0.024586 0.085041 0.0319232 0.00646417
thread0::fetch 5009 333.739 333.664085 (0.999776) 0.074696 (0.000224) 0.054763 2.98519 0.0666278 0.00449719
thread0::TensorCopy:GPU->CPU 5009 264.276 264.216676 (0.999777) 0.059003 (0.000223) 0.043246 2.97079 0.0527602 0.00356116
thread0::GpuMemcpySync:GPU->CPU 5009 244.643 235.804134 (0.963870) 8.839005 (0.036130) 0.04009 1.58306 0.0488407 0.00329661
thread0::GpuMemcpyAsync:CPU->GPU 20036 226.793 195.272127 (0.861014) 31.521131 (0.138986) 0.007882 1.17277 0.0113193 0.00305608
thread0::slice 5009 147.886 130.195979 (0.880383) 17.689569 (0.119617) 0.025414 0.103978 0.029524 0.00199278
thread0::tanh 5009 122.009 112.002519 (0.917989) 10.006004 (0.082011) 0.019717 0.094528 0.0243579 0.00164409
thread0::feed 20036 76.5896 76.581383 (0.999892) 0.008248 (0.000108) 0.002877 0.308487 0.0038226 0.00103206
fc + elementwise_add + layer_norm fusion
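The fused op replaces the fc -> elementwise_add -> layer_norm chain with a single kernel. A numpy sketch of the reference computation it must match (epsilon and shapes are illustrative):

```python
import numpy as np

def layer_norm(x, gamma, beta, eps=1e-5):
    # Normalize over the last axis, then apply the affine transform.
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return gamma * (x - mean) / np.sqrt(var + eps) + beta

x = np.random.randn(128, 768)
w = np.random.randn(768, 768)
b = np.random.randn(768)
y = np.random.randn(128, 768)   # the residual added by elementwise_add
gamma = np.random.randn(768)
beta = np.random.randn(768)

# Unfused: fc, then elementwise_add, then layer_norm.
ref = layer_norm((x @ w + b) + y, gamma, beta)
# The fused kernel computes the same expression in one pass.
fused = layer_norm(x @ w + b + y, gamma, beta)
assert np.allclose(ref, fused)
```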
--- Running IR pass [fc_elementwise_layernorm_fuse_pass]
--- detected 24 subgraphs
.
.
.
I0911 06:41:55.494792 7238 ernie_tester.cc:345] Run 5010 samples, average latency: 6.92849 ms per sample.
I0911 06:41:55.494873 7238 ernie_tester.cc:350] Run 5009 samples, average latency [exclude 1 warmup steps]: 6.54723 ms per sample.
- Profile
Event Calls Total CPU Time (Ratio) GPU Time (Ratio) Min. Max. Ave. Ratio.
thread0::fc 255459 24269.7 12361.669096 (0.509346) 11908.043273 (0.490654) 0.039433 100.758 0.0950043 0.346056
thread0::fused_fc_elementwise_layernorm 115207 17699.9 6439.933305 (0.363841) 11259.921061 (0.636159) 0.103025 6.90034 0.153635 0.252378
thread0::matmul 125225 7670.31 4985.129880 (0.649925) 2685.183125 (0.350075) 0.029412 6.91239 0.0612522 0.109369
thread0::transpose2 240432 6889.05 5254.959850 (0.762799) 1634.087984 (0.237201) 0.023907 6.72677 0.0286528 0.0982293
thread0::softmax 60108 4204.46 1927.686603 (0.458486) 2276.771181 (0.541514) 0.064249 2.83378 0.0699484 0.0599504
thread0::reshape2 240432 2228.86 2228.844417 (0.999994) 0.014348 (0.000006) 0.007575 3.75591 0.00927023 0.0317808
thread0::elementwise_add 75135 2160.14 1641.503732 (0.759904) 518.641058 (0.240096) 0.019577 6.72642 0.0287502 0.030801
thread0::scale 70126 1736.05 1498.515875 (0.863178) 237.529504 (0.136822) 0.018853 6.66338 0.0247561 0.0247539
thread0::stack 5009 664.48 615.254962 (0.925920) 49.224601 (0.074080) 0.10776 3.887 0.132657 0.00947466
thread0::layer_norm 10018 592.675 324.577447 (0.547648) 268.097787 (0.452352) 0.050985 6.49622 0.059161 0.00845082
thread0::lookup_table 15027 499.198 412.769395 (0.826866) 86.428197 (0.173134) 0.025502 6.66372 0.03322 0.00711794
thread0::fetch 5009 362.799 362.676040 (0.999660) 0.123295 (0.000340) 0.05508 0.625043 0.0724295 0.00517307
thread0::TensorCopy:GPU->CPU 5009 297.886 297.782095 (0.999652) 0.103633 (0.000348) 0.043194 0.612426 0.0594701 0.00424748
thread0::GpuMemcpySync:GPU->CPU 5009 278.55 270.104589 (0.969679) 8.445858 (0.030321) 0.039548 0.609273 0.05561 0.00397179
thread0::GpuMemcpyAsync:CPU->GPU 20036 230.007 198.920144 (0.864846) 31.086396 (0.135154) 0.008387 0.708778 0.0114797 0.00327961
thread0::slice 5009 152.385 135.526127 (0.889365) 16.859146 (0.110635) 0.026901 0.06877 0.0304223 0.00217283
thread0::tanh 5009 122.456 112.432469 (0.918148) 10.023237 (0.081852) 0.020671 0.070096 0.0244471 0.00174607
thread0::feed 20036 73.4209 73.411569 (0.999873) 0.009355 (0.000127) 0.002908 0.042732 0.00366445 0.00104689
Multi-head attention fusion
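For reference, the subgraph this fusion collapses is the standard multi-head attention computation: the Q/K/V fc ops, the reshape2/transpose2 pairs, scale, two matmuls and a softmax. A numpy sketch with the model's shapes (seq 128, 12 heads, hidden 768; illustrative only):

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

seq, hidden, heads = 128, 768, 12
d = hidden // heads
x = np.random.randn(seq, hidden)
wq, wk, wv = (np.random.randn(hidden, hidden) for _ in range(3))

# fc -> reshape2 -> transpose2 for each of Q, K, V
q = (x @ wq).reshape(seq, heads, d).transpose(1, 0, 2)
k = (x @ wk).reshape(seq, heads, d).transpose(1, 0, 2)
v = (x @ wv).reshape(seq, heads, d).transpose(1, 0, 2)

# scale -> matmul -> softmax -> matmul, then transpose2 -> reshape2 back
attn = softmax((q @ k.transpose(0, 2, 1)) / np.sqrt(d))
out = (attn @ v).transpose(1, 0, 2).reshape(seq, hidden)

assert out.shape == (seq, hidden)
```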
--- Running analysis [ir_graph_build_pass]
--- Running analysis [ir_graph_clean_pass]
--- Running analysis [ir_analysis_pass]
--- Running IR pass [cudnn_placement_pass]
--- Running IR pass [is_test_pass]
--- Running IR pass [simplify_with_basic_ops_pass]
--- Running IR pass [conv_affine_channel_fuse_pass]
--- Running IR pass [conv_eltwiseadd_affine_channel_fuse_pass]
--- Running IR pass [conv_bn_fuse_pass]
--- Running IR pass [conv_eltwiseadd_bn_fuse_pass]
--- Running IR pass [multihead_matmul_fuse_pass]
I1120 01:40:10.299883 6510 graph_pattern_detector.cc:97] --- detected 12 subgraphs
--- Running IR pass [fc_fuse_pass]
I1120 01:40:10.344569 6510 graph_pattern_detector.cc:97] --- detected 12 subgraphs
I1120 01:40:10.357105 6510 graph_pattern_detector.cc:97] --- detected 26 subgraphs
--- Running IR pass [fc_elementwise_layernorm_fuse_pass]
I1120 01:40:10.397455 6510 graph_pattern_detector.cc:97] --- detected 24 subgraphs
--- Running IR pass [conv_elementwise_add_act_fuse_pass]
--- Running IR pass [conv_elementwise_add2_act_fuse_pass]
--- Running IR pass [conv_elementwise_add_fuse_pass]
--- Running IR pass [transpose_flatten_concat_fuse_pass]
--- Running IR pass [runtime_context_cache_pass]
--- Running analysis [ir_params_sync_among_devices_pass]
I1120 01:40:10.509421 6510 ir_params_sync_among_devices_pass.cc:41] Sync params from CPU to GPU
--- Running analysis [adjust_cudnn_workspace_size_pass]
--- Running analysis [inference_op_replace_pass]
--- Running analysis [ir_graph_to_program_pass]
I1120 01:40:10.758424 6510 analysis_predictor.cc:474] ======= optimize end =======
I1120 01:40:11.764863 6510 ernie_tester.cc:210] Load 5010 samples from /data/ernie/seq128_data/test_ds
W1120 01:40:12.444176 6510 device_context.cc:236] Please NOTE: device: 0, CUDA Capability: 61, Driver API Version: 10.1, Runtime API Version: 9.0
.
.
.
I1120 01:40:43.862216 6510 ernie_tester.cc:345] Run 5010 samples, average latency: 6.40495 ms per sample.
I1120 01:40:43.862303 6510 ernie_tester.cc:350] Run 5009 samples, average latency [exclude 1 warmup steps]: 5.74544 ms per sample.
- Profile
-------------------------> Profiling Report <-------------------------
Place: All
Time unit: ms
Sorted by total time in descending order in the same thread
Event Calls Total CPU Time (Ratio) GPU Time (Ratio) Min. Max. Ave. Ratio.
thread0::fused_fc_elementwise_layernorm 115207 17021.4 5933.953144 (0.348617) 11087.450962 (0.651383) 0.09927 8.71612 0.147746 0.259676
thread0::multihead_matmul 60108 11252.9 7224.504494 (0.642012) 4028.405127 (0.357988) 0.169878 6.57812 0.187212 0.171673
thread0::fc 75135 10707.1 4357.612778 (0.406985) 6349.445604 (0.593015) 0.038466 85.6766 0.142504 0.163345
thread0::mul 180324 9774.02 4994.708598 (0.511019) 4779.307233 (0.488981) 0.047804 9.68787 0.0542025 0.149111
thread0::fetch 5009 4630.95 4629.392276 (0.999664) 1.554398 (0.000336) 0.055675 1.92049 0.924525 0.0706489
thread0::TensorCopy:GPU->CPU 5009 4573.44 4571.905707 (0.999664) 1.538373 (0.000336) 0.04468 1.90973 0.913045 0.0697717
thread0::GpuMemcpySync:GPU->CPU 5009 4553.78 4543.954556 (0.997843) 9.821141 (0.002157) 0.040962 1.9062 0.909119 0.0694716
thread0::stack 5009 684.909 634.119137 (0.925844) 50.789981 (0.074156) 0.106366 2.25957 0.136736 0.0104489
thread0::layer_norm 10018 549.522 290.654864 (0.528923) 258.867095 (0.471077) 0.048854 0.142508 0.0548535 0.00838342
thread0::lookup_table 15027 479.099 392.001286 (0.818205) 87.097997 (0.181795) 0.024349 9.00457 0.0318826 0.00730906
thread0::elementwise_add 15027 345.643 302.006333 (0.873752) 43.636814 (0.126248) 0.017625 3.1705 0.0230015 0.00527307
thread0::GpuMemcpyAsync:CPU->GPU 20036 229.869 198.181250 (0.862150) 31.687294 (0.137850) 0.008188 2.08116 0.0114728 0.00350684
thread0::scale 10018 224.898 210.406303 (0.935565) 14.491370 (0.064435) 0.018757 0.143304 0.0224494 0.003431
thread0::slice 5009 169.833 151.921321 (0.894535) 17.911378 (0.105465) 0.030184 0.072767 0.0339055 0.00259094
thread0::matmul 5009 163.061 149.402692 (0.916239) 13.658130 (0.083761) 0.028052 0.956259 0.0325536 0.00248763
thread0::tanh 5009 116.672 106.307159 (0.911166) 10.364377 (0.088834) 0.020194 0.067294 0.0232924 0.00177992
thread0::feed 20036 71.6392 71.630728 (0.999882) 0.008476 (0.000118) 0.002808 0.034718 0.00357552 0.00109292
QA test results
- Test date: 2019-12-24
- Test environment:
  - CUDA 9.0
  - CUDNN 7.3
  - Paddle inference library version: a5a8d14414213fadcfcd7dc60c794d1a515a390e
  - Test image: paddlepaddle/paddle_manylinux_devel:cuda9.0_cudnn7
  - GPU: Tesla P4 8G (7611MB)
  - GCC version: gcc482 from the image
- Test conclusions:
  - ERNIE-GPU inference performance meets expectations
  - For the same data, the diff between GPU and CPU predictions is < 1e-5
  - Repeatedly predicting the same dataset on GPU yields a diff < 1e-6
- Performance data:
  - Seqlen128-data

batch_size | average_latency(ms) | batch_latency(ms) | gpu_memory(MB)
---|---|---|---
1 | 8.664 | 8.664 | 737
10 | 6.388 | 63.878 | 825
20 | 6.207 | 124.149 | 919
30 | 6.530 | 195.892 | 985
40 | 6.456 | 258.260 | 1085
50 | 6.445 | 322.275 | 1135
60 | 6.430 | 385.805 | 1279
70 | 6.457 | 451.965 | 1283
80 | 6.415 | 513.178 | 1339
90 | 6.481 | 583.301 | 1431
100 | 6.504 | 650.428 | 1485

  - Seqlen512-data

batch_size | average_latency(ms) | batch_latency(ms) | gpu_memory(MB)
---|---|---|---
1 | 36.123 | 36.123 | 781
10 | 31.663 | 316.630 | 1191
20 | 31.750 | 635.002 | 1637
30 | 32.280 | 968.412 | 2053
40 | 32.130 | 1285.212 | 2431
50 | 32.917 | 1645.835 | 2847
100 | 34.254 | 3425.380 | 4951
200 | out of GPU memory | - | -
300 | out of GPU memory | - | -
Phase-2 optimization work
- Owner: @NHZlX
- GPU platform: Tesla P4
- Software:
  - CUDA 10.0
Optimization work

Version | P4 time (ms) | Date recorded | PR | Description | Speedup
---|---|---|---|---|---
0 | 10 | 2020-02-04 | - | baseline; CUDA 10 times are longer than CUDA 9 times | -
1 | 9.5 | 2020-02-04 | PaddlePaddle/Paddle#22486 | Also fuse the 3 fc ops into multihead_matmul_op, combining 3 small GEMMs into one large GEMM | 5%
2 | 8.4 | 2020-02-04 | PaddlePaddle/Paddle#22494 | Fuse 3 lookup_table + elementwise_add + layer_norm into fused_embedding_eltwise_layernorm_op | 11.6%
3 | 7.8 | 2020-02-05 | - | In fused_fc_elementwise_layernorm_op's layer_norm computation, the in-block reduce used double; change it to float | -

[Figure] The subgraph pattern matched by embedding_eltwise_layernorm_fuse_pass
[Figure] P4 floating-point compute capability
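Version 1's key idea can be sketched in numpy: the three Q/K/V GEMMs share the same input, so concatenating the weights turns them into one larger GEMM (illustrative shapes):

```python
import numpy as np

x = np.random.randn(128, 768)
wq, wk, wv = (np.random.randn(768, 768) for _ in range(3))

# Three small GEMMs...
q, k, v = x @ wq, x @ wk, x @ wv

# ...equal one large GEMM against the concatenated weight.
w_qkv = np.concatenate([wq, wk, wv], axis=1)  # (768, 2304)
qkv = x @ w_qkv

assert np.allclose(qkv[:, :768], q)
assert np.allclose(qkv[:, 768:1536], k)
assert np.allclose(qkv[:, 1536:], v)
```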