
Comments (26)

wojtuss avatar wojtuss commented on May 22, 2024 3

Below are our latest performance results for Ernie FP32 and INT8 runs. The tests were run with affinity settings

export KMP_AFFINITY=granularity=fine,compact,1,0
export KMP_BLOCKTIME=1

on CLX 6248. INT8 tests were run with the memory_optimize_pass disabled, due to PaddlePaddle/Paddle#21492.
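
For reference, below is a minimal sketch of how such a run can be configured from Python. It assumes the paddle.fluid.core AnalysisConfig API of that release and a hypothetical model path; exact method names may differ between Paddle versions:

import os

# Affinity settings used for the benchmarks; set before MKL/OpenMP threads start.
os.environ["KMP_AFFINITY"] = "granularity=fine,compact,1,0"
os.environ["KMP_BLOCKTIME"] = "1"

from paddle.fluid.core import AnalysisConfig, create_paddle_predictor

config = AnalysisConfig("path/to/ernie_fp32_model")  # hypothetical model directory
config.disable_gpu()
config.enable_mkldnn()
config.set_cpu_math_library_num_threads(20)  # 1 or 20 threads in the runs reported here
# For the INT8 runs the memory_optimize_pass was disabled (PaddlePaddle/Paddle#21492),
# so enable_memory_optim() is deliberately not called.
predictor = create_paddle_predictor(config)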

With the current develop branch (1fd1f06) and the origin FP32 model, the latency was:
FP32, 20 threads: 29.10 ms,
FP32, 1 thread: 180.95 ms.

With additional commits from PRs

the latency for INT8 (with FC quantized only) was:
INT8, 20 threads: 23.73 ms (18.45% faster than FP32),
INT8, 1 thread: 84.32 ms (53.40% faster than FP32).
After enabling INT8 kernels of reshape2 and transpose2 (support already present in develop branch), the results were:
INT8, 20 threads: 20.47 ms (29.68% faster than FP32),
INT8, 1 thread: 80.64 ms (55.43% faster than FP32).

With additional optimizations that are possible to implement:

  • avoiding reorder nwc->ncw before layer_norm (here imitated by removing the reorder),
  • fusing scale ops into fc or into dequantize (here imitated by removing the scale ops),

the latency for INT8 (with FC quantized only) was:
INT8, 20 threads: 20.68 ms (28.93% faster than FP32),
INT8, 1 thread: 77.84 ms (56.99% faster than FP32).
With INT8 kernels also for reshape2 and transpose2, the latency was:
INT8, 20 threads: 17.39 ms (40.24% faster than FP32),
INT8, 1 thread: 75.66 ms (58.19% faster than FP32).
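
For clarity, the speedup percentages quoted above are the relative latency reduction with respect to the FP32 baseline; a quick check in Python:

def speedup_pct(fp32_ms, int8_ms):
    # Relative latency reduction of INT8 vs. the FP32 baseline, in percent.
    return (fp32_ms - int8_ms) / fp32_ms * 100.0

print(round(speedup_pct(29.10, 23.73), 2))   # 18.45 (20 threads, FC quantized only)
print(round(speedup_pct(180.95, 84.32), 2))  # 53.4  (1 thread, FC quantized only)
print(round(speedup_pct(29.10, 17.39), 2))   # 40.24 (20 threads, with all listed optimizations)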

Other optimizations, like quantization of more operators and more fusions, are also being investigated.


wojtuss avatar wojtuss commented on May 22, 2024 3

We have finally resolved the accuracy issue for the Ernie INT8 run.

Background:
FC operators in the Ernie model have 2-dimensional weights and receive 3-dimensional inputs. Using the MKL-DNN inner product requires transforming the weights and inputs into properly shaped 3-dimensional tensors. As long as the input is in NCHW format (for FP32), the transformation is possible and Ernie FP32 achieves good accuracy. However, when the input is in NHWC format (for INT8), transforming the input and the weights into proper tensors requires modifying strides, which the MKL-DNN inner product does not support.

Solution:
The NHWC format of INT8 input data is beneficial for conv2d INT8 kernel (because of the nature of the conv2d's operations), but for the FC the NCHW format remains optimal also for INT8. The solution is to add support for an NCHW INT8 output of the quantize operator and use it before FC INT8. This fixes the accuracy problem. Having applied the fix we got 0.795582 accuracy for Ernie INT8 (on an SKX, on a CLX it will probably be even better; for the original QAT model, we got 0.79759 on a CLX). A PR with a fix and benchmark results will be prepared next week.
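
For illustration only, here is a small NumPy sketch of the shape handling described above (not the MKL-DNN code itself): the 2-dimensional FC weights are applied to a 3-dimensional input by collapsing the leading dimensions, which is cheap for the NCHW layout used for FP32 but would require modified strides for the NHWC INT8 layout:

import numpy as np

batch, seq_len, hidden = 1, 128, 768
x = np.random.rand(batch, seq_len, hidden).astype(np.float32)  # 3-D FC input
w = np.random.rand(hidden, hidden).astype(np.float32)          # 2-D FC weights
b = np.zeros(hidden, dtype=np.float32)

# The FC treats the 3-D input as a (batch * seq_len, hidden) matrix.
# With the NCHW layout used for FP32 this collapse is just a view; with the NHWC
# layout produced by the INT8 quantize op it would need strided access, which the
# MKL-DNN inner product does not support; hence the NCHW INT8 quantize output.
y = (x.reshape(-1, hidden) @ w + b).reshape(batch, seq_len, -1)
print(y.shape)  # (1, 128, 768)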


wojtuss avatar wojtuss commented on May 22, 2024 3

With the fix for FC INT8 (PR PaddlePaddle/Paddle#22404, branch Sand3r-:mgallus/3d-fc-acc) we got the following results for Ernie on a CLX 6248 machine:

|          | accuracy | latency (ms), 1 thread | latency (ms), 20 threads |
|----------|----------|------------------------|--------------------------|
| FP32     | 0.802008 | 190.1                  | 31.2                     |
| INT8     | 0.799598 | 70.6                   | 17.3                     |
| QAT FP32 | 0.797590 |                        |                          |


bingyanghuang avatar bingyanghuang commented on May 22, 2024 2

Schedule plan of ERNIE INT8 optimization:

| ERNIE INT8 Task | Status | Plan to Finish | Risk | Developer |
|---|---|---|---|---|
| Accuracy check in UT | WIP | 12.06 | Low | Asia |
| MKL-DNN inner product performance benchmark UT | benchmark done | 12.06 | Low | Wojtuss |
| 3D support in FC INT8 kernel | WIP | 12.13 | High | Michal |
| Fusion: FC INT8 | Not started | 12.13 | Dependency | - |
| Fusion: FC+scale | WIP | 12.13 | Dependency | Danqing |
| Reshape2 + Transpose2 INT8 kernel verification | WIP | 12.13 | Dependency | Asia & Wojtuss |
| Overall performance & accuracy tuning | Not started | 12.20 | Dependency | - |


Meiyim avatar Meiyim commented on May 22, 2024 1

The output float numbers are the logits of a 3-label classification task, and the label file contains the class_id of the true label,
so you can evaluate the accuracy with the Python code below:

import sys
import numpy as np

total, correct = 0, 0
for line in sys.stdin:
    # Each pasted line holds three logits followed by the true class id, tab-separated.
    a, b, c, d = line.strip().split('\t')
    pred = np.array([float(a), float(b), float(c)]).argmax()
    if pred == int(d):
        correct += 1
    total += 1
print(float(correct) / total)

run with:

paste score_3_columns label.xnli.dev | python eval.py
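
Equivalently, a self-contained variant that reads the two files directly instead of going through paste (assuming one tab- or space-separated score triple per line in score_3_columns and one class id per line in label.xnli.dev, which is what the paste pipeline above implies):

import numpy as np

with open("score_3_columns") as f_scores, open("label.xnli.dev") as f_labels:
    scores = [[float(v) for v in line.split()] for line in f_scores if line.strip()]
    labels = [int(line.strip()) for line in f_labels if line.strip()]

preds = np.argmax(np.array(scores), axis=1)
print(float(np.mean(preds == np.array(labels))))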


wojtuss avatar wojtuss commented on May 22, 2024 1

The latest FP32 results for the current clean develop branch (25e765a4fe) on SKX 6148:

  • 4-dimensional input (fp32_model, test_ds):
    1 thread: 189.39 ms,
    20 threads: 30.20 ms.
  • 2-dimensional input (origin, 1.8w.bs1):
    1 thread: 192.31 ms,
    20 threads: 33.90 ms.

After merging the PRs PaddlePaddle/Paddle#21746 and PaddlePaddle/Paddle#21754, the FP32 results are basically the same:

  • 2-dimensional input (origin, 1.8w.bs1):
    1 thread: 194.45 ms,
    20 threads: 33.93 ms.


wojtuss avatar wojtuss commented on May 22, 2024

We are currently working with the following two models: ernie_quant.zip and fp32_model.tar.gz.
Results gathered with the optimized fp32_model model will be labeled FP32.
Results gathered with the fake quantized FP32 ernie_quant model will be labeled QAT FP32.
Results gathered with an optimized INT8 model obtained from ernie_quant model will be labeled QAT INT8.

Initial performance results, using the former unit test obtained from Baidu (for FP32, 4 inputs, test_ds dataset) and the latest one (for QAT FP32/INT8, 2 inputs, 1.8w.bs1 dataset):
(20 threads, bs 1, CLX 6248)
FP32:
Run 5010 samples, average latency: 55.7539 ms per sample.
Run 5009 samples, average latency [exclude 1 warmup steps]: 55.7415 ms per sample.
QAT FP32:
Run 2490 samples, average latency: 92.2313 ms per sample.
QAT INT8 (with mul):
Run 2490 samples, average latency: 47.1024 ms per sample.


wojtuss avatar wojtuss commented on May 22, 2024

I have just noticed that the above results were mixed up. They have been updated now.


wojtuss avatar wojtuss commented on May 22, 2024

FP32 results with @GaoWei8's fix (PaddlePaddle/Paddle#20972):
Run 5010 samples, average latency: 32.0582 ms per sample.


bingyanghuang avatar bingyanghuang commented on May 22, 2024

Attached are the baseline profile results:

  • FP32 with Gaowei's padding patch
Event Calls Total Min. Max. Ave. Ratio.
fc 370666 127408 0.01446 20.14810 0.34373 0.75962
elementwise_add 190342 10271.7 0.03167 4.93884 0.05396 0.06124
transpose2 240432 7704.78 0.02098 3.71560 0.03205 0.04594
matmul 125225 7266.52 0.02186 3.96825 0.05803 0.04332
scale 70126 4823.14 0.00407 4.92560 0.06878 0.02876
layer_norm 125225 4184.69 0.02592 4.72700 0.03342 0.02495
softmax 60108 2565.96 0.03276 2.04443 0.04269 0.01530
reshape2 240432 1383.47 0.00393 2.92184 0.00575 0.00825
lookup_table 15027 1331.17 0.06888 1.23500 0.08859 0.00794
stack 5009 462.643 0.06707 0.30953 0.09236 0.00276
tanh 5009 181.429 0.02849 0.12309 0.03622 0.00108
slice 5009 68.8743 0.01152 0.02409 0.01375 0.00041
fetch 5009 40.985 0.00489 0.01910 0.00818 0.00024
feed 20036 33.6655 0.00066 0.01238 0.00168 0.00020
  • Real INT8 ERNIE profile
Event Calls Total Min. Max. Ave. Ratio.
elementwise_add 290080 48707.3 0.00604 7.29221 0.16791 0.39068
mul 191660 41987.3 0.02023 4.70104 0.21907 0.33678
reshape2 124320 9530.43 0.02449 0.37624 0.07666 0.07644
quantize 129500 5801.52 0.00967 2.96898 0.04480 0.04653
transpose2 124320 4886.13 0.01829 3.68880 0.03930 0.03919
matmul 64750 3845.23 0.02139 15.42340 0.05939 0.03084
scale 36260 2754.43 0.00546 0.14959 0.07596 0.02209
relu 31080 2548.26 0.05381 2.56348 0.08199 0.02044
layer_norm 64750 2127.01 0.02338 1.18454 0.03285 0.01706
softmax 31080 1192.7 0.02731 2.38116 0.03838 0.00957
lookup_table 7770 707.691 0.04808 0.23893 0.09108 0.00568
stack 2590 260.936 0.06343 0.27896 0.10075 0.00209
slice 7770 74.5761 0.00523 0.02923 0.00960 0.00060
tanh 2590 57.4421 0.01791 0.08974 0.02218 0.00046
unsqueeze2 5180 45.0293 0.00550 0.02471 0.00869 0.00036
fill_constant 10360 37.9978 0.00214 0.02271 0.00367 0.00030
fetch 2590 23.3609 0.00753 0.01779 0.00902 0.00019
range 2590 23.1647 0.00674 0.01581 0.00894 0.00019
expand 2590 18.4988 0.00649 0.01439 0.00714 0.00015
cast 5180 16.361 0.00223 0.01195 0.00316 0.00013
equal 2590 12.2261 0.00411 0.01321 0.00472 0.00010
feed 5180 11.6075 0.00127 0.00955 0.00224 0.00009
shape 2590 5.52088 0.00176 0.01129 0.00213 0.00004


bingyanghuang avatar bingyanghuang commented on May 22, 2024
  1. The FP32 ERNIE model and the INT8 ERNIE model differ slightly in the number of inputs.

  2. The INT8 optimized latency is slower than the FP32 optimized latency because:

    • elementwise_add takes a lot of time after quantizing the mul; an fc INT8 op (with support for 3-dimensional fc inputs) should be introduced. Is the elementwise_add a bias or a residual add? (See the rough check after this list.)

    • Too much time is spent on reorders; the INT8 pipeline for the feed-forward part can be optimized further.

    • INT8 GEMM execution with MKL runs without padding (cf. the UT benchmarking MKL-DNN INT8 GEMM with the same dimensions as the FP32 GEMM).
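
A rough check of the first point, directly from the per-call averages in the two profiles above (call counts differ between the runs, so this is only indicative):

# Per-call averages taken from the profile tables above (Total / Calls, in ms).
fp32_fc_avg  = 127408.0 / 370666   # fused fc in the FP32 run
int8_mul_avg = 41987.3 / 191660    # mul in the INT8 run
int8_add_avg = 48707.3 / 290080    # elementwise_add in the INT8 run

print(round(fp32_fc_avg, 3))                  # 0.344
print(round(int8_mul_avg + int8_add_avg, 3))  # 0.387 -> mul + add already slower per call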


bingyanghuang avatar bingyanghuang commented on May 22, 2024
MKL-DNN 0.20 from paddle, commit aef88b7c2

20 threads:

| Dimension | 128x768x768 | 128x3072x768 | 128x768x3072 |
|---|---|---|---|
| u8s8s32s32 | 0.0545178 | 0.170797 | 0.129593 |
| s8s8s32s32 | 0.102495 | 0.326905 | 0.320389 |
| f32 | 0.135169 | 0.419238 | 0.271828 |
| f32 -> s8 increase | 24% | 22% | -18% |

1 thread:

| Dimension | 128x768x768 | 128x3072x768 | 128x768x3072 |
|---|---|---|---|
| u8s8s32s32 | 0.21483 | 0.884561 | 0.904923 |
| s8s8s32s32 | 0.638012 | 2.62521 | 2.57349 |
| f32 | 0.848266 | 3.6114 | 3.40055 |
| f32 -> s8 increase | 25% | 27% | 24% |


wozna avatar wozna commented on May 22, 2024

@luotao1
I'm working on the accuracy test on ernie2 right now.
I've run the test on:
ernie_origin_2_inputs.tar.gz, an fp32 model with 2 inputs,
data file - 1.8w.bs1,
label file - label.xnli.dev, which looks like 2 0 1 2 0 1 0 1 2 1 0 2 ...

I got outputs like the ones below:
--- iteration 204 ---
0.0957182 0.0770019 0.243675
--- iteration 205 ---
0.0957183 0.0770018 0.243675
--- iteration 206 ---
0.0957183 0.0770018 0.243675
--- iteration 207 ---
0.0957182 0.0770018 0.243675
--- iteration 208 ---
0.0957182 0.0770017 0.243675

My question is: am I using the correct file or is there any process for calculating accuracy?


bingyanghuang avatar bingyanghuang commented on May 22, 2024
MKL-DNN master (v1.1), commit 1ee831fa6a2f802de1d399fe1de4e6cc629ad855

20 threads:

| problem descriptor | 128x768x768 | 128x3072x768 | 128x768x3072 |
|---|---|---|---|
| u8s8s32 | 0.0446907 | 0.162134 | 0.132759 |
| s8s8s32 | 0.0455766 | 0.15978 | 0.13624 |
| f32 | 0.081141 | 0.283924 | 0.467588 |
| f32 -> s8 increase | 44% | 44% | 71% |

1 thread:

| problem descriptor | 128x768x768 | 128x3072x768 | 128x768x3072 |
|---|---|---|---|
| u8s8s32 | 0.216017 | 0.925034 | 0.901553 |
| s8s8s32 | 0.221654 | 0.947028 | 0.942317 |
| f32 | 0.742698 | 2.96747 | 6.87472 |
| f32 -> s8 increase | 70% | 68% | 86% |


wozna avatar wozna commented on May 22, 2024

Let me repeat my question.
I'm checking accuracy on the trained Ernie models with 2 inputs, but the outputs are still floats.
They look like this:
--- iteration 0 ---
-2.22873 -1.84079 3.08339 ...
And the label file, label.xnli.dev, consists of integers: 2 0 1 2 0 1 0 1 2 1 0 2 ...

Should the output labels have a floating-point type?


wozna avatar wozna commented on May 22, 2024

@Meiyim, you are right. Thank you for the explanation.


wojtuss avatar wojtuss commented on May 22, 2024

I have updated the results in #275 (comment) above. Results with additionally quantized reshape2 and transpose2 are added.


luotao1 avatar luotao1 commented on May 22, 2024

> FP32, 1 thread: 180.95 ms.

@bingyanghuang @wojtuss Could you tell us why the mkldnn update to 1.1 gives such a great improvement (250 ms -> 180 ms)? We really want to know what happened in mkldnn 1.1.


wojtuss avatar wojtuss commented on May 22, 2024

@luotao1 ,
Today we found that, since our last benchmarks, there were some changes in the FC INT8 3D support PR's code (PaddlePaddle/Paddle#21746) which influenced the performance. There is a difference in performance between our previous results and today's, after merging the PRs into the develop branch, both on CLX 6248 and on SKX 6148.
Here are results showing the difference on SKX 6148:

  • previous FP32
    1 thread: 183.92 ms
    20 threads: 31.09 ms
  • current FP32
    1 thread: 214.39 ms
    20 threads: 53.69 ms

We are investigating that.


wojtuss avatar wojtuss commented on May 22, 2024

@luotao1 ,
The PR with FC INT8 3D is already fixed and updated.


luotao1 avatar luotao1 commented on May 22, 2024

@GaoWei8 Please double-check the FP32 performance.


GaoWei8 avatar GaoWei8 commented on May 22, 2024

The latest FP32 results for the current clean develop branch (b9dbde12b3) on SKX 6148:

4-dimensional input (fp32_model, test_ds):
1 thread: 249.714 ms
20 threads: 29.741 ms


bingyanghuang avatar bingyanghuang commented on May 22, 2024

@GaoWei8 Could you please paste your profile log of FP32 baseline in this issue? Let's try to align our baseline.


GaoWei8 avatar GaoWei8 commented on May 22, 2024

@bingyanghuang
The latest FP32 results for the current clean develop branch (b9dbde12b3) on SKX 6148
4-dimensional input (fp32_model, test_ds):
1 thread profile:

I0102 03:32:40.693799 12448 inference.cc:357] Run 5010 samples, average latency: 249.052 ms per sample.
I0102 03:32:40.693889 12448 inference.cc:362] Run 5009 samples, average latency [exclude 1 warmup steps]: 249.05 ms per sample.

------------------------->     Profiling Report     <-------------------------

Place: CPU
Time unit: ms
Sorted by total time in descending order in the same thread

Event                       Calls       Total       Min.        Max.        Ave.        Ratio.
thread0::fc                 370740      1.10423e+06 0.016733    6.69891     2.97844     0.885518
thread0::elementwise_add    190380      43708.8     0.115389    0.955863    0.229587    0.0350516
thread0::matmul             125250      33400       0.025127    2.02142     0.266667    0.0267846
thread0::transpose2         240480      22726.5     0.069161    0.33388     0.0945046   0.0182251
thread0::layer_norm         125250      19296       0.138214    0.232502    0.15406     0.0154741
thread0::softmax            60120       14590.1     0.236914    0.48581     0.242683    0.0117003
thread0::scale              70140       4239.09     0.006753    0.116512    0.0604376   0.00339947
thread0::reshape2           240480      2274.57     0.0074      0.055957    0.00945847  0.00182406
thread0::lookup_table       15030       1274.28     0.076111    0.160232    0.0847821   0.00102188
thread0::stack              5010        646.453     0.121365    0.180305    0.129033    0.000518413
thread0::load               202         270.43      0.008871    169.244     1.33876     0.000216867
thread0::tanh               5010        146.669     0.026564    0.197547    0.0292752   0.000117618
thread0::slice              5010        85.3612     0.015276    0.045218    0.0170382   6.84541e-05
thread0::feed               20040       56.402      0.001333    0.011693    0.00281447  4.52307e-05
thread0::fetch              5010        43.0507     0.00684     0.043593    0.00859295  3.45238e-05

20 threads profile:

I0102 03:11:21.370021 12402 inference.cc:357] Run 5010 samples, average latency: 29.5513 ms per sample.
I0102 03:11:21.370100 12402 inference.cc:362] Run 5009 samples, average latency [exclude 1 warmup steps]: 29.5477 ms per sample.

------------------------->     Profiling Report     <-------------------------

Place: CPU
Time unit: ms
Sorted by total time in descending order in the same thread

Event                       Calls       Total       Min.        Max.        Ave.        Ratio.
thread0::fc                 370740      108360      0.016532    18.4293     0.292281    0.73505
thread0::elementwise_add    190380      9398.93     0.027077    7.27942     0.0493693   0.0637567
thread0::matmul             125250      7594.48     0.021428    9.68409     0.0606346   0.0515164
thread0::transpose2         240480      7444.04     0.02215     7.14557     0.0309549   0.0504959
thread0::layer_norm         125250      4383.38     0.029262    6.79018     0.0349971   0.0297342
thread0::scale              70140       3930.57     0.00581     7.21213     0.0560389   0.0266626
thread0::softmax            60120       2166.03     0.029904    7.0274      0.0360284   0.014693
thread0::reshape2           240480      1896.58     0.006059    7.18181     0.00788662  0.0128652
thread0::lookup_table       15030       1283.8      0.070953    7.12537     0.0854156   0.0087085
thread0::stack              5010        376.056     0.063878    0.828854    0.075061    0.00255093
thread0::load               202         272.917     0.009124    171.089     1.35107     0.0018513
thread0::tanh               5010        136.909     0.024189    0.195245    0.0273272   0.00092871
thread0::slice              5010        81.6364     0.014465    0.043562    0.0162947   0.000553772
thread0::feed               20040       54.9063     0.00108     0.024591    0.00273984  0.000372451
thread0::fetch              5010        38.3997     0.00608     0.039843    0.00766461  0.00026048


bingyanghuang avatar bingyanghuang commented on May 22, 2024

The latest FP32 results for the clean develop branch (c7b03d308c, Jan 2nd) on CLX 6248
4-dimensional input (fp32_model, test_ds):
1 thread profile:

I0103 04:47:21.063290 111775 inference.cc:354] Run 5010 samples, average latency: 186.189 ms per sample.
I0103 04:47:21.063355 111775 inference.cc:359] Run 5009 samples, average latency [exclude 1 warmup steps]: 186.18 ms per sample.

------------------------->     Profiling Report     <-------------------------

Place: CPU
Time unit: ms
Sorted by total time in descending order in the same thread

Event                       Calls       Total       Min.        Max.        Ave.        Ratio.
thread0::fc                 370740      789200      0.012302    8.10634     2.12871     0.846926
thread0::elementwise_add    190380      67363.2     0.124292    11.9827     0.353835    0.0722904
thread0::matmul             125250      25282.5     0.021207    9.8986      0.201856    0.0271318
thread0::transpose2         240480      18951.5     0.061624    0.722954    0.078807    0.0203377
thread0::layer_norm         125250      14751.8     0.102778    0.470418    0.117778    0.0158308
thread0::softmax            60120       8870.11     0.138533    0.872575    0.14754     0.00951892
thread0::scale              70140       3878.18     0.004727    0.113928    0.055292    0.00416185
thread0::reshape2           240480      1665.39     0.005301    0.057911    0.00692528  0.00178721
thread0::lookup_table       15030       1000.76     0.054471    0.126544    0.0665843   0.00107396
thread0::stack              5010        470.501     0.083346    0.181878    0.0939125   0.000504916
thread0::load               202         151.379     0.007597    19.1899     0.7494      0.000162451
thread0::tanh               5010        116.446     0.020674    0.164315    0.0232427   0.000124964
thread0::slice              5010        63.1702     0.011075    0.037661    0.0126088   6.77908e-05
thread0::feed               20040       44.1006     0.000886    0.006938    0.00220063  4.73264e-05
thread0::fetch              5010        31.8417     0.00492     0.040041    0.00635563  3.41708e-05


wojtuss avatar wojtuss commented on May 22, 2024

Here come our latest performance results for Ernie FP32 and INT8 runs. The tests were run with affinity settings

export KMP_AFFINITY=granularity=fine,compact,1,0
export KMP_BLOCKTIME=1

on CLX 6248.

With the current develop branch (ad0dfb1) and the origin FP32 model, the latency was:
FP32, 20 threads: 29.73 ms,
FP32, 1 thread: 196.86 ms.

INT8 results with fc, reshape2 and transpose2 quantized:
INT8, 20 threads: 20.26 ms (31.8% faster),
INT8, 1 thread: 78.40 ms (60.2% faster).

