Comments (29)

dcslin avatar dcslin commented on July 23, 2024 1

> The code is great! It resolved the problem. I think you can send the PR

Thanks for testing.

chrishkchris avatar chrishkchris commented on July 23, 2024

It does not hang; it is just too slow, much slower than the previous version.

chrishkchris avatar chrishkchris commented on July 23, 2024

Using mnist_cnn.py for CPU training:

The dev branch takes 14.469275s for one batch of training and one batch of evaluation.
The master branch takes 0.08s for one batch of training and one batch of evaluation.

So something is probably wrong.

chrishkchris avatar chrishkchris commented on July 23, 2024

I profiled the time (in seconds) per operation in mnist_cnn.py:
conv1: 0.5253171920776367
relu: 0.015198707580566406
pooling1: 0.0033032894134521484
conv2: 2.7908501625061035
relu: 0.004743099212646484
pooling2: 0.0012176036834716797
flatten: 0.00013828277587890625
linear1: 0.0011014938354492188
relu: 0.0008671283721923828

chrishkchris avatar chrishkchris commented on July 23, 2024

I think we need to double-check the conv2d implementation.

chrishkchris avatar chrishkchris commented on July 23, 2024

I have tried using both GCC OpenMP and Intel TBB (Threading Building Blocks) when compiling DNNL from source.

The time is extremely slow (the normal time per epoch should be around a minute), but the training loss results are correct.

  1. GCC OpenMP
root@3edb30e30b08:~/dcsysh/singa/examples/autograd# python3 mnist_cnn.py
Starting Epoch 0:
Training loss = 564.547180, training accuracy = 0.800644
Evaluation accuracy = 0.931591, Elapsed Time = 1348.363244s
Starting Epoch 1:
Training loss = 229.964905, training accuracy = 0.922892
Evaluation accuracy = 0.959535, Elapsed Time = 1344.685418s
Starting Epoch 2:
Training loss = 163.646332, training accuracy = 0.944837
Evaluation accuracy = 0.973758, Elapsed Time = 1346.530425s
Starting Epoch 3:
Training loss = 135.699615, training accuracy = 0.954526
Evaluation accuracy = 0.970152, Elapsed Time = 1346.398193s
Starting Epoch 4:
Training loss = 115.944962, training accuracy = 0.962096
Evaluation accuracy = 0.968750, Elapsed Time = 1349.933991s
Starting Epoch 5:
Training loss = 102.581963, training accuracy = 0.965548
Evaluation accuracy = 0.976963, Elapsed Time = 1343.627475s
Starting Epoch 6:
Training loss = 91.995560, training accuracy = 0.969701
Evaluation accuracy = 0.980168, Elapsed Time = 1345.709435s
Starting Epoch 7:
Training loss = 85.334785, training accuracy = 0.971051
Evaluation accuracy = 0.977664, Elapsed Time = 1342.384448s
Starting Epoch 8:
Training loss = 81.609375, training accuracy = 0.972018
Evaluation accuracy = 0.981571, Elapsed Time = 1345.214866s
Starting Epoch 9:
Training loss = 76.690147, training accuracy = 0.974203
Evaluation accuracy = 0.977364, Elapsed Time = 1354.111479s

  2. Intel TBB (Threading Building Blocks)
root@3edb30e30b08:~/dcsysh/singa/examples/autograd# python3 mnist_cnn.py
Starting Epoch 0:
Training loss = 566.089539, training accuracy = 0.800527
Evaluation accuracy = 0.938201, Elapsed Time = 1571.624848s
Starting Epoch 1:
Training loss = 229.882874, training accuracy = 0.923192
Evaluation accuracy = 0.957833, Elapsed Time = 1569.219801s
Starting Epoch 2:
Training loss = 164.734573, training accuracy = 0.945137
Evaluation accuracy = 0.955929, Elapsed Time = 1567.359108s
Starting Epoch 3:
Training loss = 132.956802, training accuracy = 0.955310
Evaluation accuracy = 0.968550, Elapsed Time = 1572.159664s
Starting Epoch 4:
Training loss = 117.263237, training accuracy = 0.960646
Evaluation accuracy = 0.969151, Elapsed Time = 1570.090345s
Starting Epoch 5:
Training loss = 105.917274, training accuracy = 0.965115
Evaluation accuracy = 0.978466, Elapsed Time = 1569.966338s
Starting Epoch 6:
Training loss = 93.056519, training accuracy = 0.968700
Evaluation accuracy = 0.976362, Elapsed Time = 1571.289907s
Starting Epoch 7:
Training loss = 85.500954, training accuracy = 0.971101
Evaluation accuracy = 0.981771, Elapsed Time = 1572.169596s

  3. The old MKL-DNN in the master branch, results copied from PR #579
ubuntu@ip-172-31-24-48:~/singa/examples/autograd$ python3 mnist_cnn.py
Starting Epoch 0:
Training loss = 585.431152, training accuracy = 0.791739
Evaluation accuracy = 0.930088, Elapsed Time = 55.447133s
Starting Epoch 1:
Training loss = 232.831589, training accuracy = 0.922158
Evaluation accuracy = 0.967949, Elapsed Time = 55.337850s
Starting Epoch 2:
Training loss = 166.067307, training accuracy = 0.945788
Evaluation accuracy = 0.968550, Elapsed Time = 55.367847s
Starting Epoch 3:
Training loss = 136.865341, training accuracy = 0.954092
Evaluation accuracy = 0.973357, Elapsed Time = 55.358584s
Starting Epoch 4:
Training loss = 118.813286, training accuracy = 0.960195
Evaluation accuracy = 0.979567, Elapsed Time = 55.270505s
Starting Epoch 5:
Training loss = 106.185112, training accuracy = 0.964481
Evaluation accuracy = 0.975962, Elapsed Time = 55.281344s
Starting Epoch 6:
Training loss = 94.444023, training accuracy = 0.968016
Evaluation accuracy = 0.980970, Elapsed Time = 55.081426s
Starting Epoch 7:
Training loss = 88.213493, training accuracy = 0.970418
Evaluation accuracy = 0.982873, Elapsed Time = 54.912524s
Starting Epoch 8:
Training loss = 81.126442, training accuracy = 0.972886
Evaluation accuracy = 0.981470, Elapsed Time = 54.907317s
Starting Epoch 9:
Training loss = 77.790993, training accuracy = 0.974236
Evaluation accuracy = 0.974159, Elapsed Time = 54.915229s

So, judging from the epoch times (about 1345 s vs 55 s), the DNNL version may be roughly 25 times slower than the old MKL-DNN?

chrishkchris avatar chrishkchris commented on July 23, 2024

I have found the reason: the memory format tag should be any (tag::any) in order to use the direct algorithm; otherwise the implementation silently falls back to an explicit GEMM algorithm, which is slower:
https://intel.github.io/mkl-dnn/dev_guide_convolution.html

An example of the correct implementation is:
https://intel.github.io/mkl-dnn/cnn_training_f32_8cpp-example.html
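
For reference, a minimal sketch (not SINGA's actual code; the engine, dimensions, and hyper-parameters below are placeholders) of the pattern that guide describes: the memory descriptors handed to the convolution use format_tag::any, so the primitive descriptor is free to pick the blocked layout the direct algorithm needs.

  #include <dnnl.hpp>
  using namespace dnnl;

  engine eng(engine::kind::cpu, 0);

  // Placeholder sizes: batch 32, 1 input channel, 28x28 images, 32 filters of 3x3.
  memory::dims src_dims = {32, 1, 28, 28};
  memory::dims wei_dims = {32, 1, 3, 3};
  memory::dims bia_dims = {32};
  memory::dims dst_dims = {32, 32, 28, 28};
  memory::dims strides = {1, 1}, padding = {1, 1};

  // tag::any lets DNNL choose the layout for the convolution itself;
  // the user-side tensors stay in nchw / oihw.
  memory::desc src_md(src_dims, memory::data_type::f32, memory::format_tag::any);
  memory::desc wei_md(wei_dims, memory::data_type::f32, memory::format_tag::any);
  memory::desc bia_md(bia_dims, memory::data_type::f32, memory::format_tag::x);
  memory::desc dst_md(dst_dims, memory::data_type::f32, memory::format_tag::any);

  auto conv_d = convolution_forward::desc(
      prop_kind::forward_training, algorithm::convolution_direct,
      src_md, wei_md, bia_md, dst_md, strides, padding, padding);
  auto conv_pd = convolution_forward::primitive_desc(conv_d, eng);
  // conv_pd.src_desc() / conv_pd.weights_desc() now report the layout DNNL actually wants.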

@dcslin I will try to modify the code myself first. If I find it too complex I will ask for your help because you are more familiar with the dnnl library. Thanks!

chrishkchris avatar chrishkchris commented on July 23, 2024

@dcslin

I have tried to reorder the format before passing the data into conv, but it seems I am not familiar enough with the DNNL API and it is too difficult to debug.
See the reorder descriptor I was trying and the new memory descriptor I added:
https://github.com/chrishkchris/singa/blob/conv_reorder/src/model/operation/convolution.cc#L111

From my understanding, the key concept and the step-by-step procedure are:

  1. Create memory descriptors with the tag::any format for the conv2d input, weight, bias, and output.
  2. Create the conv primitive descriptor based on the memory descriptors created above.
  3. Reorder the input and weight from the user format to the format chosen by the conv primitive descriptor.
  4. Pass the reordered input and weight into the conv for processing (see the sketch after this list).
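
A rough sketch of steps 3 and 4 (illustrative only, not the actual convolution.cc code), assuming eng, conv_pd and a dnnl::stream s as in the sketch above, user_src_memory / user_wei_memory wrapping the Tensor data in nchw / oihw, and bias_memory / dst_memory created analogously:

  // Step 3: reorder input and weights into the layout chosen by conv_pd, if needed.
  auto conv_src_memory = user_src_memory;
  if (conv_pd.src_desc() != user_src_memory.get_desc()) {
    conv_src_memory = memory(conv_pd.src_desc(), eng);   // DNNL-chosen layout
    reorder(user_src_memory, conv_src_memory)
        .execute(s, user_src_memory, conv_src_memory);
  }
  auto conv_wei_memory = user_wei_memory;
  if (conv_pd.weights_desc() != user_wei_memory.get_desc()) {
    conv_wei_memory = memory(conv_pd.weights_desc(), eng);
    reorder(user_wei_memory, conv_wei_memory)
        .execute(s, user_wei_memory, conv_wei_memory);
  }

  // Step 4: the (possibly reordered) memories are what the conv primitive consumes.
  convolution_forward(conv_pd).execute(s, {{DNNL_ARG_SRC, conv_src_memory},
                                           {DNNL_ARG_WEIGHTS, conv_wei_memory},
                                           {DNNL_ARG_BIAS, bias_memory},
                                           {DNNL_ARG_DST, dst_memory}});
  s.wait();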

Could you help to make this work? Thanks!

dcslin avatar dcslin commented on July 23, 2024

Hi @chrishkchris, thank you for raising the issue. I have tried your branch; the 'reorder' function is behaving very strangely. It seems that it does not work well with the memory we passed in.

chrishkchris avatar chrishkchris commented on July 23, 2024

@dcslin I just realized that we should use the cpp example instead of the c example, because the APIs are different:
https://intel.github.io/mkl-dnn/cnn_training_f32_8cpp-example.html

chrishkchris avatar chrishkchris commented on July 23, 2024

> Hi @chrishkchris, thank you for raising the issue. I have tried your branch; the 'reorder' function is behaving very strangely. It seems that it does not work well with the memory we passed in.

Yes, sorry. From reading the cpp example: the way to create a reorder primitive in cpp is simply
reorder(conv_user_src_memory, conv_src_memory)
to convert the format from the source format to the conv kernel format. To execute, pass the stream and the parameters:
{{DNNL_ARG_FROM, conv_user_src_memory},
{DNNL_ARG_TO, conv_src_memory}}
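
Roughly like this, as a sketch using the names above (conv_user_src_memory holds the data in the user's layout, conv_src_memory is in the layout the conv primitive expects):

  dnnl::stream s(eng);

  // reorder(src, dst) builds the reorder primitive from the two memory objects;
  // execute() then takes the stream plus the {DNNL_ARG_FROM, DNNL_ARG_TO} arguments.
  auto r = dnnl::reorder(conv_user_src_memory, conv_src_memory);
  r.execute(s, {{DNNL_ARG_FROM, conv_user_src_memory},
                {DNNL_ARG_TO, conv_src_memory}});
  s.wait();  // make sure the reorder has finished before the convolution reads the data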

chrishkchris avatar chrishkchris commented on July 23, 2024

> Hi @chrishkchris, thank you for raising the issue. I have tried your branch; the 'reorder' function is behaving very strangely. It seems that it does not work well with the memory we passed in.

@dcslin
I updated the code; the reorder primitive is moved to here:
https://github.com/chrishkchris/singa/blob/conv_reorder/src/model/operation/convolution.cc#L162
Now the error log is:

[ RUN      ] DNNLOperation_Convolution.Forward
pass0
pass1
unknown file: Failure
C++ exception with description "could not create a memory" thrown in the test body.
[  FAILED  ] DNNLOperation_Convolution.Forward (0 ms)


dcslin avatar dcslin commented on July 23, 2024

Hi @chrishkchris , FYI
https://github.com/dcslin/singa/blob/conv_reorder/src/model/operation/convolution.cc

dcslin avatar dcslin commented on July 23, 2024

Hi @chrishkchris , I am still working on this issue, but I am encountering something very weird.

I borrowed the code from the dnnl example, and the testing result shows performance similar to mkldnn (conv forward in ~5000 microseconds).

However, after some modification to the example code (just introducing the Singa Tensor, dcslin@06ec6ce), I can still get reasonable performance (conv forward in ~5000 microseconds). But the weird part is that the result is not stable: out of 10 test runs, a segmentation fault appears ~6 times.

chrishkchris avatar chrishkchris commented on July 23, 2024

> Hi @chrishkchris , I am still working on this issue, but I am encountering something very weird.
> I borrowed the code from the dnnl example, and the testing result shows performance similar to mkldnn (conv forward in ~5000 microseconds).
> However, after some modification to the example code (just introducing the Singa Tensor, dcslin/singa@06ec6ce), I can still get reasonable performance (conv forward in ~5000 microseconds). But the weird part is that the result is not stable: out of 10 test runs, a segmentation fault appears ~6 times.

Maybe you can try adding s.wait() after each reorder, so the stream processes them one by one.

chrishkchris avatar chrishkchris commented on July 23, 2024

> Hi @chrishkchris , I am still working on this issue, but I am encountering something very weird.
> I borrowed the code from the dnnl example, and the testing result shows performance similar to mkldnn (conv forward in ~5000 microseconds).
> However, after some modification to the example code (just introducing the Singa Tensor, dcslin/singa@06ec6ce), I can still get reasonable performance (conv forward in ~5000 microseconds). But the weird part is that the result is not stable: out of 10 test runs, a segmentation fault appears ~6 times.

I ran your example code but did not find any problem; it displays mytestok. The last segmentation fault, "double free", is due to the freeing of variables.

[ RUN      ] MYTEST.Forward
[total]Time difference = 58228[mu s]
[avg]Time difference = 582[mu s]
mytestok
double free or corruption (!prev)
Aborted (core dumped)


dcslin avatar dcslin commented on July 23, 2024

> Hi @chrishkchris , I am still working on this issue, but I am encountering something very weird.
> I borrowed the code from the dnnl example, and the testing result shows performance similar to mkldnn (conv forward in ~5000 microseconds).
> However, after some modification to the example code (just introducing the Singa Tensor, dcslin/singa@06ec6ce), I can still get reasonable performance (conv forward in ~5000 microseconds). But the weird part is that the result is not stable: out of 10 test runs, a segmentation fault appears ~6 times.
>
> I ran your example code but did not find any problem; it displays mytestok. The last segmentation fault, "double free", is due to the freeing of variables.
>
> [ RUN      ] MYTEST.Forward
> [total]Time difference = 58228[mu s]
> [avg]Time difference = 582[mu s]
> mytestok
> double free or corruption (!prev)
> Aborted (core dumped)

Do you know how to fix this?

chrishkchris avatar chrishkchris commented on July 23, 2024

> Hi @chrishkchris , I am still working on this issue, but I am encountering something very weird.
> I borrowed the code from the dnnl example, and the testing result shows performance similar to mkldnn (conv forward in ~5000 microseconds).
> However, after some modification to the example code (just introducing the Singa Tensor, dcslin/singa@06ec6ce), I can still get reasonable performance (conv forward in ~5000 microseconds). But the weird part is that the result is not stable: out of 10 test runs, a segmentation fault appears ~6 times.
>
> I ran your example code but did not find any problem; it displays mytestok. The last segmentation fault, "double free", is due to the freeing of variables.
> [ RUN ] MYTEST.Forward
> [total]Time difference = 58228[mu s]
> [avg]Time difference = 582[mu s]
> mytestok
> double free or corruption (!prev)
> Aborted (core dumped)
>
> Do you know how to fix this?

OK, now I have tried turning the tensors into pointers:

  Tensor *in = new Tensor(Shape{batch, in_chan, image_h, image_h});
  Tensor *out = new Tensor(Shape{batch, out_chan, out_size, out_size});
  Tensor *weights = new Tensor(Shape{out_chan, in_chan, ker, ker});
  Tensor *bias = new Tensor(Shape{out_chan});
  in->SetValue(1.0f);
  weights->SetValue(1.0f);
  bias->SetValue(1.0f);

Now it displays:

[----------] 1 test from MYTEST
[ RUN      ] MYTEST.Forward
[total]Time difference = 51943[mu s]
[avg]Time difference = 519[mu s]
mytestok
[       OK ] MYTEST.Forward (53 ms)
[----------] 1 test from MYTEST (53 ms total)

So I suspect that when the dnnl memory object is destructed, it frees the memory from in->block()->mutable_data …
For the same reason, I suspect that the constructor of the dnnl memory object reallocates memory based on the pointer given.

The solution is to avoid passing the block memory pointer to the dnnl memory constructor; instead, we copy the block memory into the dnnl memory after the dnnl memory object is constructed.
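
A hypothetical sketch of that workaround (illustrative names, assuming conv_pd and eng as in the earlier sketches): construct the dnnl::memory without handing over the Tensor's block pointer, so DNNL allocates and owns its own buffer, then copy the block's data in:

  #include <cstring>

  // DNNL allocates and owns this buffer; the Tensor's block pointer is never
  // passed to the dnnl::memory constructor.
  dnnl::memory conv_src_mem(conv_pd.src_desc(), eng);

  // Copy the Tensor data into the DNNL-owned buffer.
  std::memcpy(conv_src_mem.get_data_handle(), in->block()->data(),
              conv_pd.src_desc().get_size());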

dcslin avatar dcslin commented on July 23, 2024

@chrishkchris , in fact, even after disabling the reorder part, the performance is still the same.

> Hi @chrishkchris , I am still working on this issue, but I am encountering something very weird.
> I borrowed the code from the dnnl example, and the testing result shows performance similar to mkldnn (conv forward in ~5000 microseconds).
> However, after some modification to the example code (just introducing the Singa Tensor, dcslin/singa@06ec6ce), I can still get reasonable performance (conv forward in ~5000 microseconds). But the weird part is that the result is not stable: out of 10 test runs, a segmentation fault appears ~6 times.
> I ran your example code but did not find any problem; it displays mytestok. The last segmentation fault, "double free", is due to the freeing of variables.
> [ RUN ] MYTEST.Forward
> [total]Time difference = 58228[mu s]
> [avg]Time difference = 582[mu s]
> mytestok
> double free or corruption (!prev)
> Aborted (core dumped)
> Do you know how to fix this?
>
> OK, now I have tried turning the tensors into pointers:
>
>   Tensor *in = new Tensor(Shape{batch, in_chan, image_h, image_h});
>   Tensor *out = new Tensor(Shape{batch, out_chan, out_size, out_size});
>   Tensor *weights = new Tensor(Shape{out_chan, in_chan, ker, ker});
>   Tensor *bias = new Tensor(Shape{out_chan});
>   in->SetValue(1.0f);
>   weights->SetValue(1.0f);
>   bias->SetValue(1.0f);
>
> Now it displays:
>
> [----------] 1 test from MYTEST
> [ RUN      ] MYTEST.Forward
> [total]Time difference = 51943[mu s]
> [avg]Time difference = 519[mu s]
> mytestok
> [       OK ] MYTEST.Forward (53 ms)
> [----------] 1 test from MYTEST (53 ms total)
>
> So I suspect that when the dnnl memory object is destructed, it frees the memory from in->block()->mutable_data …
> For the same reason, I suspect that the constructor of the dnnl memory object reallocates memory based on the pointer given.
>
> The solution is to avoid passing the block memory pointer to the dnnl memory constructor; instead, we copy the block memory into the dnnl memory after the dnnl memory object is constructed.

Will it cost double the memory usage?

dcslin avatar dcslin commented on July 23, 2024

Somehow I fixed conv forward:
dcslin@8e85fdc
It's now in the magnitude of ~5000 microseconds, which is similar to mkldnn.

I see backward also has a problem; still checking.

chrishkchris avatar chrishkchris commented on July 23, 2024

> Somehow I fixed conv forward:
> dcslin/singa@8e85fdc
> It's now in the magnitude of ~5000 microseconds, which is similar to mkldnn.
> I see backward also has a problem; still checking.

I have read the updated code. So the error was caused by dnnl::prop_kind::forward?

dcslin avatar dcslin commented on July 23, 2024

> Somehow I fixed conv forward:
> dcslin/singa@8e85fdc
> It's now in the magnitude of ~5000 microseconds, which is similar to mkldnn.
> I see backward also has a problem; still checking.
>
> I have read the updated code. So the error was caused by dnnl::prop_kind::forward?

Not sure.

dcslin avatar dcslin commented on July 23, 2024

@chrishkchris could you please help to test the performance of this branch? It looks OK on my side.
Let me know if the performance is OK:
https://github.com/dcslin/singa/tree/dnnl-perf-issue

I will clean up the code.

chrishkchris avatar chrishkchris commented on July 23, 2024

> @chrishkchris could you please help to test the performance of this branch? It looks OK on my side.
> Let me know if the performance is OK:
> https://github.com/dcslin/singa/tree/dnnl-perf-issue
> I will clean up the code.

Thanks a lot! I will test and let you know.

chrishkchris avatar chrishkchris commented on July 23, 2024

@dcslin Thank you so much!
It is much, much faster than before, and the training loss is correct.

root@71b7b910ae0b:~/dcsysh/singa/examples/autograd# python3 mnist_cnn.py
Starting Epoch 0:
Training loss = 577.733337, training accuracy = 0.796858
Evaluation accuracy = 0.937300, Elapsed Time = 139.402045s
Starting Epoch 1:
Training loss = 234.865036, training accuracy = 0.920891
Evaluation accuracy = 0.954327, Elapsed Time = 138.340753s
Starting Epoch 2:
Training loss = 174.085022, training accuracy = 0.941402
Evaluation accuracy = 0.973157, Elapsed Time = 138.636959s
Starting Epoch 3:
Training loss = 137.803268, training accuracy = 0.953475
Evaluation accuracy = 0.966246, Elapsed Time = 138.846091s
Starting Epoch 4:
Training loss = 116.644051, training accuracy = 0.961963
Evaluation accuracy = 0.974459, Elapsed Time = 139.520643s
Starting Epoch 5:
Training loss = 104.833313, training accuracy = 0.964915
Evaluation accuracy = 0.976562, Elapsed Time = 139.725914s
Starting Epoch 6:
Training loss = 95.696701, training accuracy = 0.967466
Evaluation accuracy = 0.979066, Elapsed Time = 141.454090s
Starting Epoch 7:
Training loss = 87.961937, training accuracy = 0.969784
Evaluation accuracy = 0.983474, Elapsed Time = 137.189445s
Starting Epoch 8:
Training loss = 81.685951, training accuracy = 0.972002
Evaluation accuracy = 0.982772, Elapsed Time = 138.908067s
Starting Epoch 9:
Training loss = 76.445320, training accuracy = 0.973986
Evaluation accuracy = 0.983073, Elapsed Time = 136.998992s

chrishkchris avatar chrishkchris commented on July 23, 2024

@dcslin I guess you will need to change test_operation_convolution.cc back.

dcslin avatar dcslin commented on July 23, 2024

> @dcslin I guess you will need to change test_operation_convolution.cc back.

Yes, sure.

chrishkchris avatar chrishkchris commented on July 23, 2024

The code is great! It resolved the problem. I think you can send the PR

chrishkchris avatar chrishkchris commented on July 23, 2024

This issue is resolved by PR #605.
