Comments (29)
The code is great! It resolved the problem. I think you can send the PR.
Thanks for testing.
from singa.
It does not hang; it is just too slow, much slower than the previous version.
Using mnist_cnn.py for CPU training:
The dev branch takes 14.469275s for one batch of training and one batch of evaluation.
The master branch takes 0.08s for one batch of training and one batch of evaluation.
So something may be wrong.
I profiled the time for different operations in mnist.py:
conv1: 0.5253171920776367
relu: 0.015198707580566406
pooling1: 0.0033032894134521484
conv2: 2.7908501625061035
relu: 0.004743099212646484
pooling2: 0.0012176036834716797
flatten: 0.00013828277587890625
linear1: 0.0011014938354492188
relu: 0.0008671283721923828
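For context, per-operation numbers like those above can be collected with simple wall-clock timing around each call; a minimal sketch of such a harness (the operation names are placeholders, not Singa's actual profiler):

```python
import time

def timed(name, fn, *args, **kwargs):
    """Run one operation, print its elapsed wall-clock time, and return its result."""
    start = time.perf_counter()
    out = fn(*args, **kwargs)
    print(f"{name}: {time.perf_counter() - start}")
    return out

# e.g. y = timed("conv1", conv1, x); y = timed("relu", relu, y); ...
```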
I think we need to double-check the conv2d implementation.
I have tried using GCC OpenMP and Intel TBB (Threading Building Blocks) when compiling DNNL from source.
The times are extremely slow (a normal time per epoch should be around a minute), but the training loss results are correct.
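For reference, the CPU threading runtime is chosen when configuring the DNNL build; roughly the two configurations tried here (the cmake options follow the oneDNN build docs, and the TBB path is illustrative):

```shell
# Configure DNNL with the GCC OpenMP runtime (the default)
cmake .. -DDNNL_CPU_RUNTIME=OMP && make -j

# Or with Intel TBB; TBBROOT must point at the TBB installation
cmake .. -DDNNL_CPU_RUNTIME=TBB -DTBBROOT=/opt/intel/tbb && make -j
```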
- GCC OpenMP
root@3edb30e30b08:~/dcsysh/singa/examples/autograd# python3 mnist_cnn.py
Starting Epoch 0:
Training loss = 564.547180, training accuracy = 0.800644
Evaluation accuracy = 0.931591, Elapsed Time = 1348.363244s
Starting Epoch 1:
Training loss = 229.964905, training accuracy = 0.922892
Evaluation accuracy = 0.959535, Elapsed Time = 1344.685418s
Starting Epoch 2:
Training loss = 163.646332, training accuracy = 0.944837
Evaluation accuracy = 0.973758, Elapsed Time = 1346.530425s
Starting Epoch 3:
Training loss = 135.699615, training accuracy = 0.954526
Evaluation accuracy = 0.970152, Elapsed Time = 1346.398193s
Starting Epoch 4:
Training loss = 115.944962, training accuracy = 0.962096
Evaluation accuracy = 0.968750, Elapsed Time = 1349.933991s
Starting Epoch 5:
Training loss = 102.581963, training accuracy = 0.965548
Evaluation accuracy = 0.976963, Elapsed Time = 1343.627475s
Starting Epoch 6:
Training loss = 91.995560, training accuracy = 0.969701
Evaluation accuracy = 0.980168, Elapsed Time = 1345.709435s
Starting Epoch 7:
Training loss = 85.334785, training accuracy = 0.971051
Evaluation accuracy = 0.977664, Elapsed Time = 1342.384448s
Starting Epoch 8:
Training loss = 81.609375, training accuracy = 0.972018
Evaluation accuracy = 0.981571, Elapsed Time = 1345.214866s
Starting Epoch 9:
Training loss = 76.690147, training accuracy = 0.974203
Evaluation accuracy = 0.977364, Elapsed Time = 1354.111479s
- Intel TBB (Threading Building Blocks)
root@3edb30e30b08:~/dcsysh/singa/examples/autograd# python3 mnist_cnn.py
Starting Epoch 0:
Training loss = 566.089539, training accuracy = 0.800527
Evaluation accuracy = 0.938201, Elapsed Time = 1571.624848s
Starting Epoch 1:
Training loss = 229.882874, training accuracy = 0.923192
Evaluation accuracy = 0.957833, Elapsed Time = 1569.219801s
Starting Epoch 2:
Training loss = 164.734573, training accuracy = 0.945137
Evaluation accuracy = 0.955929, Elapsed Time = 1567.359108s
Starting Epoch 3:
Training loss = 132.956802, training accuracy = 0.955310
Evaluation accuracy = 0.968550, Elapsed Time = 1572.159664s
Starting Epoch 4:
Training loss = 117.263237, training accuracy = 0.960646
Evaluation accuracy = 0.969151, Elapsed Time = 1570.090345s
Starting Epoch 5:
Training loss = 105.917274, training accuracy = 0.965115
Evaluation accuracy = 0.978466, Elapsed Time = 1569.966338s
Starting Epoch 6:
Training loss = 93.056519, training accuracy = 0.968700
Evaluation accuracy = 0.976362, Elapsed Time = 1571.289907s
Starting Epoch 7:
Training loss = 85.500954, training accuracy = 0.971101
Evaluation accuracy = 0.981771, Elapsed Time = 1572.169596s
- The old mkldnn in the master branch (results copied from PR #579)
ubuntu@ip-172-31-24-48:~/singa/examples/autograd$ python3 mnist_cnn.py
Starting Epoch 0:
Training loss = 585.431152, training accuracy = 0.791739
Evaluation accuracy = 0.930088, Elapsed Time = 55.447133s
Starting Epoch 1:
Training loss = 232.831589, training accuracy = 0.922158
Evaluation accuracy = 0.967949, Elapsed Time = 55.337850s
Starting Epoch 2:
Training loss = 166.067307, training accuracy = 0.945788
Evaluation accuracy = 0.968550, Elapsed Time = 55.367847s
Starting Epoch 3:
Training loss = 136.865341, training accuracy = 0.954092
Evaluation accuracy = 0.973357, Elapsed Time = 55.358584s
Starting Epoch 4:
Training loss = 118.813286, training accuracy = 0.960195
Evaluation accuracy = 0.979567, Elapsed Time = 55.270505s
Starting Epoch 5:
Training loss = 106.185112, training accuracy = 0.964481
Evaluation accuracy = 0.975962, Elapsed Time = 55.281344s
Starting Epoch 6:
Training loss = 94.444023, training accuracy = 0.968016
Evaluation accuracy = 0.980970, Elapsed Time = 55.081426s
Starting Epoch 7:
Training loss = 88.213493, training accuracy = 0.970418
Evaluation accuracy = 0.982873, Elapsed Time = 54.912524s
Starting Epoch 8:
Training loss = 81.126442, training accuracy = 0.972886
Evaluation accuracy = 0.981470, Elapsed Time = 54.907317s
Starting Epoch 9:
Training loss = 77.790993, training accuracy = 0.974236
Evaluation accuracy = 0.974159, Elapsed Time = 54.915229s
So this dnnl build may be one to two orders of magnitude slower than the old mkldnn (about 25x per epoch here, and about 180x in the single-batch test above)?
I have found the reason: the memory format tag should be any in order to use the direct algorithm; otherwise the implementation silently falls back to an explicit GEMM algorithm, which is slower:
https://intel.github.io/mkl-dnn/dev_guide_convolution.html
Example code for a correct implementation is here:
https://intel.github.io/mkl-dnn/cnn_training_f32_8cpp-example.html
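The idea, sketched with the DNNL 1.x C++ API (shapes, strides, and padding here are illustrative, not Singa's actual configuration):

```cpp
#include <dnnl.hpp>
using namespace dnnl;

// Describe every tensor with format_tag::any so DNNL is free to pick
// the memory layout that enables the direct convolution algorithm.
convolution_forward::primitive_desc make_conv_pd(engine &eng) {
  memory::desc src_md({8, 1, 28, 28}, memory::data_type::f32,
                      memory::format_tag::any);
  memory::desc wei_md({32, 1, 3, 3}, memory::data_type::f32,
                      memory::format_tag::any);
  memory::desc dst_md({8, 32, 28, 28}, memory::data_type::f32,
                      memory::format_tag::any);
  convolution_forward::desc d(prop_kind::forward_training,
                              algorithm::convolution_direct,
                              src_md, wei_md, dst_md,
                              /*strides*/ {1, 1},
                              /*padding*/ {1, 1}, {1, 1});
  // pd.src_desc() / pd.weights_desc() then report the layouts the
  // primitive actually chose; user data must be reordered into them.
  return convolution_forward::primitive_desc(d, eng);
}
```

If the tags are fixed (e.g. nchw) instead of any, the primitive descriptor can silently resolve to the slower GEMM-based implementation.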
@dcslin I will try to modify the code myself first. If I find it too complex, I will ask for your help, because you are more familiar with the dnnl library. Thanks!
I have tried to reorder the format before passing it into conv, but it seems I am not familiar enough with the DNNL API, and it is too difficult to debug.
See the reorder descriptor I was trying and the new memory descriptor I added:
https://github.com/chrishkchris/singa/blob/conv_reorder/src/model/operation/convolution.cc#L111
From my understanding, the key concept and step-by-step procedure are:
- Create memory descriptors with the tag::any format for the conv2d input, weight, bias, and output
- Create the conv primitive descriptor based on the memory descriptors created above
- Reorder the input and weight into the format chosen by the primitive descriptor
- The reordered input and weight can then be passed into the conv for processing
Could you help to make this work? Thanks!
Hi @chrishkchris, thank you for raising the issue. I have tried your branch, and the function 'order' is behaving very strangely. It seems it does not work well with the memory we passed in.
@dcslin I just realized that we should use the cpp example instead of the c example, because the APIs are different:
https://intel.github.io/mkl-dnn/cnn_training_f32_8cpp-example.html
Yes, sorry. From reading the cpp example, creating a reorder in the C++ API is simpler: the example code uses just
reorder(conv_user_src_memory, conv_src_memory)
to convert the data from the source format to the conv kernel's format. To execute it, pass the stream and the arguments
{{DNNL_ARG_FROM, conv_user_src_memory},
{DNNL_ARG_TO, conv_src_memory}}
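Putting that together, a sketch of the reorder step (DNNL 1.x C++ API; names are illustrative), with a wait on the stream so the reorder completes before the convolution runs:

```cpp
#include <dnnl.hpp>
using namespace dnnl;

// Reorder user data into the layout the convolution primitive chose,
// but only when the two layouts actually differ.
memory prepare_conv_src(engine &eng, stream &s,
                        const convolution_forward::primitive_desc &pd,
                        memory &user_src) {
  if (pd.src_desc() == user_src.get_desc())
    return user_src;  // already in the right layout, nothing to do
  memory conv_src(pd.src_desc(), eng);
  reorder(user_src, conv_src)
      .execute(s, {{DNNL_ARG_FROM, user_src},
                   {DNNL_ARG_TO, conv_src}});
  s.wait();  // let the stream finish this reorder before the conv runs
  return conv_src;
}
```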
@dcslin
I updated the code; the reorder primitive has been moved here:
https://github.com/chrishkchris/singa/blob/conv_reorder/src/model/operation/convolution.cc#L162
Now the error log is:
[ RUN ] DNNLOperation_Convolution.Forward
pass0
pass1
unknown file: Failure
C++ exception with description "could not create a memory" thrown in the test body.
[ FAILED ] DNNLOperation_Convolution.Forward (0 ms)
Hi @chrishkchris, FYI:
https://github.com/dcslin/singa/blob/conv_reorder/src/model/operation/convolution.cc
Hi @chrishkchris, I am still working on this issue, but I am encountering something very strange.
I borrowed the code from the dnnl example, and testing shows performance similar to mkldnn (conv forward in ~5000 microseconds).
However, after some modification to the example code (just introducing the Singa Tensor, dcslin@06ec6ce), I can still get reasonable performance (conv forward in ~5000 microseconds). But the strange part is that the result is not stable: out of 10 test runs, a segmentation fault appears ~6 times.
Maybe you can try adding s.wait() after each reorder, so the stream processes them one by one.
I ran your example code but did not find any problem; it displays mytestok. The last segmentation fault ("double free") is due to the freeing of variables.
[ RUN ] MYTEST.Forward
[total]Time difference = 58228[mu s]
[avg]Time difference = 582[mu s]
mytestok
double free or corruption (!prev)
Aborted (core dumped)
Do you know how to fix this?
OK, it works now; I tried turning the tensors into pointers:
Tensor *in = new Tensor(Shape{batch, in_chan, image_h, image_h});
Tensor *out = new Tensor(Shape{batch, out_chan, out_size, out_size});
Tensor *weights = new Tensor(Shape{out_chan, in_chan, ker, ker});
Tensor *bias = new Tensor(Shape{out_chan});
in->SetValue(1.0f);
weights->SetValue(1.0f);
bias->SetValue(1.0f);
Now it displays:
[----------] 1 test from MYTEST
[ RUN ] MYTEST.Forward
[total]Time difference = 51943[mu s]
[avg]Time difference = 519[mu s]
mytestok
[ OK ] MYTEST.Forward (53 ms)
[----------] 1 test from MYTEST (53 ms total)
So I suspect that when the dnnl memory object is destructed, it frees the memory from in->block()->mutable_data …
For the same reason, I suspect that the constructor of the dnnl memory object reallocates memory based on the pointer given.
The solution is to avoid passing the block memory pointer to the dnnl memory constructor; instead, we copy the block memory into the dnnl memory after the dnnl memory object is constructed.
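One way to sketch that workaround (an illustration, not Singa's final fix, and assuming the Block's size matches md.get_size()): let the dnnl::memory allocate and own its buffer, then copy the tensor data in.

```cpp
#include <dnnl.hpp>
#include <cstring>
using namespace dnnl;

// Copy the tensor's bytes into a dnnl::memory that owns its own
// allocation, so destroying the dnnl object cannot free Singa's Block.
memory make_owned_copy(const memory::desc &md, engine &eng,
                       const void *block_ptr) {
  memory m(md, eng);  // DNNL allocates its own storage here
  std::memcpy(m.get_data_handle(), block_ptr, md.get_size());
  return m;
}
```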
@chrishkchris, in fact, even after disabling the reorder part, the performance is still the same.
Will it cost double the memory usage?
Somehow I fixed conv forward:
dcslin@8e85fdc
It's now in the magnitude of ~5000 microseconds, which is similar to mkldnn.
I see backward also has a problem; still checking.
I have read the updated code.
So was the error caused by dnnl::prop_kind::forward?
Not sure.
@chrishkchris could you please help to test the performance of this branch? It looks OK on my side.
Let me know if the performance is OK:
https://github.com/dcslin/singa/tree/dnnl-perf-issue
I will clean up the code.
Thanks a lot! I will test it and let you know.
@dcslin Thank you so much!
It is much, much faster than before, and the training loss is correct.
root@71b7b910ae0b:~/dcsysh/singa/examples/autograd# python3 mnist_cnn.py
Starting Epoch 0:
Training loss = 577.733337, training accuracy = 0.796858
Evaluation accuracy = 0.937300, Elapsed Time = 139.402045s
Starting Epoch 1:
Training loss = 234.865036, training accuracy = 0.920891
Evaluation accuracy = 0.954327, Elapsed Time = 138.340753s
Starting Epoch 2:
Training loss = 174.085022, training accuracy = 0.941402
Evaluation accuracy = 0.973157, Elapsed Time = 138.636959s
Starting Epoch 3:
Training loss = 137.803268, training accuracy = 0.953475
Evaluation accuracy = 0.966246, Elapsed Time = 138.846091s
Starting Epoch 4:
Training loss = 116.644051, training accuracy = 0.961963
Evaluation accuracy = 0.974459, Elapsed Time = 139.520643s
Starting Epoch 5:
Training loss = 104.833313, training accuracy = 0.964915
Evaluation accuracy = 0.976562, Elapsed Time = 139.725914s
Starting Epoch 6:
Training loss = 95.696701, training accuracy = 0.967466
Evaluation accuracy = 0.979066, Elapsed Time = 141.454090s
Starting Epoch 7:
Training loss = 87.961937, training accuracy = 0.969784
Evaluation accuracy = 0.983474, Elapsed Time = 137.189445s
Starting Epoch 8:
Training loss = 81.685951, training accuracy = 0.972002
Evaluation accuracy = 0.982772, Elapsed Time = 138.908067s
Starting Epoch 9:
Training loss = 76.445320, training accuracy = 0.973986
Evaluation accuracy = 0.983073, Elapsed Time = 136.998992s
@dcslin I guess you will need to change test_operation_convolution.cc back.
Yes, sure.
This issue can be resolved by PR #605