Comments (16)
Hi @Mjonir
Thanks for your interest.
For arm_convolve_HWC_q15_basic_nonsquare there was a mistake which I will fix in the next commit. Commenting it out does not affect the other functions.
It looks like your CMSIS-NN build is not using the optimised paths inside these 'opt' functions, which are guarded by the macro ARM_MATH_DSP.
For example, you need to define the following global macros for CMSIS to enable ARM_MATH_DSP:
`ARM_MATH_CM4, __FPU_PRESENT=1`
There is some info in the optimisation guide under the HWC format section; more can be found on the CMSIS documentation website.
Regarding the difference between the local backend and CMSIS: on a similar (but smaller) example I did not see a gap as large as in your case, though it is possible there are some bugs in the local version. The performance difference between CMSIS and local is about 4x in that example.
Could you provide more info, such as your weights.h and the range of your inputs? You can use layer_callback() to print the intermediate activations of each layer, then compare the CMSIS and local versions.
See nnom/examples/auto_test/main.c, line 67 (commit ce83a10).
I will try different setups and see if I can replicate the problem you met.
Thanks
Jianjia
from nnom.
Hello again,
ARM_MATH_DSP is indeed enabled (through __ARM_FEATURE_DSP being a built-in define of the compiler, and then in arm_math_types.h:88), and I have confirmed with the debugger that the code steps through the corresponding part of arm_convolve_HWC_q7_fast_nonsquare with DSP enabled.
I'm attaching the quantized model and the input vectors used for testing and generating the logs of the previous post so you can run it on your end:
model.zip
I'm using the latest revision of the NNoM master branch, and the latest revision of the CMSIS develop branch. The input (a 1D time series of signed sensor data) didn't need to be quantized, as it is originally encoded in int8_t format using the full range.
I will look more into the CMSIS-NN documentation and examples to see if I can find something out of place performance-wise. It would be great to know if you observe the same downgrade/difference if you have the time.
Thanks again for your fast support!
Hi, I tested the performance of your attached model (weights.h) with and without CMSIS enabled on an STM32L476 @ 140MHz. You can scale the numbers for your nRF52840 @ 64MHz.
With CMSIS-NN and DSP enabled:
Summary:
Total ops (MAC): 14103872 (14.10M)
Prediction time: 406655us
Efficiency: 34.68 ops/us
Total memory: 47152
Total memory cost (network and NNoM): 47152
With the local backend:
Summary:
Total ops (MAC): 14103872 (14.10M)
Prediction time: 1499966us
Efficiency: 9.40 ops/us
Total memory: 49648
Total memory cost (network and NNoM): 49648
You can see that CMSIS is still several times faster than the local C backend.
You might not have successfully enabled CMSIS-NN's DSP support. I remember that on an nRF52832 it took some extra steps: I had to enable the FPU to use the DSP extension, and the FPU brought a lot of issues preventing the device from sleeping. Anyway, that is unrelated.
I haven't tested your data yet; I will update you later.
I checked your 3 logs (Keras/CMSIS/Local) but I am not sure I understand them. Could you explain a bit what you wanted to achieve?
Your output data has decimal bit = 1 (shown in weights.h: #define DENSE_2_OUTPUT_DEC 1).
So when you compare against Keras, remember to convert the numbers back to floating point: the actual output is model_output / 2^1.
Hello,
Thanks again for the comparison. I will dig deeper into the microcontroller configuration today and report back if I find anything.
As for the output, the network is a 2-class classifier. Each row of the output contains the raw outputs of the network for each class (before softmax). Since softmax is monotonic, the outputs can be directly compared to each other, and the higher of the two gives the predicted class. In the example, both the raw Keras model and the CMSIS implementation give the same predictions, with a count of 7 for class 1 and 15 for class 2 (and indeed the output of the CMSIS implementation is roughly round(keras_output*2)). In the log, the raw network outputs are displayed on the right and the cumulative count for each class on the left. The local implementation results are quite different in value; in the more dramatic cases of samples 8, 10 and 13 the prediction is even flipped from class 2 to class 1.
We also end up with a lot of tied values (-1|-1), which do not allow a prediction. This happens much less often in the CMSIS implementation, and in further testing only in cases where the network predicts 0 for both classes. It might be easily solved with a higher value resolution, which is why I was asking about manually tweaking the quantization earlier (we can tolerate quite a lot of saturation here). In the meantime I did find the option quantize_method='kld', but the results are pretty bad on my use case (it doesn't change the quantization of the output anyway, and it changes that of the other layers, completely destroying the predictions: the network always ends up predicting the same outputs). I also naively tried to modify the DENSE_2_OUTPUT_DEC define, but this doesn't seem to change the output values.
Hello,
I made some progress on the performance side of the problem by isolating it to the version of CMSIS in use. As you can see in the attached file, version 5.4.0 does yield the expected 5x increase in performance. However, as I update to newer revisions the performance drops slightly, with a catastrophic drop when upgrading from 5.6.0 to 5.7.0:
Comparison of CMSIS libraries performance
May I ask which version of CMSIS you have been using in testing, and in particular if you have tested with CMSIS 5.7.0 and above?
I will keep digging into the difference between these revisions and report back when I further isolate the issue.
Hi @Mjonir
Sorry, I was too busy this week and might also be busy next week.
I checked my environment: it uses CMSIS 5.7.0 (release) with NN lib 1.3.0 and DSP 1.8.0 (did you also include the DSP lib?), with these global macros enabled: USE_HAL_DRIVER, STM32L476xx, ARM_MATH_CM4, RT_USING_ARM_LIBC, __FPU_PRESENT=1.
It is very strange that the acceleration works for you only on versions before 5.7.0.
NNoM currently uses the old CMSIS-NN interface, which was frozen a few versions ago, so the CMSIS version shouldn't have a large effect there. Future development of NNoM will move to the new interface once it is stable and complete.
If you have per_channel_quant enabled, the Conv layers will use the local backend, because the old CMSIS-NN doesn't support per-channel quantisation. But I doubt this is the cause of the issue you met, since you still get the acceleration on older versions of CMSIS.
For accuracy, I didn't have time to test your data yet. But sometimes training for too many epochs can cause a loss of accuracy after quantisation, because extreme values affect the quantisation result, so the majority of the data doesn't get enough resolution.
I will try to compare the output between local and CMSIS to locate the issue. Also, please always scale the input data to -1 to 1 (necessary), and the output data if you can (optional). This helps the quantisation process and prevents a large shift in the first Conv layer.
Hello,
No worries, I very much appreciate the time you're taking to help me with this issue.
To be clear, CMSIS versions up to and including 5.6.0 work with acceleration, while versions 5.7.0 and above do not. I do include the DSP headers, but no ".c" file; the prebuilt library (libarm_cortexM4lf_math.a) does not seem necessary, and even if I link it as a test, the result is identical for both revisions. I have not enabled the per_channel_quant option. I have also tried replicating your global macros (I was already using ARM_MATH_CM4 and __FPU_PRESENT) with no change to the result either.
I have raised a CMSIS issue, since the problem seems to come solely from the implementation differences between CMSIS 5.6.0 and 5.7.0, with no interface change.
"But sometimes training for too many epochs can cause a loss of accuracy after quantisation, because extreme values affect the quantisation result, so the majority of the data doesn't get enough resolution."
The output of my network indeed has some extreme values leading to low resolution of the quantized outputs. Besides changing the training, would there be a way to manually tweak the quantization to increase the output resolution (saturating the extreme values is okay)?
"Also, please always scale the input data to -1 to 1 (necessary), and the output data if you can (optional). This helps the quantisation process and prevents a large shift in the first Conv layer."
To be sure I understand: my input is already naturally quantized in int8 format ([-128;+127]). Are you suggesting that I convert it to float in the [-1;1] range, retrain the network, and quantize that model instead?
Also, how would you suggest I do the same with the output? My original network had a final Softmax() layer, but it caused numerical instabilities during training due to a Log() being applied in my loss function. I therefore changed it so that the model outputs its raw values and I apply the more stable LogSoftmax() in the loss function.
Thank you again !
Yes, please convert the data for training; it will help the conv calculation.
Then you can use the normal data on the MCU for inference (the same as what you are doing).
(From Jianjia: sorry, it was my mistake to edit your comment; I meant to create a new comment.)
"To be sure I understand: my input is already naturally quantized in int8 format ([-128;+127]). Are you suggesting that I convert it to float in the [-1;1] range, retrain the network, and quantize that model instead?"
Yes, please convert the data for training; it will help the conv calculation.
Then you can use the normal data on the MCU for inference (the same as what you are doing, no need to change the MCU side).
I attempted to reduce the input range to [-1;1] as suggested. With the CMSIS backend, the predictions and performance are similar to the previous version. Is this expected?
However, with the local backend, the predictions degenerate to always giving the same values. I attach the weights file and logs if you want to test on your end:
float_test_2.zip
I tested your previous model file and dumped all the outputs for the first data sample.
I compared them in VSCode: all the data are identical, but neither matches the output from your TF model.
May I know your settings in nnom_port.h?
LOCAL.log
CMSISNN.log
The project files are here:
project.zip
I will test your new model and see what the difference is; they should be the same.
You can also export the output of each layer using the method I showed in the code above, then compare it to the TF output.
Hello again,
I generated the activation logs for both, and looking at them it quickly became apparent that the problem was a rounding difference. I had NNOM_TRUNCATE defined (as per the default settings); undefining it leads to both implementations returning the same activations, and the correct predictions:
Activation logs and fixed settings
It wasn't clear to me that NNOM_TRUNCATE would only affect the local backend. Is there a reason for it to be enabled by default?
It looks like a mistake. I checked the latest commit related to NNOM_TRUNCATE, where I was trying to synchronise the options between local and CMSIS: 7bc2bec
But I just found out that CMSIS uses floor by default, so I did it the opposite way.
We could just delete this option from port.h, since we use floor when quantising data in the script (if I remember correctly).
Thanks for helping me locate this issue. I will update it in my next commit.
As answered in the CMSIS issue, the "-fno-builtin" compiler option prevents the compiler from optimizing memcpy/memset calls. Removing this option restores the expected performance with CMSIS 5.7.0+.
Thank you, good to know @Mjonir