Comments (16)
Hi @Mjonir
Thanks for your interest.
For arm_convolve_HWC_q15_basic_nonsquare there was a mistake which I will fix in the next commit. Commenting it out does not affect the other functions.
It looks like your CMSIS-NN build is not using the optimised paths inside these 'opt' functions, which are guarded by the macro ARM_MATH_DSP.
For example, you need to define the following global macros for CMSIS to enable ARM_MATH_DSP:
`ARM_MATH_CM4, __FPU_PRESENT=1`
There is some info in the optimisation guide under the HWC format section; more can be found on the CMSIS documentation website.
Regarding the difference between the local backend and CMSIS: on a similar (but smaller) example I did not see a gap as large as in your case, though it is possible there are some bugs in the local version. The performance difference between CMSIS and local is about 4x in that example.
Could you provide more info, such as your weights.h and the range of your inputs? You can use layer_callback() to print the intermediate activations of each layer, then compare the CMSIS and local versions.
See nnom/examples/auto_test/main.c, line 67 (commit ce83a10).
I will try different setups and see if I can replicate the problem you met.
Thanks
Jianjia
from nnom.
Hello again,
ARM_MATH_DSP is indeed enabled (through __ARM_FEATURE_DSP being a built-in define of the compiler, and then in arm_math_types.h:88), and I have confirmed with the debugger that the code steps through the corresponding part of arm_convolve_HWC_q7_fast_nonsquare with DSP enabled.
I'm attaching the quantized model and the input vectors used for testing and generating the logs of the previous post so you can run it on your end:
model.zip
I'm using the latest revision of the NNoM master branch, and the latest revision of the CMSIS develop branch. The input (a 1D time series of signed sensor data) didn't need to be quantized, as it is originally encoded in int8_t format using the full range.
I will look more into the CMSIS-NN documentation and examples to see if I can find something out of place performance-wise. It would be great to know if you observe the same downgrade/difference if you have the time.
Thanks again for your fast support!
Hi, I tested the performance of your attached model (weights.h) with and without CMSIS enabled on an STM32L476 @ 140MHz. You can scale the numbers for your nRF52840 @ 64MHz.
With CMSIS-NN and DSP enabled:
Summary:
Total ops (MAC): 14103872 (14.10M)
Prediction time: 406655us
Efficiency: 34.68 ops/us
Total memory: 47152
Total memory cost (network and NNoM): 47152
With the local backend:
Summary:
Total ops (MAC): 14103872 (14.10M)
Prediction time: 1499966us
Efficiency: 9.40 ops/us
Total memory: 49648
Total memory cost (network and NNoM): 49648
You can see that CMSIS is still several times faster than the local C backend.
You might not have successfully enabled CMSIS-NN's DSP support. I remember that on an nRF52832 it took some extra steps: I had to enable the FPU to use the DSP extension, and the FPU brought a lot of issues preventing the device from sleeping. Anyway, that is unrelated.
I haven't tested your data yet; I will update you later.
I checked your 3 logs (Keras/CMSIS/Local) but I am not sure I understand them. Could you explain a bit what you wanted to achieve?
Your output data has decimal bit = 1 (shown in weights.h: #define DENSE_2_OUTPUT_DEC 1).
So when you compare against Keras, remember to convert the numbers back to floating point: the actual output is model_output / 2^1.
Hello,
Thanks again for the comparison. I will dig deeper into the microcontroller configuration today and report back if I find anything.
As for the output, the network is a 2-class classifier. Each row of the output contains the raw outputs of the network for each class (before softmax). Since softmax is monotonic, the outputs can be directly compared to each other, and the higher of the two gives the predicted class. In the example, both the raw Keras model and the CMSIS implementation give the same predictions, with a count of 7 for class 1 and 15 for class 2 (and indeed the output of the CMSIS implementation is roughly round(keras_output*2)). In the log, the raw network outputs are displayed on the right and the cumulative count for each class on the left. The local implementation results are quite different in value; in the more dramatic cases of samples 8, 10 and 13 the prediction is even flipped from class 2 to class 1.
We also end up with a lot of tied values (-1|-1), which do not allow a prediction. This happens much less often in the CMSIS implementation, and in further testing only in cases where the network predicts 0 for both classes. It might be easily solved with a higher value resolution, which is why I was asking about manually tweaking the quantization earlier (we can tolerate quite a lot of saturation here). In the meantime I did find the option quantize_method='kld', but the results are pretty bad on my use case (it doesn't change the quantization of the output anyway, and it changes that of the other layers, completely destroying the predictions: the network always ends up predicting the same outputs). I also naively tried to modify the DENSE_2_OUTPUT_DEC define, but this doesn't seem to change the output values.
Hello,
I made some progress on the performance side of the problem by isolating it to the version of CMSIS in use. As you can see in the attached file, version 5.4.0 does yield the expected 5x increase in performance. However, as I update to newer revisions the performance drops slightly, with a catastrophic drop when upgrading from 5.6.0 to 5.7.0:
Comparison of CMSIS libraries performance
May I ask which version of CMSIS you have been using in testing, and in particular if you have tested with CMSIS 5.7.0 and above?
I will keep digging into the difference between these revisions and report back when I further isolate the issue.
Hi @Mjonir
Sorry, I was too busy this week and might also be busy next week.
I checked my environment: it uses CMSIS 5.7.0 (release) with NN lib 1.3.0 and DSP 1.8.0 (did you also include the DSP lib?), with these global macros enabled: USE_HAL_DRIVER, STM32L476xx, ARM_MATH_CM4, RT_USING_ARM_LIBC, __FPU_PRESENT=1.
It is very strange that the acceleration works for you only on versions before 5.7.0.
NNoM currently uses the old CMSIS-NN interface, which was frozen a few versions ago, so the CMSIS version shouldn't have a large effect there. Future development of NNoM will move to the new interface once it is stable and complete.
If you have per_channel_quant enabled, the Conv layers will use the local backend, because the old CMSIS-NN doesn't support per-channel quantisation. But I doubt this is the cause of the issue you met, since you still get the acceleration on older versions of CMSIS.
For accuracy, I didn't have time to test your data yet. But sometimes training for too many epochs can cause a loss of accuracy after quantisation, because extreme values affect the quantisation result, so the majority of the data doesn't get enough resolution.
I will try to compare the output between local and CMSIS to locate the issue. Also, please always scale the input data to -1 to 1 (necessary), and the output data if you can (optional). This helps the quantisation process and prevents a large shift in the first Conv layer.
Hello,
No worries, I very much appreciate the time you're taking to help me with this issue.
To be clear, CMSIS versions up to and including 5.6.0 work with acceleration, while versions 5.7.0 and above do not. I do include the DSP headers, but no ".c" file; the prebuilt library (libarm_cortexM4lf_math.a) does not seem necessary, and even if I link it as a test, the result is identical for both revisions. I have not enabled the per_channel_quant option. I have also tried replicating your global macros (I was already using ARM_MATH_CM4 and __FPU_PRESENT) with no change to the result either.
I have raised a CMSIS issue, since the problem seems to come solely from the implementation differences between CMSIS 5.6.0 and 5.7.0, with no interface change.
"But sometimes training for too many epochs can cause a loss of accuracy after quantisation, because extreme values affect the quantisation result, so the majority of the data doesn't get enough resolution."
The output of my network indeed has some extreme values leading to low resolution of the quantized outputs. Besides changing the training, would there be a way to manually tweak the quantization to increase the output resolution (saturating the extreme values is okay)?
"Also, please always scale the input data to -1 to 1 (necessary), and the output data if you can (optional). This helps the quantisation process and prevents a large shift in the first Conv layer."
To be sure I understand: my input is already naturally quantized in int8 format ([-128;+127]). Are you suggesting that I convert it to float in the [-1;1] range, retrain the network, and quantize that model instead?
Also, how would you suggest I do the same with the output? My original network had a final Softmax() layer, but it caused numerical instabilities during training due to a Log() being applied in my loss function. I therefore changed it so that the model outputs its raw values and I apply the more stable LogSoftmax() in the loss function.
Thank you again !
Yes, please convert the data for training; it will help the conv calculation.
Then you can use the normal data on the MCU for inference (the same as what you are doing).
(From Jianjia: sorry, it was my mistake to edit your comment; I meant to create a new comment.)
"To be sure I understand: my input is already naturally quantized in int8 format ([-128;+127]). Are you suggesting that I convert it to float in the [-1;1] range, retrain the network, and quantize that model instead?"
Yes, please convert the data for training; it will help the conv calculation.
Then you can use the normal data on the MCU for inference (the same as what you are doing, no need to change the MCU side).
I attempted to reduce the input range to [-1;1] as suggested. With the CMSIS backend, the predictions and performance are similar to the previous version. Is this expected?
However, with the local backend, the predictions degenerate to always giving the same values. I attach the weights file and logs if you want to test on your end:
float_test_2.zip
I tested your previous model file and dumped all the outputs for the first data sample.
I compared them in VSCode: all the data are identical, but neither matches the output from your TF model.
May I know your settings in nnom_port.h?
LOCAL.log
CMSISNN.log
The project files are here:
project.zip
I will test your new model and see what the difference is; they should be the same.
You can also export the output of each layer using the method I showed in the code above, then compare it to the TF output.
Hello again,
I generated the activation logs for both, and looking at them it quickly became apparent that the problem was a rounding difference. I had NNOM_TRUNCATE defined (as per the default settings); undefining it leads to both implementations returning the same activations, and the correct predictions:
Activation logs and fixed settings
It wasn't clear to me that NNOM_TRUNCATE would only affect the local backend. Is there a reason for it to be enabled by default?
It looks like a mistake. I checked the latest commit related to NNOM_TRUNCATE, where I was trying to synchronise the options between local and CMSIS: 7bc2bec
But I just found out that CMSIS uses floor by default, so I did it the opposite way.
We could just delete this option from port.h, since we use floor when quantising data in the script (if I remember correctly).
Thanks for helping me locate this issue. I will update it in my next commit.
As answered in the CMSIS issue, the "-fno-builtin" compiler option prevents the compiler from optimizing memcpy/memset calls. Removing this option restores the expected performance with CMSIS 5.7.0+.
Thank you, good to know @Mjonir