Profiling IF_curve_LIF.py
and IF_curve_Hodgkin_Huxley.py
examples from brian2
shows that the performance bottleneck is the spikemonitor kernel.
Below are profiling results (using nvprof
) for three different implementation szenarios.
- The unchanged not-parallel spikespace, using the unchanged spikemonitor (that is assuming a parallel spikespace) (dee9bf7)
- A quick and dirty modified spikemonitor, that now assumes a not-parallel spikespace (and should now also return sorted data automatically) (branch
issue10_sorted_spikemonitor
, 609005d)
- The parallel spikespace (from the slower thresholder implementation with shared
atomicAdd
s from issue #9) with the spikemonitor from 1. (that assumes the parallel spikespace) (branch issue9_spikespace
, eb215d4)
Some of the results using N=10000
neurons in the IF_curve_LIF.py
example:
Kernel |
1. |
2. |
3. |
run_spikemonitor |
280 us (78%) |
250 us (79%) |
213 us (73%) |
copy_spikemonitor |
670 ms (19%) |
540 ms (17%) |
665 ms (23%) |
thresholder |
2.6 us (0.7%) |
2 us (0.8%) |
5.3 us (1.8%) |
times are [average times per kernel call] and percentages are [time spent in that kernel (all kernel calls) / total time]:
That means:
- For all of the above scenarios the spikemonitor kernel is the bottleneck (especially when simulating many neurons) and performance only varies around 20% between them.
- Weather the thresholder is filling the spikespace in parallel or not really doesn't matter, since even for many neurons it takes up blow 2% of the time
- Even though the copy_spikemonitor kernel is only called once, it actually takes quite long if there are a lot of neurons.
My conclusions:
- No need for a parallel spikespace, it doesn't solve the "slowness" of the spikemonitor
- The serialized spikemonitor (2.) might even be a little faster with a little clean up (right now there is some unnecessary copying happening). And then we could advance to other topics
- The following might actually increase performance: In the thresholder just use a register, that is either
-1
(if the neuron didn't spike) or threadID
if it spiked and just write it to the spikespace at idx=threadID
. This way the spikespace has a lot of -1
values inbetween neuronID values, but we don't need any atomics. Then use the thrust::copy_if
function, that returns a vector of all not--1
values from the spikespace and the size of that vector (using some parallel GPU algorithm). And then just add that returned vector to a the spikemonitor (which can then just be thrust::device_vector
). We should talk about this approach.
Below are the detailed profiling results:
Profiling of IF_curve_Hodgkin_Huxley.py
for different number of neurons N
, using the unmodified spikemonitor (that assumes a parallel spikespace) (1.):
HODGKIN HUXLEY, N=100
==10548== Profiling result:
Time(%) Time Calls Avg Min Max Name
52.48% 291.81ms 20000 14.590us 13.440us 17.024us kernel_neurongroup_stateupdater_codeobject(unsigned int, double*, double*, char*, double*, double*, double*, double*)
24.20% 134.56ms 20000 6.7270us 1.6640us 30.271ms _run_spikemonitor_codeobject_kernel(unsigned int, unsigned int, unsigned int, int*, double*, int*, int*, int*, int, int*, double*, int, int*, int*)
10.57% 58.748ms 60024 978ns 928ns 2.6880us [CUDA memcpy HtoD]
8.40% 46.715ms 20000 2.3350us 1.6960us 2.7840us kernel_neurongroup_thresholder_codeobject(unsigned int, int*, double*, double*, double*, char*)
4.16% 23.111ms 1 23.111ms 23.111ms 23.111ms _copy_spikemonitor_codeobject_kernel(int*, double*, unsigned int)
0.16% 878.66us 1 878.66us 878.66us 878.66us _run_spikemonitor_codeobject_init(unsigned int)
0.01% 80.896us 1 80.896us 80.896us 80.896us _run_debugmsg_spikemonitor_codeobject_kernel(unsigned int)
0.01% 77.632us 18 4.3120us 2.0800us 26.752us [CUDA memcpy DtoH]
0.00% 5.3120us 1 5.3120us 5.3120us 5.3120us _count_spikemonitor_codeobject_kernel(double*, int*, int*, int*, int, int*, double*, int, int*, int*, unsigned int, unsigned int*)
0.00% 3.8080us 1 3.8080us 3.8080us 3.8080us _kernel_neurongroup_group_variable_set_conditional_codeobject(unsigned int, unsigned int, int*, double*)
HODGKIN HUXLEY, N=1000
==31099== Profiling result:
Time(%) Time Calls Avg Min Max Name
57.69% 828.03ms 20000 41.401us 1.5680us 225.32ms _run_spikemonitor_codeobject_kernel(unsigned int, unsigned int, unsigned int, int*, double*, int*, int*, int*, int, int*, double*, int, int*, int*)
18.55% 266.23ms 20000 13.311us 13.088us 16.128us kernel_neurongroup_stateupdater_codeobject(unsigned int, double*, double*, char*, double*, double*, double*, double*)
16.26% 233.40ms 1 233.40ms 233.40ms 233.40ms _copy_spikemonitor_codeobject_kernel(int*, double*, unsigned int)
4.04% 57.938ms 60024 965ns 928ns 3.1680us [CUDA memcpy HtoD]
3.36% 48.259ms 20000 2.4120us 1.7280us 3.0720us kernel_neurongroup_thresholder_codeobject(unsigned int, int*, double*, double*, double*, char*)
0.06% 876.13us 1 876.13us 876.13us 876.13us _run_spikemonitor_codeobject_init(unsigned int)
0.03% 434.05us 18 24.113us 2.1120us 251.49us [CUDA memcpy DtoH]
0.01% 81.472us 1 81.472us 81.472us 81.472us _run_debugmsg_spikemonitor_codeobject_kernel(unsigned int)
0.00% 5.2480us 1 5.2480us 5.2480us 5.2480us _count_spikemonitor_codeobject_kernel(double*, int*, int*, int*, int, int*, double*, int, int*, int*, unsigned int, unsigned int*)
Profiling of IF_curve_LIF.py
for different number of neurons `N``, using the unmodified spikemonitor (that assumes a parallel spikespace) (1.):
LIF, N=1000
==32636== Profiling result:
Time(%) Time Calls Avg Min Max Name
58.81% 235.41ms 10000 23.540us 1.3760us 60.282ms _run_spikemonitor_codeobject_kernel(unsigned int, unsigned int, unsigned int, int*, double*, int*, int*, int*, int, int*, double*, int, int*, int*)
16.55% 66.264ms 1 66.264ms 66.264ms 66.264ms _copy_spikemonitor_codeobject_kernel(int*, double*, unsigned int)
8.37% 33.502ms 10000 3.3500us 3.1680us 4.3200us kernel_neurongroup_stateupdater_codeobject(unsigned int, double*, double*, double*, double*, double*, char*)
7.30% 29.222ms 30021 973ns 928ns 4.8640us [CUDA memcpy HtoD]
5.08% 20.352ms 10000 2.0350us 1.4400us 2.6240us kernel_neurongroup_thresholder_codeobject(unsigned int, int*, double*, double*, double*, char*)
3.60% 14.400ms 10000 1.4400us 1.3120us 1.9200us kernel_neurongroup_resetter_codeobject(unsigned int, int*, char*, double*)
0.22% 879.91us 1 879.91us 879.91us 879.91us _run_spikemonitor_codeobject_init(unsigned int)
0.04% 151.84us 15 10.122us 2.1120us 73.184us [CUDA memcpy DtoH]
0.02% 81.056us 1 81.056us 81.056us 81.056us _run_debugmsg_spikemonitor_codeobject_kernel(unsigned int)
0.00% 5.5360us 1 5.5360us 5.5360us 5.5360us _count_spikemonitor_codeobject_kernel(double*, int*, int*, int*, int, int*, double*, int, int*, int*, unsigned int, unsigned int*)
0.00% 3.3920us 1 3.3920us 3.3920us 3.3920us _kernel_neurongroup_group_variable_set_conditional_codeobject(unsigned int, unsigned int, double*, int*)
LIF, N=10000
==1277== Profiling result:
Time(%) Time Calls Avg Min Max Name
78.12% 2.79040s 10000 279.04us 1.2800us 901.28ms _run_spikemonitor_codeobject_kernel(unsigned int, unsigned int, unsigned int, int*, double*, int*, int*, int*, int, int*, double*, int, int*, int*)
18.71% 668.47ms 1 668.47ms 668.47ms 668.47ms _copy_spikemonitor_codeobject_kernel(int*, double*, unsigned int)
1.11% 39.806ms 10000 3.9800us 3.9040us 5.4400us kernel_neurongroup_stateupdater_codeobject(unsigned int, double*, double*, double*, double*, double*, char*)
0.82% 29.177ms 30021 971ns 928ns 28.000us [CUDA memcpy HtoD]
0.72% 25.781ms 10000 2.5780us 1.7600us 3.4240us kernel_neurongroup_thresholder_codeobject(unsigned int, int*, double*, double*, double*, char*)
0.44% 15.810ms 10000 1.5810us 1.5040us 2.2400us kernel_neurongroup_resetter_codeobject(unsigned int, int*, char*, double*)
0.05% 1.7333ms 15 115.56us 2.1120us 1.2212ms [CUDA memcpy DtoH]
0.02% 881.92us 1 881.92us 881.92us 881.92us _run_spikemonitor_codeobject_init(unsigned int)
0.00% 81.280us 1 81.280us 81.280us 81.280us _run_debugmsg_spikemonitor_codeobject_kernel(unsigned int)
After a quick and dirty change of the spikemonitor (adapted for not parallel spikespace) (2.):
LIF, N=1000, SERIALIZED SPIKEMONITOR
==15351== Profiling result:
Time(%) Time Calls Avg Min Max Name
58.65% 214.83ms 10000 21.482us 1.5360us 55.489ms _run_spikemonitor_codeobject_kernel(unsigned int, int*, double*, int*, int*, int*, int, int*, double*, int, int*, int*)
14.67% 53.725ms 1 53.725ms 53.725ms 53.725ms _copy_spikemonitor_codeobject_kernel(int*, double*, bool)
9.15% 33.501ms 10000 3.3500us 3.1360us 4.3520us kernel_neurongroup_stateupdater_codeobject(unsigned int, double*, double*, double*, double*, double*, char*)
7.98% 29.222ms 30021 973ns 928ns 7.1680us [CUDA memcpy HtoD]
5.56% 20.374ms 10000 2.0370us 1.4400us 2.6240us kernel_neurongroup_thresholder_codeobject(unsigned int, int*, double*, double*, double*, char*)
3.91% 14.317ms 10000 1.4310us 1.3440us 1.9520us kernel_neurongroup_resetter_codeobject(unsigned int, int*, char*, double*)
0.04% 151.36us 15 10.090us 2.1120us 73.280us [CUDA memcpy DtoH]
0.02% 77.632us 1 77.632us 77.632us 77.632us _run_debugmsg_spikemonitor_codeobject_kernel(void)
0.01% 53.280us 1 53.280us 53.280us 53.280us _run_spikemonitor_codeobject_init(void)
0.00% 3.2960us 1 3.2960us 3.2960us 3.2960us _kernel_neurongroup_group_variable_set_conditional_codeobject(unsigned int, unsigned int, double*, int*)
LIF, N=10000, SERIALIZED SPIKEMONITOR
==11854== Profiling result:
Time(%) Time Calls Avg Min Max Name
79.31% 2.51796s 10000 251.80us 1.5680us 827.75ms _run_spikemonitor_codeobject_kernel(unsigned int, int*, double*, int*, int*, int*, int, int*, double*, int, int*, int*)
17.06% 541.50ms 1 541.50ms 541.50ms 541.50ms _copy_spikemonitor_codeobject_kernel(int*, double*, bool)
1.28% 40.739ms 10000 4.0730us 3.9680us 5.3760us kernel_neurongroup_stateupdater_codeobject(unsigned int, double*, double*, double*, double*, double*, char*)
0.92% 29.247ms 30021 974ns 928ns 28.032us [CUDA memcpy HtoD]
0.81% 25.818ms 10000 2.5810us 1.6640us 3.4240us kernel_neurongroup_thresholder_codeobject(unsigned int, int*, double*, double*, double*, char*)
0.56% 17.714ms 10000 1.7710us 1.6000us 2.1440us kernel_neurongroup_resetter_codeobject(unsigned int, int*, char*, double*)
0.05% 1.7398ms 15 115.99us 2.1120us 1.2275ms [CUDA memcpy DtoH]
0.00% 76.928us 1 76.928us 76.928us 76.928us _run_debugmsg_spikemonitor_codeobject_kernel(void)
0.00% 53.568us 1 53.568us 53.568us 53.568us _run_spikemonitor_codeobject_init(void)
Using the parallel spikespace (filled in parallel by the modified thresholder) and the parallel spikemonitor (3.):
LIF, N=1000, PARALLEL THRESHOLDER
==16310== Profiling result:
Time(%) Time Calls Avg Min Max Name
60.58% 268.46ms 10000 26.846us 1.3760us 15.050ms _run_spikemonitor_codeobject_kernel(unsigned int, unsigned int, unsigned int, int*, double*, int*, int*, int*, int, int*, double*, int, int*, int*)
15.26% 67.602ms 1 67.602ms 67.602ms 67.602ms _copy_spikemonitor_codeobject_kernel(int*, double*, unsigned int)
8.03% 35.578ms 10000 3.5570us 3.2960us 4.4160us kernel_neurongroup_stateupdater_codeobject(unsigned int, double*, double*, double*, double*, double*, char*)
6.64% 29.410ms 30021 979ns 928ns 3.5200us [CUDA memcpy HtoD]
5.62% 24.903ms 10000 2.4900us 1.7280us 5.1520us kernel_neurongroup_thresholder_codeobject(unsigned int, int*, double*, double*, double*, char*)
3.62% 16.056ms 10000 1.6050us 1.4080us 1.8880us kernel_neurongroup_resetter_codeobject(unsigned int, int*, char*, double*)
0.20% 876.83us 1 876.83us 876.83us 876.83us _run_spikemonitor_codeobject_init(unsigned int)
0.03% 151.74us 15 10.116us 2.1120us 73.248us [CUDA memcpy DtoH]
0.02% 81.664us 1 81.664us 81.664us 81.664us _run_debugmsg_spikemonitor_codeobject_kernel(unsigned int)
0.00% 5.6640us 1 5.6640us 5.6640us 5.6640us _count_spikemonitor_codeobject_kernel(double*, int*, int*, int*, int, int*, double*, int, int*, int*, unsigned int, unsigned int*)
0.00% 3.2640us 1 3.2640us 3.2640us 3.2640us _kernel_neurongroup_group_variable_set_conditional_codeobject(unsigned int, unsigned int, double*, int*)
LIF, N=10000, PARALLEL THRESHOLDER
==16906== Profiling result:
Time(%) Time Calls Avg Min Max Name
72.62% 2.13823s 10000 213.82us 1.3760us 112.15ms _run_spikemonitor_codeobject_kernel(unsigned int, unsigned int, unsigned int, int*, double*, int*, int*, int*, int, int*, double*, int, int*, int*)
22.60% 665.34ms 1 665.34ms 665.34ms 665.34ms _copy_spikemonitor_codeobject_kernel(int*, double*, unsigned int)
1.80% 53.087ms 10000 5.3080us 2.0160us 18.304us kernel_neurongroup_thresholder_codeobject(unsigned int, int*, double*, double*, double*, char*)
1.36% 39.906ms 10000 3.9900us 3.8720us 5.1520us kernel_neurongroup_stateupdater_codeobject(unsigned int, double*, double*, double*, double*, double*, char*)
0.99% 29.177ms 30021 971ns 928ns 28.128us [CUDA memcpy HtoD]
0.54% 15.887ms 10000 1.5880us 1.5360us 1.9840us kernel_neurongroup_resetter_codeobject(unsigned int, int*, char*, double*)
0.06% 1.7550ms 15 117.00us 2.1120us 1.2425ms [CUDA memcpy DtoH]
0.03% 879.14us 1 879.14us 879.14us 879.14us _run_spikemonitor_codeobject_init(unsigned int)
0.00% 81.632us 1 81.632us 81.632us 81.632us _run_debugmsg_spikemonitor_codeobject_kernel(unsigned int)