mn416 / qpulib

Language and compiler for the Raspberry Pi GPU

License: Other

C++ 89.96% C 9.09% Makefile 0.95%

raspberry-pi compiler qpu gpu vector

qpulib's Issues

Not an issue!

Wow, very cool to find this and learn about the vector processing capabilities of the RPi!
Now I just need to clarify what applications this would be great for, i.e. what types of programs would benefit from vectorisation?

Support for unsigned integers

Question received by email:

Hi, I would like to ask, if there is a way to use unsigned int with QPULib. I admit that I am new to all this, but I just wanted to implement SHA-256 with it and this seems to be my current problem.

I just pushed a function shr that will do an unsigned shift-right: 52ceae5.

E.g. in a kernel, you can now write:

Int a = 0x80000000;
Int b = a >> 1; // Sign-preserving shift
Int c = shr(a, 1); // Unsigned shift

Is this sufficient? (I believe all the other integer operations are equivalent for signed vs. unsigned, except for multiplication, for which the QPUs only support 24-bit unsigned multiplication, IIRC.)

If it's of any help, I also pushed an ror function for bitwise right-rotation (supported natively by the QPUs): 09bc62d.
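For comparison, the difference between the two shifts and the rotation can be reproduced in plain C++ on the host. This is only a sketch of the bit-level semantics for a single 32-bit lane, not QPULib kernel code; the helper names arithShr, logicalShr and rotr32 are made up for illustration.

```cpp
#include <cstdint>

// Sign-preserving (arithmetic) shift right, for 0 <= n < 32.
// Done with unsigned arithmetic so the result does not depend on
// how the compiler shifts negative signed values.
inline std::uint32_t arithShr(std::uint32_t x, unsigned n) {
    std::uint32_t shifted = x >> n;
    if ((x & 0x80000000u) != 0 && n > 0) {
        shifted |= ~(0xFFFFFFFFu >> n);  // fill the vacated high bits with the sign bit
    }
    return shifted;
}

// Unsigned (logical) shift right: vacated high bits become zero.
inline std::uint32_t logicalShr(std::uint32_t x, unsigned n) {
    return x >> n;
}

// Bitwise right rotation: bits shifted out on the right re-enter on the left.
inline std::uint32_t rotr32(std::uint32_t x, unsigned n) {
    n &= 31u;
    return n == 0 ? x : (x >> n) | (x << (32u - n));
}
```

With the value from the example above, arithShr(0x80000000, 1) yields 0xC0000000 while logicalShr(0x80000000, 1) yields 0x40000000, which is the distinction SHA-256 cares about.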

read after write by gather/receive doesn't work

I found that if I gather/receive from a pointer right after storing to it, the read fails, even if I call store twice, use assign, or use flush to make sure the store is synced.

If I do result = *(Ptr) instead, it works.

Is this intentionally unsupported, or is it a bug?

I tried the same code in emulation mode, and it works correctly.

On generating technical documentation for source code

@mn416 I did a test run with doxygen over the code in Lib. I put the result on my webserver for your perusal: QPULib documentation.

The thing is, it's very sparse, because no comments are actually present at the moment. Just about the only commented thing is Kernel::pretty(), which I added.

So, I'm actually wondering if this is useful at all. It only makes sense if we consistently add the comments. I honestly don't feel like doing this; I would prefer to add comments as I go, to document the understanding of the code as I see it right at that moment. I'm expecting that you don't want to add comments retroactively either.

I propose to just skip the document generation for now. I will put it as a long-term item on the TODO. Is this OK with you?

Makefile wrong -D flag

QPU_MODE should be enabled when QPU>1 is passed as a make argument.
The current Makefile instead sets -DEMULATION_MODE if QPU>1.

Suggestions on transferring big chunks of memory between CPU and QPU?

We achieved a 5x+ speed-up with QPULib for the main calculation part!

But we have to keep transferring data back and forth between CPU and QPU, which consumes half of the total time.

Following the example code, we can only do this with for{*(shared_array_ptr) = a;}, which seems too weak.

I think this is a fundamental operation that is worth additional optimization. Any suggestions on how to do it efficiently? For example, like what they do here: https://github.com/nineties/py-videocore/blob/f2a0ef174a936f7a6e11a9e24f34fb555acb84c7/videocore/assembler.py#L692

Cannot Compile

I'm using an updated Archlinux on a Raspberry Pi 1 B+ and compilation attempt is:

[root@SmallPi Tests]# make QPU=1 GCD
Compiling GCD.cpp
In file included from ../Lib/QPULib.h:6,
from GCD.cpp:2:
../Lib/Source/Ptr.h: In copy constructor ‘Ptr::Ptr(Ptr&)’:
../Lib/Source/Ptr.h:63:5: error: there are no arguments to ‘assign’ that depend on a template parameter, so a declaration of ‘assign’ must be available [-fpermissive]
assign(this->expr, x.expr);
^~~~~~
../Lib/Source/Ptr.h:63:5: note: (if you use ‘-fpermissive’, G++ will accept your code, but allowing the use of an undeclared name is deprecated)
../Lib/Source/Ptr.h: In copy constructor ‘Ptr::Ptr(const Ptr&)’:
../Lib/Source/Ptr.h:68:5: error: there are no arguments to ‘assign’ that depend on a template parameter, so a declaration of ‘assign’ must be available [-fpermissive]
assign(this->expr, x.expr);
^~~~~~
../Lib/Source/Ptr.h: In member function ‘Ptr& Ptr::operator=(Ptr&)’:
../Lib/Source/Ptr.h:73:5: error: there are no arguments to ‘assign’ that depend on a template parameter, so a declaration of ‘assign’ must be available [-fpermissive]
assign(this->expr, rhs.expr);
^~~~~~
make: *** [Makefile:121: GCD.o] Error 1

Versions:
[root@SmallPi Tests]# gcc --version
gcc (GCC) 8.2.0
Copyright (C) 2018 Free Software Foundation, Inc.
This is free software; see the source for copying conditions. There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.
[root@SmallPi Tests]# uname -a
Linux SmallPi 4.14.83-2-ARCH #1 SMP Sat Nov 24 23:09:29 UTC 2018 armv6l GNU/Linux

Request: Busy-loop on kernel execution completion

I would like to be able to busy-loop on kernel completion (much like what the GPU_FFT library does for rather small FFTs), as the mailbox communication latency is much too large when doing DSP.

Question: Processing larger vectors

I would like to see an example where I have, for instance:

SharedArray<float> values(1024);

and process that with N QPUs. I plan to do time domain FIR with this.
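One way to picture the split is to give each QPU every N-th block of 16 elements, using the same stride arithmetic that appears elsewhere on this page (a start offset of me() << 4 and an increment of numQPUs() << 4). The following host-side sketch only computes which blocks a given QPU would touch; it is not a kernel, and the function name blockStartsForQpu is hypothetical.

```cpp
#include <cstddef>
#include <vector>

// Start offsets of the 16-element blocks that QPU `qpuId` would process
// when `numQpus` QPUs share an array of `total` elements.
// QPU q starts at q*16 and then strides by numQpus*16.
std::vector<std::size_t> blockStartsForQpu(std::size_t total,
                                           std::size_t numQpus,
                                           std::size_t qpuId) {
    const std::size_t lanes = 16;  // one QPU vector holds 16 elements
    std::vector<std::size_t> starts;
    if (numQpus == 0) {
        return starts;  // nothing to do without QPUs
    }
    for (std::size_t i = qpuId * lanes; i < total; i += numQpus * lanes) {
        starts.push_back(i);
    }
    return starts;
}
```

For 1024 floats and 4 QPUs, each QPU gets 16 blocks; QPU 1, for example, starts at offsets 16, 80, 144, and so on. A FIR filter would additionally need the neighbouring samples of each block, which this sketch does not address.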

Question: how to transfer values on the regfiles for Special Function Unit

Usage of the Special Function Unit (SFU) is something I would really like to add to QPULib.

Page 23 of the VideoCore Reference tells you how to do it:

Load a given location (e.g. SFU_RECIPSQRT) in the A/B regfile space
Don't read special register r4 for two cycles
Read result from register r4

This is simple in principle, but I would like to know the opcodes/commands to move values between different addresses in the regfiles. Would you mind explaining this to me?

The rest I think I can figure out using the QPULib DSL.

I have to say I'm a bit underwhelmed by the SFU. It appears to deal with one value at a time (as opposed to the regular blocks of 16) and on top of that there is the two-cycle wait. Still, I hope it is of some value.

How many gigaflops?

Looking at the manual, there are 4 slices x 4 QPUs / slice, giving 16 QPUs per GPU.
Each QPU has 2 ALUs, which operate in parallel. With full pipelines, each QPU executes an instruction on each cycle, giving 2 32-bit floating-point operations per cycle, per QPU.
At a clock speed of 250 MHz, there are 250 x 10^6 x 16 x 2 flops - 8 Giga-flops as a GPGPU.
At 16-bit precision, this presumably gives 16 Giga-flops.
I do not understand where 12 Giga-flops comes from.
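The arithmetic in this post can be written down as a small compile-time check. It only reproduces the poster's own assumptions (250 MHz, 16 QPUs, 2 operations per cycle per QPU) and says nothing about where a 12 Giga-flops figure would come from; the function name peakFlops is made up for this sketch.

```cpp
#include <cstdint>

// Peak operations per second = clock rate * number of QPUs * operations per cycle per QPU.
constexpr std::uint64_t peakFlops(std::uint64_t clockHz,
                                  std::uint64_t qpus,
                                  std::uint64_t opsPerCyclePerQpu) {
    return clockHz * qpus * opsPerCyclePerQpu;
}

// The post's figures: 250 MHz, 16 QPUs, 2 ALUs working in parallel.
static_assert(peakFlops(250000000ULL, 16, 2) == 8000000000ULL,
              "8 Giga-flops under the assumptions stated in the post");
```

Doubling the operations per cycle, as the post presumes for 16-bit precision, gives peakFlops(250000000, 16, 4), i.e. 16 Giga-flops.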

Feature Request: Enhancements on DSL

These are things I encountered during kernel development.

Operators

Given:

Int a;
Float b;
Float c;

... the following should work:

c = a+b;       // No operator for Int, Float combination
c = a*b;       // idem
a += 16;       // operator doesn't exist
c += b;        // operator doesn't exist
a = (b < 0);   // Can't assign result BoolExpr to Int

Three of these have alternatives:

c = toFloat(a)+b;
c = toFloat(a)*b;
c = c + b;

...but I personally would truly appreciate it if the initial versions worked. I can sort of understand if you want to have explicit casts, but still.

Conversion of number values to DSL

I would appreciate the possibility to mix constants with DSL variables, for example:

 Int a = 1;
 Float b = 3.14f*radius;   // Also mix in expressions

Related, when using generator functions, it would be nice if a Float/Int can be initialized with a (C++) constant, just like with k.run():

void Generator(Int a, Float b, int c = 0, float d = 12.34f) { ... }
...
Generator(3, 2.78f);
Generator(valA, valB, 1, 24.68f);  // Also, pass in C++ values.

`If` that works like `Where`

I would like to see an If-operator that works like Where:

  Where (x > 10) x++; End

This works per vector element; if a given element satisfies the condition, the if-block is executed. Other elements are untouched.

Currently, Where only allows assignments and expressions. An If that works analogously, and which allows all statements, would be appreciated.
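The per-element behaviour described above can be emulated on the host with an ordinary loop over the 16 lanes. This is a plain C++ sketch of the semantics only, not DSL code; Vec16 and whereGreaterThanIncrement are names invented for the illustration.

```cpp
#include <array>

using Vec16 = std::array<int, 16>;  // stands in for one 16-lane Int vector

// Emulates `Where (x > threshold) x++; End` lane by lane:
// lanes that satisfy the condition are incremented, all others are left untouched.
Vec16 whereGreaterThanIncrement(Vec16 x, int threshold) {
    for (int& lane : x) {
        if (lane > threshold) {
            ++lane;
        }
    }
    return x;
}
```

With lanes holding 0 to 15 and a threshold of 10, lanes 0 to 10 keep their values and lanes 11 to 15 become 12 to 16. The requested If would apply the same masking to arbitrary statements rather than only to assignments.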

QPULib: heap overflow (increase EMULATOR_HEAP_SIZE)

I ran into this problem when running:

Rot3D with 192000 (instead of 19200)
and
HeatMap (Vector Mode)

What's the problem, and how can I fix it?

What is the emulator mode that the message refers to?

Segmentation Fault in computeLiveOut

I'm attempting to implement SHA-256 with QPULib and have encountered a segfault. Backtrace from GDB when compiled for emulation:

Program received signal SIGSEGV, Segmentation fault.
0x000000000800ddfe in computeLiveOut(Seq<SmallSeq<int> >*, Seq<SmallSeq<int> >*, int, SmallSeq<int>*) ()
(gdb) bt
#0  0x000000000800ddfe in computeLiveOut(Seq<SmallSeq<int> >*, Seq<SmallSeq<int> >*, int, SmallSeq<int>*) ()
#1  0x000000000800df9c in liveness(Seq<Instr>*, Seq<SmallSeq<int> >*, Seq<SmallSeq<int> >*) ()
#2  0x000000000800ea18 in regAlloc(Seq<SmallSeq<int> >*, Seq<Instr>*) ()
#3  0x0000000008002142 in compileKernel(Seq<Instr>*, Stmt*) ()
#4  0x0000000008001be6 in Kernel<Ptr<Int>, Ptr<Int> >::Kernel(void (*)(Ptr<Int>, Ptr<Int>)) ()

Here's a minimal version which causes the fault.

#include <iostream>
#include "QPULib.h"

static Int smsigma0(Int x) {
    return ror(x, 7) ^ ror(x, 18) ^ (x >> 3);
}

static Int smsigma1(Int x) {
    return ror(x, 17) ^ ror(x, 19) ^ (x >> 10);
}

void execute_sha256_cpu(Ptr<Int> data, Ptr<Int> hash)
{
    Int W[64];
    Int a, b, c, d, e, f, g, h;

    for (uint32_t i = 0; i < 16; i++)
        W[i] = data[i*16];
    
    for (uint32_t i = 16; i < 64; i++)
        W[i] = smsigma1(W[i-2]) + W[i-7]+ smsigma0(W[i-15]) + W[i-16];
}

int main(int argc, char **argv)
{
    // Compile the function to a QPU kernel k
    auto k = compile(execute_sha256_cpu);

    k.setNumQPUs(1);

    // Allocate and initialise arrays shared between CPU and QPUs
    SharedArray<int> data(16*64), hash(16*64);
    for (uint32_t i = 0; i < 16*64; i++)
    {
        data[i] = 0;
        hash[i] = 0;
    }

    k(&data,&hash);
}

Here's the output of the program when DEBUG is enabled

Source code
===========

v0 = UNIFORM;
v1 = UNIFORM;
v4 = UNIFORM;
v5 = UNIFORM;
v6 = *(v5+(0 << 2));
v7 = *(v5+(16 << 2));
v8 = *(v5+(32 << 2));
v9 = *(v5+(48 << 2));
v10 = *(v5+(64 << 2));
v11 = *(v5+(80 << 2));
v12 = *(v5+(96 << 2));
v13 = *(v5+(112 << 2));
v14 = *(v5+(128 << 2));
v15 = *(v5+(144 << 2));
v16 = *(v5+(160 << 2));
v17 = *(v5+(176 << 2));
v18 = *(v5+(192 << 2));
v19 = *(v5+(208 << 2));
v20 = *(v5+(224 << 2));
v21 = *(v5+(240 << 2));
v78 = v7;
v79 = (((v78 ror 7) ^ (v78 ror 18)) ^ (v78 >> 3));
v80 = v20;
v81 = (((v80 ror 17) ^ (v80 ror 19)) ^ (v80 >> 10));
v22 = (((v81+v15)+v79)+v6);
v82 = v8;
v83 = (((v82 ror 7) ^ (v82 ror 18)) ^ (v82 >> 3));
v84 = v21;
v85 = (((v84 ror 17) ^ (v84 ror 19)) ^ (v84 >> 10));
v23 = (((v85+v16)+v83)+v7);
v86 = v9;
v87 = (((v86 ror 7) ^ (v86 ror 18)) ^ (v86 >> 3));
v88 = v22;
v89 = (((v88 ror 17) ^ (v88 ror 19)) ^ (v88 >> 10));
v24 = (((v89+v17)+v87)+v8);
v90 = v10;
v91 = (((v90 ror 7) ^ (v90 ror 18)) ^ (v90 >> 3));
v92 = v23;
v93 = (((v92 ror 17) ^ (v92 ror 19)) ^ (v92 >> 10));
v25 = (((v93+v18)+v91)+v9);
v94 = v11;
v95 = (((v94 ror 7) ^ (v94 ror 18)) ^ (v94 >> 3));
v96 = v24;
v97 = (((v96 ror 17) ^ (v96 ror 19)) ^ (v96 >> 10));
v26 = (((v97+v19)+v95)+v10);
v98 = v12;
v99 = (((v98 ror 7) ^ (v98 ror 18)) ^ (v98 >> 3));
v100 = v25;
v101 = (((v100 ror 17) ^ (v100 ror 19)) ^ (v100 >> 10));
v27 = (((v101+v20)+v99)+v11);
v102 = v13;
v103 = (((v102 ror 7) ^ (v102 ror 18)) ^ (v102 >> 3));
v104 = v26;
v105 = (((v104 ror 17) ^ (v104 ror 19)) ^ (v104 >> 10));
v28 = (((v105+v21)+v103)+v12);
v106 = v14;
v107 = (((v106 ror 7) ^ (v106 ror 18)) ^ (v106 >> 3));
v108 = v27;
v109 = (((v108 ror 17) ^ (v108 ror 19)) ^ (v108 >> 10));
v29 = (((v109+v22)+v107)+v13);
v110 = v15;
v111 = (((v110 ror 7) ^ (v110 ror 18)) ^ (v110 >> 3));
v112 = v28;
v113 = (((v112 ror 17) ^ (v112 ror 19)) ^ (v112 >> 10));
v30 = (((v113+v23)+v111)+v14);
v114 = v16;
v115 = (((v114 ror 7) ^ (v114 ror 18)) ^ (v114 >> 3));
v116 = v29;
v117 = (((v116 ror 17) ^ (v116 ror 19)) ^ (v116 >> 10));
v31 = (((v117+v24)+v115)+v15);
v118 = v17;
v119 = (((v118 ror 7) ^ (v118 ror 18)) ^ (v118 >> 3));
v120 = v30;
v121 = (((v120 ror 17) ^ (v120 ror 19)) ^ (v120 >> 10));
v32 = (((v121+v25)+v119)+v16);
v122 = v18;
v123 = (((v122 ror 7) ^ (v122 ror 18)) ^ (v122 >> 3));
v124 = v31;
v125 = (((v124 ror 17) ^ (v124 ror 19)) ^ (v124 >> 10));
v33 = (((v125+v26)+v123)+v17);
v126 = v19;
v127 = (((v126 ror 7) ^ (v126 ror 18)) ^ (v126 >> 3));
v128 = v32;
v129 = (((v128 ror 17) ^ (v128 ror 19)) ^ (v128 >> 10));
v34 = (((v129+v27)+v127)+v18);
v130 = v20;
v131 = (((v130 ror 7) ^ (v130 ror 18)) ^ (v130 >> 3));
v132 = v33;
v133 = (((v132 ror 17) ^ (v132 ror 19)) ^ (v132 >> 10));
v35 = (((v133+v28)+v131)+v19);
v134 = v21;
v135 = (((v134 ror 7) ^ (v134 ror 18)) ^ (v134 >> 3));
v136 = v34;
v137 = (((v136 ror 17) ^ (v136 ror 19)) ^ (v136 >> 10));
v36 = (((v137+v29)+v135)+v20);
v138 = v22;
v139 = (((v138 ror 7) ^ (v138 ror 18)) ^ (v138 >> 3));
v140 = v35;
v141 = (((v140 ror 17) ^ (v140 ror 19)) ^ (v140 >> 10));
v37 = (((v141+v30)+v139)+v21);
v142 = v23;
v143 = (((v142 ror 7) ^ (v142 ror 18)) ^ (v142 >> 3));
v144 = v36;
v145 = (((v144 ror 17) ^ (v144 ror 19)) ^ (v144 >> 10));
v38 = (((v145+v31)+v143)+v22);
v146 = v24;
v147 = (((v146 ror 7) ^ (v146 ror 18)) ^ (v146 >> 3));
v148 = v37;
v149 = (((v148 ror 17) ^ (v148 ror 19)) ^ (v148 >> 10));
v39 = (((v149+v32)+v147)+v23);
v150 = v25;
v151 = (((v150 ror 7) ^ (v150 ror 18)) ^ (v150 >> 3));
v152 = v38;
v153 = (((v152 ror 17) ^ (v152 ror 19)) ^ (v152 >> 10));
v40 = (((v153+v33)+v151)+v24);
v154 = v26;
v155 = (((v154 ror 7) ^ (v154 ror 18)) ^ (v154 >> 3));
v156 = v39;
v157 = (((v156 ror 17) ^ (v156 ror 19)) ^ (v156 >> 10));
v41 = (((v157+v34)+v155)+v25);
v158 = v27;
v159 = (((v158 ror 7) ^ (v158 ror 18)) ^ (v158 >> 3));
v160 = v40;
v161 = (((v160 ror 17) ^ (v160 ror 19)) ^ (v160 >> 10));
v42 = (((v161+v35)+v159)+v26);
v162 = v28;
v163 = (((v162 ror 7) ^ (v162 ror 18)) ^ (v162 >> 3));
v164 = v41;
v165 = (((v164 ror 17) ^ (v164 ror 19)) ^ (v164 >> 10));
v43 = (((v165+v36)+v163)+v27);
v166 = v29;
v167 = (((v166 ror 7) ^ (v166 ror 18)) ^ (v166 >> 3));
v168 = v42;
v169 = (((v168 ror 17) ^ (v168 ror 19)) ^ (v168 >> 10));
v44 = (((v169+v37)+v167)+v28);
v170 = v30;
v171 = (((v170 ror 7) ^ (v170 ror 18)) ^ (v170 >> 3));
v172 = v43;
v173 = (((v172 ror 17) ^ (v172 ror 19)) ^ (v172 >> 10));
v45 = (((v173+v38)+v171)+v29);
v174 = v31;
v175 = (((v174 ror 7) ^ (v174 ror 18)) ^ (v174 >> 3));
v176 = v44;
v177 = (((v176 ror 17) ^ (v176 ror 19)) ^ (v176 >> 10));
v46 = (((v177+v39)+v175)+v30);
v178 = v32;
v179 = (((v178 ror 7) ^ (v178 ror 18)) ^ (v178 >> 3));
v180 = v45;
v181 = (((v180 ror 17) ^ (v180 ror 19)) ^ (v180 >> 10));
v47 = (((v181+v40)+v179)+v31);
v182 = v33;
v183 = (((v182 ror 7) ^ (v182 ror 18)) ^ (v182 >> 3));
v184 = v46;
v185 = (((v184 ror 17) ^ (v184 ror 19)) ^ (v184 >> 10));
v48 = (((v185+v41)+v183)+v32);
v186 = v34;
v187 = (((v186 ror 7) ^ (v186 ror 18)) ^ (v186 >> 3));
v188 = v47;
v189 = (((v188 ror 17) ^ (v188 ror 19)) ^ (v188 >> 10));
v49 = (((v189+v42)+v187)+v33);
v190 = v35;
v191 = (((v190 ror 7) ^ (v190 ror 18)) ^ (v190 >> 3));
v192 = v48;
v193 = (((v192 ror 17) ^ (v192 ror 19)) ^ (v192 >> 10));
v50 = (((v193+v43)+v191)+v34);
v194 = v36;
v195 = (((v194 ror 7) ^ (v194 ror 18)) ^ (v194 >> 3));
v196 = v49;
v197 = (((v196 ror 17) ^ (v196 ror 19)) ^ (v196 >> 10));
v51 = (((v197+v44)+v195)+v35);
v198 = v37;
v199 = (((v198 ror 7) ^ (v198 ror 18)) ^ (v198 >> 3));
v200 = v50;
v201 = (((v200 ror 17) ^ (v200 ror 19)) ^ (v200 >> 10));
v52 = (((v201+v45)+v199)+v36);
v202 = v38;
v203 = (((v202 ror 7) ^ (v202 ror 18)) ^ (v202 >> 3));
v204 = v51;
v205 = (((v204 ror 17) ^ (v204 ror 19)) ^ (v204 >> 10));
v53 = (((v205+v46)+v203)+v37);
v206 = v39;
v207 = (((v206 ror 7) ^ (v206 ror 18)) ^ (v206 >> 3));
v208 = v52;
v209 = (((v208 ror 17) ^ (v208 ror 19)) ^ (v208 >> 10));
v54 = (((v209+v47)+v207)+v38);
v210 = v40;
v211 = (((v210 ror 7) ^ (v210 ror 18)) ^ (v210 >> 3));
v212 = v53;
v213 = (((v212 ror 17) ^ (v212 ror 19)) ^ (v212 >> 10));
v55 = (((v213+v48)+v211)+v39);
v214 = v41;
v215 = (((v214 ror 7) ^ (v214 ror 18)) ^ (v214 >> 3));
v216 = v54;
v217 = (((v216 ror 17) ^ (v216 ror 19)) ^ (v216 >> 10));
v56 = (((v217+v49)+v215)+v40);
v218 = v42;
v219 = (((v218 ror 7) ^ (v218 ror 18)) ^ (v218 >> 3));
v220 = v55;
v221 = (((v220 ror 17) ^ (v220 ror 19)) ^ (v220 >> 10));
v57 = (((v221+v50)+v219)+v41);
v222 = v43;
v223 = (((v222 ror 7) ^ (v222 ror 18)) ^ (v222 >> 3));
v224 = v56;
v225 = (((v224 ror 17) ^ (v224 ror 19)) ^ (v224 >> 10));
v58 = (((v225+v51)+v223)+v42);
v226 = v44;
v227 = (((v226 ror 7) ^ (v226 ror 18)) ^ (v226 >> 3));
v228 = v57;
v229 = (((v228 ror 17) ^ (v228 ror 19)) ^ (v228 >> 10));
v59 = (((v229+v52)+v227)+v43);
v230 = v45;
v231 = (((v230 ror 7) ^ (v230 ror 18)) ^ (v230 >> 3));
v232 = v58;
v233 = (((v232 ror 17) ^ (v232 ror 19)) ^ (v232 >> 10));
v60 = (((v233+v53)+v231)+v44);
v234 = v46;
v235 = (((v234 ror 7) ^ (v234 ror 18)) ^ (v234 >> 3));
v236 = v59;
v237 = (((v236 ror 17) ^ (v236 ror 19)) ^ (v236 >> 10));
v61 = (((v237+v54)+v235)+v45);
v238 = v47;
v239 = (((v238 ror 7) ^ (v238 ror 18)) ^ (v238 >> 3));
v240 = v60;
v241 = (((v240 ror 17) ^ (v240 ror 19)) ^ (v240 >> 10));
v62 = (((v241+v55)+v239)+v46);
v242 = v48;
v243 = (((v242 ror 7) ^ (v242 ror 18)) ^ (v242 >> 3));
v244 = v61;
v245 = (((v244 ror 17) ^ (v244 ror 19)) ^ (v244 >> 10));
v63 = (((v245+v56)+v243)+v47);
v246 = v49;
v247 = (((v246 ror 7) ^ (v246 ror 18)) ^ (v246 >> 3));
v248 = v62;
v249 = (((v248 ror 17) ^ (v248 ror 19)) ^ (v248 >> 10));
v64 = (((v249+v57)+v247)+v48);
v250 = v50;
v251 = (((v250 ror 7) ^ (v250 ror 18)) ^ (v250 >> 3));
v252 = v63;
v253 = (((v252 ror 17) ^ (v252 ror 19)) ^ (v252 >> 10));
v65 = (((v253+v58)+v251)+v49);
v254 = v51;
v255 = (((v254 ror 7) ^ (v254 ror 18)) ^ (v254 >> 3));
v256 = v64;
v257 = (((v256 ror 17) ^ (v256 ror 19)) ^ (v256 >> 10));
v66 = (((v257+v59)+v255)+v50);
v258 = v52;
v259 = (((v258 ror 7) ^ (v258 ror 18)) ^ (v258 >> 3));
v260 = v65;
v261 = (((v260 ror 17) ^ (v260 ror 19)) ^ (v260 >> 10));
v67 = (((v261+v60)+v259)+v51);
v262 = v53;
v263 = (((v262 ror 7) ^ (v262 ror 18)) ^ (v262 >> 3));
v264 = v66;
v265 = (((v264 ror 17) ^ (v264 ror 19)) ^ (v264 >> 10));
v68 = (((v265+v61)+v263)+v52);
v266 = v54;
v267 = (((v266 ror 7) ^ (v266 ror 18)) ^ (v266 >> 3));
v268 = v67;
v269 = (((v268 ror 17) ^ (v268 ror 19)) ^ (v268 >> 10));
v69 = (((v269+v62)+v267)+v53);
flush()
If (any(v0==0))
  v270 = (v1-1);
  v271 = 0;
  While (any(v271<v270))
    semaDec(15)
    v271 = (v271+1);
  End
  hostIRQ()
Else
  semaInc(15)
End

Segmentation fault (core dumped)

Question: Should compilation with QPU=1 be blocked on non-RPi platforms?

Out of curiosity, I compiled with make QPU=1 on my 64-bit Intel machine. I got type errors:

Lib/VideoCore/SharedArray.h:121:14: warning: cast to pointer from integer of different size [-Wint-to-pointer-cast]
     gpu_base = (void*) mem_lock(mb, handle);

This shouldn't be working, of course. The question is: how to deal with this?

My personal preference is to block this with an error, indicating that it can only work on RPi (which is 32-bits). So it actually shouldn't be compiling at all. Can you agree with this?

A lesser point is: irrespective of this error, should the compile be possible? This is sort of a moot point, but for completeness' sake it might be worth considering.
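One possible way to turn the warning into a hard error is a compile-time check on the pointer width. The sketch below assumes that hardware builds define QPU_MODE (the define mentioned in the Makefile issue on this page); the constant name kPointersAre32Bit is invented here, and none of this is taken from the library itself.

```cpp
#include <cstdint>

// True when a pointer fits in 32 bits, which is what the cast
// `(void*) mem_lock(mb, handle)` in the warning above appears to rely on.
constexpr bool kPointersAre32Bit = (UINTPTR_MAX <= 0xFFFFFFFFu);

// Hypothetical guard: only enforce the check when building for real QPUs.
// Emulation builds are unaffected and still compile on 64-bit machines.
#if defined(QPU_MODE)
static_assert(kPointersAre32Bit,
              "QPU mode requires a 32-bit target such as the Raspberry Pi");
#endif
```

A check like this blocks the build on the 64-bit Intel machine described above, while leaving the question of detecting an actual Raspberry Pi to a separate runtime test.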

Proposal: Method to detect RPi platform

This answer to a related question looks like a good way to detect if you're running on a RPi: https://raspberrypi.stackexchange.com/a/61071/61753.

I tried this on my two active RPis:

> cat /sys/firmware/devicetree/base/model
Raspberry Pi 3 Model B Rev 1.2

> cat /sys/firmware/devicetree/base/model 
Raspberry Pi Model B Rev 2

(I have a Pi 2 as well, which I could start up again for testing)

Doing this on an Intel machine:

> cat /sys/firmware/devicetree/base/model
cat: /sys/firmware/devicetree/base/model: No such file or directory

What I like about this is that it's file-based and thus easy to implement across all platforms. I would like to hear your thoughts.
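A minimal sketch of such a file-based check in C++ could look as follows. It assumes that the model string always starts with "Raspberry Pi", which holds for the two outputs shown above but is not guaranteed for every board; the function names are hypothetical.

```cpp
#include <fstream>
#include <string>

// Reads the model string from the device tree.
// Returns an empty string if the file does not exist (e.g. on an Intel machine).
std::string readModelString(
        const std::string& modelPath = "/sys/firmware/devicetree/base/model") {
    std::ifstream in(modelPath);
    std::string model;
    if (in) {
        // The device-tree string is NUL-terminated rather than newline-terminated.
        std::getline(in, model, '\0');
    }
    return model;
}

// Assumption: every Raspberry Pi reports a model starting with "Raspberry Pi".
bool looksLikeRaspberryPi(const std::string& model) {
    return model.rfind("Raspberry Pi", 0) == 0;
}
```

Calling looksLikeRaspberryPi(readModelString()) then returns false on machines without the file, since the empty string does not match the prefix.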

Raspberry Pi 4 and VideoCore VI

Is it supported?

I couldn't find very detailed specs on the VideoCore VI, but it seems like it's supposed to be a lot faster.

Trouble extending HeatMap to kernels with more than 3 rows

Hi @mn416 ,

I am trying to use QPULib for a convolutional neural network and adapted the HeatMap example code to perform a convolution. I ran into some interesting problems and tried iteratively changing the HeatMap code to see where the problem was. It seems that creating 5 cursor objects rather than 3, and initializing them to copies of the same area as the original 3 to avoid out-of-bounds issues, gives a different result than simply using 3. The extra rows are still primed, advanced and finished, but their values are never used. Running on the emulator returns the correct result, but running on a Pi gives an incorrect result. The results are incorrect on any number of QPUs, and the result is not the same every time. Interestingly, 4 rows did not work either until I changed the order of gathers and receives in the Cursor advance function. Please let me know if this is an error on my end or something wrong with the library. Thanks!

Request: informal communication channel

@mn416 I would appreciate a method of communicating without having to use the issues here. This is just to keep the issues here unpolluted with social chitchat.

As a suggestion, the vis.js[1] project uses a Gitter Lobby for informal communication. This works quite well, especially because it's coupled with github.

The idea is to have a place for questions, comments, small-talk etc. No direct response is expected (although you then do have the option for a live chat), you can totally respond in your own time.

Hope you find this appealing as well.

I just discovered that I can create a room on gitter myself. If I do so, will you join?

[1] Of which I have been an active maintainer for over a year - I'm still a maintainer officially but don't do work there any more.

Request: Add a Development branch

@mn416 Please add a branch called development to the repo. This allows us to add changes to a new version while keeping the current version stable.

In particular, this shields us from things going awry when PRs have unexpected side effects.

Related, is there a version numbering present? If not, now would be a good time to add it.

Request: allow QPU+emulation mode compilation

The current makefile is set up to compile either in QPU mode or in EMULATION mode. This is handled by the make parameter QPU=1.

I understand from the code and docs that it is possible to compile the two modes together. I would like to have this option available. I therefore request to replace parameter QPU=1 with the following:

No 'mode' parameter defaults to emulation, as now
mode=QPU for QPU mode
mode=EMULATION for emulation mode
mode=DUAL (or perhaps mode=BOTH) for the two modes combined

I would like to hear your thoughts on this. As before, I'm only too happy to implement this myself.

Summation of vector elements using QPULib

Hi,
I used this code that you suggested before, but I get the error below. What is the problem?

Int v;
...
v = v + rotate(8, v);
v = v + rotate(4, v);
v = v + rotate(2, v);
v = v + rotate(1, v);

D.cpp:13:20: error: call of overloaded ‘rotate(int, Int&)’ is ambiguous
v = v + rotate(1, v);
^
In file included from ../Lib/QPULib.h:4:0,
from ID.cpp:1:
../Lib/Source/Int.h:61:9: note: candidate: IntExpr rotate(IntExpr, IntExpr)
IntExpr rotate(IntExpr a, IntExpr b);
^~~~~~
../Lib/Source/Int.h:62:11: note: candidate: FloatExpr rotate(FloatExpr, IntExpr)
FloatExpr rotate(FloatExpr a, IntExpr b);

thank you.
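The two candidates listed in the error both take the vector as the first argument and an IntExpr as the second, which suggests (as an inference from the error text, not something confirmed here) that the call should be written rotate(v, 8) rather than rotate(8, v). Independently of the argument order, the rotate-and-add idea itself can be checked on the host. The sketch below emulates it on a 16-element array; Vec16, rotateLanes, addLanes and sumAllLanes are names made up for the illustration.

```cpp
#include <algorithm>
#include <array>
#include <initializer_list>

using Vec16 = std::array<int, 16>;  // stands in for one 16-lane Int vector

// Rotate the 16 lanes by n positions. The direction does not matter for a sum.
Vec16 rotateLanes(Vec16 v, int n) {
    std::rotate(v.begin(), v.begin() + (n % 16), v.end());
    return v;
}

// Lane-wise addition of two vectors.
Vec16 addLanes(const Vec16& a, const Vec16& b) {
    Vec16 out{};
    for (int i = 0; i < 16; ++i) {
        out[i] = a[i] + b[i];
    }
    return out;
}

// After the four rotate-and-add steps (8, 4, 2, 1), every lane holds the sum of all 16 lanes.
Vec16 sumAllLanes(Vec16 v) {
    for (int step : {8, 4, 2, 1}) {
        v = addLanes(v, rotateLanes(v, step));
    }
    return v;
}
```

Each step doubles the number of original lanes contributing to every element (2, 4, 8, then 16), which is why four steps suffice for a 16-wide vector.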

CMake scripts and emulation on Windows

I made some simple changes to QPULib in this fork to support CMake projects and to get the emulator to run on Windows. Are they worth merging into this repository?

QPULib under Full KMS

On the Raspberry Pi, Full KMS is preferred for controlling the GPU as it gives the kernel control of modesetting and adds extra Mesa/OpenGL features.

However, trying to use QPULib under this configuration results in:

"Unable to enable QPUs. Check your firmware is latest."

The failure appears to be here:
https://github.com/mn416/QPULib/blob/master/Lib/VideoCore/Mailbox.cpp#L202

With Legacy mode enabled, this works fine.

Kernel is:

Linux raspberrypi 4.19.97-v7+ #1294 SMP Thu Jan 30 13:15:58 GMT 2020 armv7l GNU/Linux

with GPU version:

(Feb 12 2020 12:39:27 )
version 53a54c770c493957d99bf49762dfabc4eee00e45 (clean) (release) (start_x)

on a Raspberry Pi 3 Model B.

Is this an unsupported configuration? What, if anything, can we do about this? Thanks for all your hard work on this library, I would love to use it.

Firmware error on RPi 3 Model B

Hi there,

I must say I'm very intrigued by your RPi GPU project. It appears to be well thought out and I'm really hoping to get it working on my itty-bitty RPi computers.

I would like to report that I can't get it to work on my RPi 3 Model B. I get the following error upon execution of any test, even after updating the firmware:

   Unable to enable QPUs. Check your firmware is latest.

I am aware that this is untested; I encountered the following comment in the docs:

It's been tested on the Pi 1 Model B, the Pi 2, but not yet the Pi 3.

So I can hereby inform you that the Pi 3 does not work.

....yet. I would really like to make this work, please advise me how to go about it. I'm perfectly willing to tinker a bit myself.

This is not the end of the story for me; I've got a couple of older-version Pis burrowing around here, and I will catch one and see if I can get QPULib to work on that.

Edit: I looked up the revision number, it's a Pi 3 Model B, not B+.

Proposal: Add tested platforms to documentation

To inspire confidence in potential users, I think it's a good idea to add a list of platforms to the README, to indicate which combinations of platforms have been tested. It would look like this:

Platforms tested on

Pi Model	Revision number	Distribution	gcc version
Model B Rev 2	000d	your distro	your gcc version
Model B Rev 2	000e	Raspbian 9.4 (stretch)	6.3.0
2 Model B Rev 1.1	a01041	Raspbian 9.4 (stretch)	6.3.0
3 Model B Rev 1.2	a0282	Raspbian 9.4 (stretch)	6.3.0
Intel (emulation only)	---	Suse LEAP 42.3	4.8.5

On every release to branch master, all of these should be tested, at the very least by running make QPU=1 test on all listed machines. The Intel machine will run in emulation mode, of course.

Which brings me to the next point: There should be a release procedure. I.e. a list of steps to do on updating master.

Store values in non-contiguous memory

Int ind = index()*2;
store(ind,C + ind );
This doesn't work when I do it; the values are still stored in contiguous memory.

Another not an issue - TensorFlow for the Pi

https://www.tensorflow.org/install/install_raspbian: TensorFlow is now officially supported. I was wondering how difficult it would be to get it to use QPULib, as this would make the Pi Zero possibly the cheapest way to get real crunching done!

Request: change name of dir 'Tests' to 'Examples'

The reason for this is:

I believe the name is a misnomer, the directory contains example programs (perhaps excluding AutoTest).
My intention is to add unit tests in due time, and putting it into any other directory than Tests will be confusing.

Please tell me if you agree to this change, I will make it myself.

General Discussion

This is an issue for discussing general things.

Possible problems with register values

Hi @mn416, I've had a couple of issues which might stem from problems with register values. The first problem is that there are segfaults or invalid frees from just calling compile on my kernel function. I believe this was due to excessive loop unrolling, and changing for(int ... loops into For(Int... loops seems to fix the problem. Another problem I've seen a couple of times is that int values do not seem to keep their values between iterations of a For loop. I'm not sure whether that is intended behavior and whether I should avoid using int values in a For loop over an Int value, so please let me know what the correct usage would be. As an example, the following code writes a value of 1 to every output value:

int val = 0;
For(Int i = 0, i < len, i = i +16)
  val++;
  out[i] = val;
End
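One likely explanation, offered here as an assumption rather than a confirmed diagnosis, is that the kernel function is ordinary C++ that runs once to build the program, so a plain int is updated once at build time and then baked in as a constant, whereas an Int lives on the QPU and changes per iteration. The toy model below illustrates that staging effect in standalone C++; ToyKernel and buildKernel are invented for the illustration and are not part of QPULib.

```cpp
#include <vector>

// Toy model of a staged DSL: the loop body is recorded once,
// and the recorded statements are replayed on every iteration.
struct ToyKernel {
    std::vector<int> recordedConstants;  // constants captured while the body was built

    // Record the statement "out[i] = constant".
    void recordStore(int constant) { recordedConstants.push_back(constant); }

    // Replay the recorded body `iterations` times and collect what was stored.
    std::vector<int> run(int iterations) const {
        std::vector<int> out;
        for (int i = 0; i < iterations; ++i) {
            for (int c : recordedConstants) {
                out.push_back(c);
            }
        }
        return out;
    }
};

// Mirrors the snippet above: the host-side int is incremented exactly once,
// while the kernel is being built, not once per loop iteration.
ToyKernel buildKernel() {
    ToyKernel k;
    int val = 0;
    val++;               // executes once, at build time
    k.recordStore(val);  // the value 1 is now a fixed constant in the kernel
    return k;
}
```

In this model, buildKernel().run(4) stores the value 1 four times. If the same staging applies to QPULib, declaring val as an Int and writing val = val + 1 inside the For loop would be the way to get a value that changes per iteration.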

3D Rotation example: different result each run with vector version 2

Hi, when running the 3D rotation example and tweaking the source to print vertex x[0] y[0] before and after the transformation, I receive different values each time I run the program (vector version 2 is being used).

Tested on a Raspberry Pi 3.

The function I am writing about:
void rot3D(Int n, Float cosTheta, Float sinTheta, Ptr<Float> x, Ptr<Float> y)
{
  // Function index() returns vector <0 1 2 ... 14 15>
  Ptr<Float> p = x + index();
  Ptr<Float> q = y + index();
  // Pre-fetch first two vectors
  gather(p); gather(q);

  Float xOld, yOld; // uninitialised at first
  For (Int i = 0, i < n, i = i+16)
    // Pre-fetch two vectors for the next iteration
    gather(p+16); gather(q+16);
    // Receive vectors for this iteration
    receive(xOld); receive(yOld);
    // Store results
    store(xOld * cosTheta - yOld * sinTheta, p);
    store(yOld * cosTheta + xOld * sinTheta, q);
    p = p+16; q = q+16;
  End
}

Request: Vector of Ptr<Float> as parameter

In order to run an algorithm on batches of vectors, I'd like to be able to send a vector of pointers to arrays. Example:

void gpu_algo(Int n, Int m, Vector<Ptr<Float>> ins, Ptr<Float> out)
{
  Int inc = numQPUs() << 4;
  Float acc = 0;
  For (Int i=0, i < n, i = i +1)
    Ptr<Float> in = ins[i] + index() + (me() << 4);
    gather(in);
    Float r0;
    For (Int j=0, j < m, j += inc)
      gather(in + inc);
      receive(r0);
      ... do stuff ...
      acc = acc + r0;
      in = in + inc;
    End
    store(acc, *out);
    receive(r0);
  End
}

Is this possible?

More QPUs seem slower

Hi, I tried your Tri and Multitri examples, where one uses 1 QPU and the other uses 4 QPUs.
I measured the time taken to run the kernel and found that the one using 4 QPUs took more time. Is this expected?

Proposal: Document for Code Usage Notes

I would appreciate an overall document that lists the special attributes of the library. The goal is to make the code more understandable and to get any potential users up to speed quicker.

@mn416 I would like to hear if you think this is a good idea, and if the format is OK. If so, I'll flesh it out with the items I have pending for it.

Code Usage Notes

This document contains specific things to know, gotchas and limitations of the QPULib library. By being aware of these things, it is hoped that the code will be easier to use for your own purposes.

Function `compile()` is not Thread-Safe

Function compile() is used to compile a kernel from a class generator definition into a format that is runnable on a QPU. This uses global heaps internally for e.g. generating the AST and for storing the resulting statements.

Because the heaps are global, running compile() in parallel on different threads will lead to problems. The result of the compile, however, should be fine, so it's possible to have multiple kernel instances on different threads.

As long as you run compile() on a single thread at a time, you're OK.
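A simple way to follow this rule in a multi-threaded program is to route every compile through one mutex. The sketch below uses a stand-in function, compileStub, in place of the real compile(), and an int counter in place of the global heaps; both, along with guardedCompile and compileMutex, are hypothetical names for illustration only.

```cpp
#include <mutex>

// Stand-in for compile(): it mutates shared state, like the global heaps described above.
int compileStub(int& sharedHeapCounter) {
    return ++sharedHeapCounter;
}

// One process-wide mutex that serialises all compile steps.
std::mutex& compileMutex() {
    static std::mutex m;
    return m;
}

// Wrapper that ensures only one thread is inside the compile step at a time.
int guardedCompile(int& sharedHeapCounter) {
    std::lock_guard<std::mutex> lock(compileMutex());
    return compileStub(sharedHeapCounter);
}
```

Threads would call guardedCompile instead of the compile step directly; once a kernel has been built, running it does not need to hold the lock, in line with the note above that the compile result itself should be fine.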

....<more to come>...

Question: Aren't there 16 QPU's?

In target/Emulator.h, I encountered the following:

#define MAX_QPUS 12

However, the VideoCore IV Reference Guide implies that there are 16 QPU's, in the diagram on page 13 and the text on page 14.

I do have to admit that the text in the guide is a bit vague. From page 14:

QPUs are organized into groups of up to four, termed slices,....

The words 'up to' leave too much room for speculation.

Questions:

How did you determine that the number of QPU's is 12?
Is there a way to detect the number of QPU's at runtime?

In addition, I note that the link to the reference guide on the QPULib page is stale. This is a working link:

https://docs.broadcom.com/docs-and-downloads/docs/support/videocore/VideoCoreIV-AG100-R.pdf

Question: How do you mark a program as threadable?

I encountered the following in the VideoCore IV reference on page 21:

For 3D fragment shader use, each QPU can execute two separate program threads if both the fragment shader programs are marked as threadable.

How can a program be marked as 'threadable'? Are there special instructions for that? I can't find anything about it in the document or on Google.

Issue while porting build to CMake

Hi, I'm working on a fork of QPULib (at https://github.com/robiwano/QPULib), where I plan to add CMake support (I need this for a project that will use QPULib).

But I get a strange error, and I frankly don't understand how QPULib's own Makefile manages to avoid it. Input would be welcome :)

.../SharedArray.h:102:25: error: there are no arguments to ‘getMailbox’ that depend on a template parameter, so a declaration of ‘getMailbox’ must be available [-fpermissive]

Ideas?
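
For context, GCC reports this error when a template body calls an unqualified function whose arguments do not depend on a template parameter and no declaration of that function is visible where the template is defined. The sketch below reproduces the working case in isolation; `SharedArraySketch` and this `getMailbox` are hypothetical stand-ins, not the library's code. One thing worth checking, as a guess rather than a confirmed cause, is whether the Makefile passes a preprocessor define that makes the declaring header visible and the CMake build does not.

```cpp
// Hypothetical stand-in for the real mailbox accessor. The point is only
// that its declaration is visible before the template that calls it.
int getMailbox();

template <typename T>
class SharedArraySketch {
 public:
  // getMailbox() takes no template-dependent arguments, so the compiler
  // looks the name up when the template is defined, not when it is
  // instantiated. Without the declaration above, GCC rejects this line.
  int mailbox() const { return getMailbox(); }
};

// The definition can come later, or live in another translation unit.
int getMailbox() { return 7; }
```

Compiling with -fpermissive, as the message hints, only silences the diagnostic; making the declaration visible is the cleaner fix.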

Request: replace printf() with C++ output streams

The current code is replete with printf() statements, for example in Lib/Target/Pretty.cpp:

void pretty(Instr instr)
{
  switch (instr.tag) {
...
    case ALU:
      if (instr.ALU.cond.tag != ALWAYS) {
        printf("where ");
        pretty(instr.ALU.cond);
        printf(": ");
      }
      pretty(instr.ALU.dest);
      printf(" <-%s ", instr.ALU.setFlags ? "{sf}" : "");
      pretty(instr.ALU.op);
      printf("(");
      pretty(instr.ALU.srcA);
      printf(", ");
      pretty(instr.ALU.srcB);
      printf(")\n");
      return;
...

For more flexibility with respect to output, it would be preferable to write to a stream instead of directly to stdout. The code would then look like:

void pretty(std::ostream &os, Instr instr)
{
  switch (instr.tag) {
...
    case ALU:
      if (instr.ALU.cond.tag != ALWAYS) {
        os << "where ";
        pretty(os, instr.ALU.cond);
        os << ": ";
      }
      pretty(os, instr.ALU.dest);
      os << " <-" << (instr.ALU.setFlags ? "{sf}" : "") << " ";
      pretty(os, instr.ALU.op);
      os << "(";
      pretty(os, instr.ALU.srcA);
      os << ", ";
      pretty(os, instr.ALU.srcB);
      os << ")\n";
      return;
...

The advantage of this is that the output is target-agnostic; how to further handle it is deferred to the caller.
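
To illustrate what "deferred to the caller" buys you, here is a small self-contained sketch. `InstrSketch` is a hypothetical, heavily simplified stand-in for the real `Instr` type, and `prettyToString` is an illustrative helper, not an existing QPULib function.

```cpp
#include <ostream>
#include <sstream>
#include <string>

// Hypothetical, heavily simplified instruction type for illustration only.
struct InstrSketch {
  std::string dest, op, srcA, srcB;
  bool setFlags;
};

// Stream-based printer: it never decides where the text ends up.
void pretty(std::ostream& os, const InstrSketch& instr) {
  os << instr.dest << " <-" << (instr.setFlags ? "{sf}" : "") << " "
     << instr.op << "(" << instr.srcA << ", " << instr.srcB << ")\n";
}

// The caller picks the destination: here the text is captured in a string,
// which is handy for unit tests or for writing to a log file.
std::string prettyToString(const InstrSketch& instr) {
  std::ostringstream oss;
  pretty(oss, instr);
  return oss.str();
}
```

Passing `std::cout` as the first argument would reproduce the current behaviour of printing to stdout.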

As before, if you can agree with this, I would be happy to implement it throughout the code.

Summation of vector elements using QPULib

I want to sum the elements of a vector using QPULib. Is it possible to extract a scalar from a vector, and if so, how can this be implemented? An example would be extremely helpful.
Please help, @mn416.
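
A common technique for this kind of horizontal sum on a 16-lane vector machine is a rotate-and-add reduction: rotate the vector by 1, 2, 4 and 8 lanes, adding after each rotation, so that every lane ends up holding the total. The sketch below emulates that in plain C++ on the CPU. `Vec16`, `rotateLanes` and `sumAllLanes` are hypothetical names, and whether and how this maps onto QPULib's own vector operations is not confirmed here.

```cpp
#include <array>
#include <cstddef>

constexpr std::size_t kLanes = 16;
using Vec16 = std::array<int, kLanes>;

// Emulates a lane rotation: element i moves to lane (i + n) mod 16.
Vec16 rotateLanes(const Vec16& v, std::size_t n) {
  Vec16 out{};
  for (std::size_t i = 0; i < kLanes; ++i) out[(i + n) % kLanes] = v[i];
  return out;
}

// Rotate-and-add reduction: after the steps 1, 2, 4 and 8,
// every lane contains the sum of all 16 input elements.
Vec16 sumAllLanes(Vec16 v) {
  for (std::size_t step = 1; step < kLanes; step *= 2) {
    const Vec16 r = rotateLanes(v, step);
    for (std::size_t i = 0; i < kLanes; ++i) v[i] += r[i];
  }
  return v;
}
```

The scalar result can then be read from any single lane, for example lane 0.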

Can I contribute or should I fork?

@mn416 I would love to contribute to the project. It is, in my opinion, the best thing on GitHub for GPU programming on the Raspberry Pi and I would really like to take it further. I have a personal interest in making GPU programming for the RPi more accessible, and I believe that any changes I make will benefit everybody.

However, I have the impression that there is no active maintenance on QPULib, despite the observation that there was some activity a month ago. In addition, I have had no feedback from you about previous requests on this. I am now tempted to just fork the project and take it from there - but I would really prefer that it stays in its original place.

So, for your understanding, this is what I want to do as a first step:

Raise the makefile 1 level and expand it
Add some documentation for better understanding
Rearrange current documentation
Add doc generation, notably with doxygen
Add unit tests with the CATCH framework

All this is to make it more usable for other developers, me in particular. Note that I do not intend to change how the code works for now, apart perhaps from some rearrangement for use in the unit tests.

I would greatly appreciate some feedback on this. Do you agree that these are worthwhile additions? I would prefer to avoid maintaining a separate fork if at all possible.

Hoping to hear from you.

Feature Request: Runtime setting of QPU timeout

The timeout value on QPU execution is currently determined by the define QPU_TIMEOUT.

It would be more flexible if the timeout could be set at runtime, e.g.:

  k.setTimeout(value);   // k is a compiled kernel
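
As a rough sketch of what this could look like, the compile-time define would become the default for a per-kernel value. `KernelSketch` is a hypothetical stand-in for the real kernel class, and the fallback value of `QPU_TIMEOUT` below is a placeholder for this example only.

```cpp
#ifndef QPU_TIMEOUT
#define QPU_TIMEOUT 10000  // placeholder default for this sketch only
#endif

// Hypothetical stand-in for a compiled kernel; only the timeout
// handling is shown.
class KernelSketch {
 public:
  // Overrides the compile-time default for this kernel instance.
  void setTimeout(int value) { timeout_ = value; }

  // Value the invocation code would consult instead of the define.
  int timeout() const { return timeout_; }

 private:
  int timeout_ = QPU_TIMEOUT;
};
```

Existing code that never calls `setTimeout` would keep the current behaviour.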

Missing Link for QPU doc

The link to the QPU documentation in the Background section of the README is no longer valid.

I believe the link http://www.broadcom.com/docs/support/videocore/VideoCoreIV-AG100-R.pdf has moved to https://docs.broadcom.com/doc/12358545.

I'm not sure this is correct, so please check the link above.
