NOTE On 2014-02-28, Broadcom announced the release of full documentation for the VideoCore IV graphics core, and a complete source release of the graphics stack at https://www.raspberrypi.org/blog/a-birthday-present-from-broadcom/. Their release largely obsoletes the information contained here. But we have left it here for historical reasons.
Fun and Games with the Videocoreiv Quad Processor Units
The BCM2835 SoC (System on a Chip) in the RaspberryPi has the following significant computation units:
- ARM1176JZF-S 700 MHz processor which acts as the "main" processor and typically runs Linux.
- Dualcore Videocore IV CPU @250MHz with SIMD Parallel Pixel Units (PPU) which runs scalar (integer and float) and vector (integer only) programs. Runs ThreadX OS, and generally coordinates all functional blocks such as video codecs, power management, video out.
- Image Sensor Pipeline (ISP) providing lens shading, statistics and distortion correction.
- QPU units which provide 24 GFLOPS compute performance for coordinate, vertex and pixel shaders.
Let's focus on the QPU here.
Disclaimer:
There is good precedent in Copyright Law for the following assumption:
- Given, an input I, with copyright holder C(I)
- a program or algorithm A with copyright holder C(A)
- a program output O, given by O=A(I)
that C(O) = C(I), given C(A) injects no artistic work during the operation of A().
We aim to feed different inputs into functions provided by the blob, and
document and analyse the outputs - without violating copyright law.
We will feed various inputs such as shader programs into the blob via entry
points (such as provided by OpenGL ES) and observe the outputs.
At all times we will obey:
"This software may only be used for the purposes of developing for,
running or using a Raspberry Pi device."
as stipulated in the license:
https://github.com/raspberrypi/firmware/blob/master/boot/LICENCE.broadcom.
In recommended reading order:
- US20110227920 Method and System for a Shader Processor With Closely Couple Peripherals
- US20110154307 Method and System for Utilizing Data Flow Graphs to Compile Shaders
- US20110148901 Method and System for Tile Mode Renderer With Coordinate Shader
- US20110216069 Method and System for Compressing Tile Lists Used for 3d Rendering
- US20110221743 Method and System for Controlling a 3d Processor Using a Control List in Memory
- US20110242113 Method and System for Processing Pixels Utilizing Scoreboarding
- US20110261059 Method and System for Decomposing Complex Shapes Into Curvy RHTS For Rasterization
- US20080291208 Method and System for Processing Data Via a 3d Pipeline Coupled to a Generic Video Processing Unit
Notes drawn from the patent applications above:
- Fixed length instruction word of 64 bits.
- Instructions contain multiple issue slots.
- There is a slot for the Add vector ALU and Multiply vector ALU.
- Registers written in one cycle should not be read back for one instruction cycle.
- Branch instructions have 3 delay slots.
- Thread switching is handled by (cooperative) thread switch instructions.
- A program may be terminated by an instruction carrying the program end signal; two delay slots are executed before the unit becomes idle.
- Decoupled memory access operations are: reciprocal, reciprocal square root, logarithm and exponential (US20110227920-0078).
- Accumulators written in one cycle are available immediately in the next instruction (US20110227920-0092).
- Accumulators are a0...a4 or r0...r5 (here) (US20110227920-0094).
- The two register files have 32 registers each (US20110227920-0095), whilst the other 32 register addresses refer to peripheral IO.
- The rotator allows a vector to be rotated by any one of 16 horizontal rotations (US20110227920-0096).
- The unpackers can unpack register data, whilst the packers pack it (US20110227920-0097).
- Support for zero-extension of 8-bit data, sign extension of 16-bit data, convert 16-bit floats to 32-bit floats. (US20110227920-0099).
- Values written to registers (not accumulators) are not available in the next cycle (US20110227920-0103).
- Condition field allows conditional write back of ALU results (US20110227920-0104), and updating of condition flags is optional.
- 32-bit data may be written back to registers/accumulators as an alternative to ALU results (US20110227920-0105).
- Branches have 3-delay slots (no prediction), and may be conditional based on ALU flag bits. Branches may provide link functionality. (US20110227920-0106).
- Instructions may include a signalling field (US20110227920-0107) without costing an additional instruction. Typical uses include tile buffer access, or end of program / thread switch (both with 2 delay slots).
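The read-after-write rule in the notes above (register-file values are not available to the next instruction, accumulators are) can be checked mechanically. The following is a minimal sketch in Python, assuming a hypothetical program representation in which each instruction is a pair of register-name collections (writes, reads). The representation and the names `is_regfile` and `find_raw_hazards` are illustrative only and are not part of any tool in this repository.

```python
import re

# Physical register-file names: ra0..ra31 and rb0..rb31.
# Accumulators (r0..r5) do not match and are therefore exempt.
_REGFILE = re.compile(r"^r[ab](\d+)$")


def is_regfile(name):
    """True for ra0..ra31 / rb0..rb31, False for accumulators and peripherals."""
    m = _REGFILE.match(name)
    return bool(m) and int(m.group(1)) < 32


def find_raw_hazards(program):
    """Return (index, register) pairs where an instruction reads a
    register-file register written by the instruction just before it.

    program is a list of (writes, reads) pairs; this layout is hypothetical.
    """
    hazards = []
    for idx in range(1, len(program)):
        prev_writes = {r for r in program[idx - 1][0] if is_regfile(r)}
        for reg in sorted(prev_writes & set(program[idx][1])):
            hazards.append((idx, reg))
    return hazards


program = [
    ({"ra0"}, {"unif"}),  # mov ra0, unif
    ({"r1"}, {"ra0"}),    # reads ra0 one instruction too early
    ({"r2"}, {"r1"}),     # accumulator r1 may be read immediately
]
print(find_raw_hazards(program))
```

Only the one-instruction gap described in the notes is modelled; delay slots and peripheral addresses are ignored.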
I'm realigning the names with those found in some of the blobs. For instance thread-switch becomes thrsw, gl_FragColor becomes tlbc, min8 becomes v8min and so forth. The main purpose of this change is in the vain hope that if any of the work here becomes useful the names are common for anyone doing this in a commercial context.
<addop><addcc> wa, radda, raddb [setf] ; <mulop><mulcc> wb, rmula, rmulb [setf] ; <op>
<addop><addcc> wb, radda, raddb [setf] ; <mulop><mulcc> wa, rmula, rmulb [setf] ; <op>
radda = ra | ra >> shift | imm6 | rb | r0..r5
raddb = rb | ra >> shift | imm6 | rb | r0..r5
rmula = ra | ra >> shift | imm6 | rb | r0..r5
rmulb = rb | ra >> shift | imm6 | rb | r0..r5
Encoding:
mulop:3 addop:5 ra:6 rb:6 adda:3 addb:3 mula:3 mulb:3, op:4 packbits:8 addcc:3 mulcc:3 F:1 X:1 wa:6 wb:6
mulop:3 addop:5 ra:6 imm:6 adda:3 addb:3 mula:3 mulb:3, 1101 packbits:8 addcc:3 mulcc:3 F:1 X:1 wa:6 wb:6
Where:
addop is the add ALU operation.
00000 nop
00001 fadd rd = ra + rb (floating point addition)
00010 fsub rd = ra - rb (floating point subtraction)
00011 fmin rd = fmin(ra, rb) (floating point minimum)
00100 fmax rd = fmax(ra, rb) (floating point maximum)
00101 fminabs rd = fminabs(ra, rb)
00110 fmaxabs rd = fmaxabs(ra, rb)
00111 ftoi rd = int(rb) (convert float to int)
01000 itof rd = float(rb) (convert int to float)
01001
01010
01011
01100 add rd = ra + rb (integer addition)
01101 sub rd = ra - rb (integer subtraction)
01110 shr rd = ra >>> rb (logical shift right)
01111 asr rd = ra >> rb (arithmetic shift right)
10000 ror rd = ror(ra, rb) (rotate right)
10001 shl rd = ra << rb (logical shift left)
10010 min rd = min(ra, rb) (integer min)
10011 max rd = max(ra, rb) (integer max)
10100 and rd = ra & rb (bitwise and)
10101 or rd = ra | rb (bitwise or, note: or rd, ra, ra is used for mov)
10110 xor rd = ra ^ rb (bitwise xor)
10111 not rd = ~rb (bitwise not)
11000 clz rd = clz(rb) (count leading zeros)
11001
11010
11011
11100
11101
11110 v8adds rd[i] = sat8(ra[i]+rb[i]), i = 0..3 / a..d
11111 v8subs rd[i] = sat8(ra[i]-rb[i]), i = 0..3 / a..d
mulop is the multiplication ALU operation.
000 nop
001 fmul rd = ra * rb
010 mul24
011 v8muld rd[i] = ra[i] * rb[3], i = 0..3 / a..d
100 v8min rd[i] = min(ra[i], rb[i]), i = 0..3 / a..d, (note: v8min rd, ra, ra is used for mov)
101 v8max rd[i] = max(ra[i], rb[i]), i = 0..3 / a..d
110 v8adds rd[i] = sat8(ra[i] + rb[i]), i = 0..3 / a..d
111 v8subs rd[i] = sat8(ra[i] - rb[i]), i = 0..3 / a..d
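The v8adds and v8subs entries describe per-byte saturating arithmetic on the four 8-bit lanes of a 32-bit value. A minimal sketch of that behaviour follows, assuming that sat8 clamps to the unsigned range 0..255; the tables do not state the signedness, so treat that as an assumption. The helper names are hypothetical.

```python
def sat8(value):
    """Clamp to the unsigned 8-bit range (assumed meaning of sat8)."""
    return max(0, min(255, value))


def _lanes(word):
    """Split a 32-bit word into four 8-bit lanes, least significant first."""
    return [(word >> shift) & 0xFF for shift in (0, 8, 16, 24)]


def _join(lanes):
    """Reassemble four 8-bit lanes into a 32-bit word."""
    return sum(lane << (8 * i) for i, lane in enumerate(lanes))


def v8adds(ra, rb):
    """rd[i] = sat8(ra[i] + rb[i]) for each of the four lanes."""
    return _join([sat8(a + b) for a, b in zip(_lanes(ra), _lanes(rb))])


def v8subs(ra, rb):
    """rd[i] = sat8(ra[i] - rb[i]) for each of the four lanes."""
    return _join([sat8(a - b) for a, b in zip(_lanes(ra), _lanes(rb))])


print(hex(v8adds(0x80FF1001, 0x90012002)))  # two upper lanes saturate at 0xff
print(hex(v8subs(0x10203040, 0x20101050)))  # negative lanes clamp to 0
```

Because the operation is lane-wise, the result does not depend on which lane is called a, b, c or d.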
op is the signaling or control flow operation.
0000 bpkt
0001 nop
0010 thrsw thread switch
0011 thrend thread end
0100 sbwait scoreboard wait
0101 sbdone scoreboard done
0110 lthrsw last thread switch
0111 loadcv
1000 loadc load tlb color
1001 ldcend load tlb color and thread end
1010 ldtmu0 load tmu0
1011 ldtmu1 load tmu1
1100 loadam
1101 nop (small constant encoded in field rb)
1110 ldi load immediate
1111 bra branch
(Replacing the following names, thread-switch, thread-end, scoreboard-wait, scoreboard-done, last-thread-switch,
(openvg coverage?), load-gl_FragColor, load-gl_FragColor-and-thread-end, load-tmu0, load-tmu1,
(openvg alpha mask?))
adda, addb encode which accumulator or ra, rb value will be supplied to the add ALU.
mula, mulb encode which accumulator or ra, rb value will be supplied to the multiplication ALU.
000 r0 accumulator 0
001 r1 accumulator 1
010 r2 accumulator 2
011 r3 accumulator 3
100 r4 accumulator 4
101 r5 accumulator 5
110 ra register from bank a
111 rb register from bank b
packbits control the packing/unpacking operation.
Each 32 bit value can be viewed as (a:8, b:8, c:8, d:8) or (a:16, b:16)
uuu0pppp unpack from ra0-31 only, pack to ra0-31 only.
uuu1pppp unpack from r4 only, pack (multiply dst only) to r0-r3, ra0-31 or rb0-31.
uuu unpacking add/mul source (rb)
000 (32) full 32 bit value
001 16a unpack from 16a
010 16b unpack from 16b
011 8dr unpack as 8d replicated, ie (d:8, d:8, d:8, d:8)
100 8a unpack from 8a
101 8b unpack from 8b
110 8c unpack from 8c
111 8d unpack from 8d
0pppp pack add result
0000 (32)
0001 16a
0010 16b
0011 8abcd
0100 8a
0101 8b
0110 8c
0111 8d
1000 s
1001 16as
1010 16bs
1011 8abcds
1100 8as
1101 8bs
1110 8cs
1111 8ds
1pppp pack mul result
0000 (32)
0001
0010
0011 8abcd
0100 8a
0101 8b
0110 8c
0111 8d
1000
1001
1010
1011
1100
1101
1110
1111
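The packbits layout above (three unpack bits, one bit selecting the add or multiply pack table, four pack bits) can be sketched as a small lookup. The function name `decode_packbits` and the returned dictionary are hypothetical. The two sample values are the packbits fields of `mov r0.8a, r0` and `mov ra0.16a, unif` from the qpu-sniff output later in this document.

```python
UNPACK = ["32", "16a", "16b", "8dr", "8a", "8b", "8c", "8d"]
PACK_ADD = ["32", "16a", "16b", "8abcd", "8a", "8b", "8c", "8d",
            "s", "16as", "16bs", "8abcds", "8as", "8bs", "8cs", "8ds"]
# Only these multiply-side pack codes are named in the table above.
PACK_MUL = {0b0000: "32", 0b0011: "8abcd", 0b0100: "8a",
            0b0101: "8b", 0b0110: "8c", 0b0111: "8d"}


def decode_packbits(packbits):
    """Split the 8-bit packbits field into unpack mode, pack unit and pack mode."""
    uuu = (packbits >> 5) & 0x7
    mul_side = (packbits >> 4) & 0x1
    pppp = packbits & 0xF
    if mul_side:
        pack = PACK_MUL.get(pppp)  # None where the table has no name
    else:
        pack = PACK_ADD[pppp]
    return {"unpack": UNPACK[uuu],
            "pack_unit": "mul" if mul_side else "add",
            "pack": pack}


print(decode_packbits(0x14))  # packbits of "mov r0.8a, r0" (multiply side)
print(decode_packbits(0x01))  # packbits of "mov ra0.16a, unif" (add side)
```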
addcc holds the cc predicate for conditional execution of the add instruction.
mulcc holds the cc predicate for conditional execution of the mul instruction.
000 .never never
001 always
010 .zs zero set
011 .zc zero clear
100 .ns negative set
101 .nc negative clear
110 .cs carry set
111 .cc carry clear
F is set to update cc flags (there are Zero, Negative and Carry flags per unit) - SETF
Normally the result of the add operation is used to determine the new cc flags.
If the add operation is a nop, then the result of the multiply operation is used.
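The condition codes above amount to a small predicate over the Zero, Negative and Carry flags. A minimal sketch follows, assuming a single set of flags is passed in; how the flags are derived from a result (in particular Carry) is not covered here. The function names are hypothetical.

```python
CC_SUFFIX = [".never", "", ".zs", ".zc", ".ns", ".nc", ".cs", ".cc"]


def cc_passes(cc, zero, negative, carry):
    """Evaluate a 3-bit addcc/mulcc predicate against one set of flags."""
    outcomes = [False, True, zero, not zero,
                negative, not negative, carry, not carry]
    return bool(outcomes[cc])


def write_back(old, result, cc, zero, negative, carry):
    """Conditional write-back: keep the old value when the predicate fails."""
    return result if cc_passes(cc, zero, negative, carry) else old


# With the Zero flag set, .zs writes the result and .zc keeps the old value.
print(write_back(1, 2, 0b010, True, False, False))
print(write_back(1, 2, 0b011, True, False, False))
```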
X is set to exchange values on the writeback (ie the crossed lines in the diagram).
ra is register bank A value to read.
ra0..ra31 are registers, whilst ra32..ra63 are peripheral addresses.
rb is register bank B value to read.
rb0..rb31 are registers, whilst rb32..rb63 are peripheral addresses.
wa is destination for the add or mul result (depends on X).
ra0..ra31 are registers, whilst ra32..ra63 are peripheral addresses.
wb is destination for the add or mul result (depends on X).
rb0..rb31 are registers, whilst rb32..rb63 are peripheral addresses.
addr    ra         rb         wa         wb
000000  ra00       rb00       ra00       rb00
000001  ra01       rb01       ra01       rb01
000010  ra02       rb02       ra02       rb02
000011  ra03       rb03       ra03       rb03
000100  ra04       rb04       ra04       rb04
000101  ra05       rb05       ra05       rb05
000110  ra06       rb06       ra06       rb06
000111  ra07       rb07       ra07       rb07
001000  ra08       rb08       ra08       rb08
001001  ra09       rb09       ra09       rb09
001010  ra10       rb10       ra10       rb10
001011  ra11       rb11       ra11       rb11
001100  ra12       rb12       ra12       rb12
001101  ra13       rb13       ra13       rb13
001110  ra14       rb14       ra14       rb14
001111  ra15 (w)   rb15 (z)   ra15 (w)   rb15 (z)
010000  ra16       rb16       ra16       rb16
010001  ra17       rb17       ra17       rb17
010010  ra18       rb18       ra18       rb18
010011  ra19       rb19       ra19       rb19
010100  ra20       rb20       ra20       rb20
010101  ra21       rb21       ra21       rb21
010110  ra22       rb22       ra22       rb22
010111  ra23       rb23       ra23       rb23
011000  ra24       rb24       ra24       rb24
011001  ra25       rb25       ra25       rb25
011010  ra26       rb26       ra26       rb26
011011  ra27       rb27       ra27       rb27
011100  ra28       rb28       ra28       rb28
011101  ra29       rb29       ra29       rb29
011110  ra30       rb30       ra30       rb30
011111  ra31       rb31       ra31       rb31
100000  unif       unif       r0         r0
100001  -          -          r1         r1
100010  -          -          r2         r2
100011  vary       vary       r3         r3
100100  -          -          tmurs      tmurs
100101  -          -          r5quad     r5rep
100110  elem_num   qpu_num    irq        irq
100111  (nop)      (nop)      (nop)      (nop)
101000  -          -          unif_addr  unif_addr_rel
101001  x_coord    y_coord    x_coord    y_coord
101010  ms_mask    rev_flag   ms_mask    rev_flag
101011  -          -          stencil    stencil
101100  -          -          tlbz       tlbz
101101  -          -          tlbm       tlbm
101110  -          -          tlbc       tlbc
101111  -          -          tlbam?     tlbam?
110000  vpm        vpm        vpm        vpm
110001  vr_busy    vw_busy    vr_setup   vw_setup
110010  vr_wait    vw_wait    vr_addr    vw_addr
110011  mutex      mutex      mutex      mutex
110100  -          -          recip      recip
110101  -          -          recipsqrt  recipsqrt
110110  -          -          exp        exp
110111  -          -          log        log
111000  -          -          t0s        t0s
111001  -          -          t0t        t0t
111010  -          -          t0r        t0r
111011  -          -          t0b        t0b
111100  -          -          t1s        t1s
111101  -          -          t1t        t1t
111110  -          -          t1r        t1r
111111  -          -          t1b        t1b
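The field layout given under Encoding can be checked against the qpu-sniff output later in this document. The sketch below splits the two 32-bit words of an ALU instruction into the named fields, assuming the first word printed in the dump is the one holding mulop/addop and the second is the one holding op. Decoding the first two dumped instructions this way reproduces their disassembly (`mov r0, unif` is an `or`, and the second instruction carries `sbwait`). The function names are hypothetical, and only raw field values are returned; no attempt is made to resolve X or register names.

```python
ADDOP = {0: "nop", 1: "fadd", 2: "fsub", 3: "fmin", 4: "fmax",
         5: "fminabs", 6: "fmaxabs", 7: "ftoi", 8: "itof",
         12: "add", 13: "sub", 14: "shr", 15: "asr", 16: "ror",
         17: "shl", 18: "min", 19: "max", 20: "and", 21: "or",
         22: "xor", 23: "not", 24: "clz", 30: "v8adds", 31: "v8subs"}
MULOP = ["nop", "fmul", "mul24", "v8muld", "v8min", "v8max", "v8adds", "v8subs"]
OP = ["bpkt", "nop", "thrsw", "thrend", "sbwait", "sbdone", "lthrsw",
      "loadcv", "loadc", "ldcend", "ldtmu0", "ldtmu1", "loadam",
      "nop (small constant in rb)", "ldi", "bra"]

# (name, width) pairs, most significant field first, as in the Encoding line.
FIRST_WORD = [("mulop", 3), ("addop", 5), ("ra", 6), ("rb", 6),
              ("adda", 3), ("addb", 3), ("mula", 3), ("mulb", 3)]
SECOND_WORD = [("op", 4), ("packbits", 8), ("addcc", 3), ("mulcc", 3),
               ("F", 1), ("X", 1), ("wa", 6), ("wb", 6)]


def split_fields(word, layout):
    """Split a 32-bit word into named fields, most significant field first."""
    fields, shift = {}, 32
    for name, width in layout:
        shift -= width
        fields[name] = (word >> shift) & ((1 << width) - 1)
    return fields


def decode_alu(first, second):
    """Decode an ALU instruction; ldi (1110) and bra (1111) use other layouts."""
    fields = split_fields(first, FIRST_WORD)
    fields.update(split_fields(second, SECOND_WORD))
    fields["addop_name"] = ADDOP.get(fields["addop"], "?")
    fields["mulop_name"] = MULOP[fields["mulop"]]
    fields["op_name"] = OP[fields["op"]]
    return fields


print(decode_alu(0x15827d80, 0x10020827))  # mov r0, unif
print(decode_alu(0x01827c00, 0x40020867))  # fadd r1, unif, r0; nop; sbwait
```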
rb - Small constants, active when signal/control operation is 1101:
imm       ra         rb          notes
0 i:5     ra         i           Signed 5 bit immediate
10 i:4    ra         1.0 << i    Shift by signed 4 bit quantity
11 0000   ra >> A5   -
11 d:4    ra >> d    -
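The small-constant table can be read as a decoder for the 6-bit imm field. A minimal sketch follows, under these assumptions: `i:5` is a 5-bit two's complement integer, `1.0 << i` means 1.0 scaled by 2 to the power of a signed 4-bit `i`, and `A5` refers to accumulator r5. The function name and the returned tuples are hypothetical.

```python
def decode_small_imm(imm):
    """Decode the 6-bit small-constant field according to the table above."""
    if imm & 0b100000 == 0:
        # 0 i:5 -> integer immediate, assumed 5-bit two's complement
        i = imm & 0b11111
        return ("int", i - 32 if i >= 16 else i)
    i = imm & 0b1111
    if imm & 0b010000 == 0:
        # 10 i:4 -> 1.0 scaled by 2**i, with i a signed 4-bit quantity
        exponent = i - 16 if i >= 8 else i
        return ("float", 2.0 ** exponent)
    if i == 0:
        # 11 0000 -> rotate amount taken from A5 (assumed to be r5)
        return ("rotate", "r5")
    # 11 d:4 -> rotate by the constant d
    return ("rotate", i)


for code in (0b000101, 0b011111, 0b100001, 0b101111, 0b110000, 0b110011):
    print(format(code, "06b"), decode_small_imm(code))
```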
# Branch absolute to addr+ra, optionally save return address to wa and/or wb.
bra[<cond>] [wa|wb], addr[+ra]
# Branch relative to pc+addr+ra, optionally save return address to wa and/or wb.
brr[<cond>] [wa|wb], addr[+ra]
Encoding:
addr:32, 1111 0000 cond:4 relative:1 register:1 ra:5 X:1 wa:6 wb:6
Where:
addr is the target address
cond is the condition code:
0000 .allz all zero set
0001 .allnz all zero clear
0010 .anyz any zero set
0011 .anynz any zero clear
0100 .alln all negative set
0101 .allnn all negative clear
0110 .anyn any negative set
0111 .anynn any negative clear
1000 .allc all carry set
1001 .allnc all carry clear
1010 .anycs any carry set
1011 .anycc any carry clear
xxxx unknown
relative is set if the target is relative.
register is set if the target should be addr + ra
X, wa, wb are used to write the return address to a register
(ie. branch and link).
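The branch encoding can be unpacked in the same way as the ALU encoding. The sketch below follows the bit widths listed above, taking the address word first and the control word second. The function name is hypothetical, and the sample control word is hand-built for illustration rather than taken from a real shader.

```python
BRANCH_COND = [".allz", ".allnz", ".anyz", ".anynz",
               ".alln", ".allnn", ".anyn", ".anynn",
               ".allc", ".allnc", ".anycs", ".anycc"]


def decode_branch(addr_word, ctrl_word):
    """Decode a branch: addr:32, 1111 0000 cond:4 relative:1 register:1 ra:5 X:1 wa:6 wb:6."""
    if (ctrl_word >> 28) & 0xF != 0b1111:
        raise ValueError("not a branch instruction")
    cond = (ctrl_word >> 20) & 0xF
    return {
        "addr": addr_word,
        "cond": BRANCH_COND[cond] if cond < len(BRANCH_COND) else "unknown",
        "relative": (ctrl_word >> 19) & 0x1,
        "register": (ctrl_word >> 18) & 0x1,
        "ra": (ctrl_word >> 13) & 0x1F,
        "X": (ctrl_word >> 12) & 0x1,
        "wa": (ctrl_word >> 6) & 0x3F,
        "wb": ctrl_word & 0x3F,
    }


# Hand-built example: relative branch, cond 0010, wa = wb = 100111 (nop).
print(decode_branch(0x00000010, 0xF02809E7))
```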
movi[<addcc>] wa, data [setf] ; movi[<mulcc>] wb, data [setf]
movi[<addcc>] wb, data [setf] ; movi[<mulcc>] wa, data [setf]
Encoding:
data:32, 1110 unknown:8 addcc:3 mulcc:3 F:1 X:1 wa:6 wb:6
Where:
data is constant to be loaded. addcc, mulcc, F, X, wa and wb as above.
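A load-immediate instruction can be unpacked along the same lines. The sketch below follows the widths in the encoding above and also shows the 32-bit data reinterpreted as an IEEE 754 float, since the same bit pattern may be loaded for either use. The function names are hypothetical and the sample words are hand-built for illustration.

```python
import struct


def decode_ldi(data_word, ctrl_word):
    """Decode ldi: data:32, 1110 unknown:8 addcc:3 mulcc:3 F:1 X:1 wa:6 wb:6."""
    if (ctrl_word >> 28) & 0xF != 0b1110:
        raise ValueError("not a load-immediate instruction")
    return {
        "data": data_word,
        "unknown": (ctrl_word >> 20) & 0xFF,
        "addcc": (ctrl_word >> 17) & 0x7,
        "mulcc": (ctrl_word >> 14) & 0x7,
        "F": (ctrl_word >> 13) & 0x1,
        "X": (ctrl_word >> 12) & 0x1,
        "wa": (ctrl_word >> 6) & 0x3F,
        "wb": ctrl_word & 0x3F,
    }


def as_float(word):
    """Reinterpret a 32-bit pattern as an IEEE 754 single-precision float."""
    return struct.unpack("<f", struct.pack("<I", word))[0]


# Hand-built example: load 0x3f800000 with wa = 100000 (r0), wb = 100111 (nop).
ldi = decode_ldi(0x3F800000, 0xE0020827)
print(ldi, as_float(ldi["data"]))
```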
Use qpu-sniff from the qpu-sniff directory. In the example output below:
- First fragment is the fragment shader.
- Second fragment is the full vertex shader.
- Third fragment is the coordinate shader (vertex shader only concerned with Vertex positions - used for tiling).
vs/null.vs:
void main(void) {
}
fs/add.fs:
uniform vec4 c1;
uniform vec4 c2;
void main(void) {
    gl_FragColor = c1+c2;
}
('shader code' 18402720 88)
00000000: 15827d80 10020827 mov r0, unif
00000002: 01827c00 40020867 fadd r1, unif, r0; nop; sbwait
00000004: 15827d80 10020827 mov r0, unif
00000006: 01827c00 10020827 fadd r0, unif, r0
00000008: 95827d80 114258a0 mov r2, unif; mov r0.8a, r0
0000000a: 81827c89 11525860 fadd r1, unif, r2; mov r0.8b, r1
0000000c: 95827d89 11625860 mov r1, unif; mov r0.8c, r1
0000000e: 01827c40 10020867 fadd r1, unif, r1
00000010: 809e7009 317059e0 nop; mov r0.8d, r1; thrend
00000012: 159e7000 10020ba7 mov tlbc, r0
00000014: 009e7000 500009e7 nop; nop; sbdone
('shader code' 184027a0 104)
00000000: 15827d80 10120027 mov ra0.16a, unif
00000002: 15827d80 10220027 mov ra0.16b, unif
00000004: 15827d80 10021c67 mov vw_setup, unif
00000006: 15827d80 10020c27 mov vpm, unif
00000008: 15827d80 10020c27 mov vpm, unif
0000000a: 15827d80 10020c27 mov vpm, unif
0000000c: 15827d80 10020c27 mov vpm, unif
0000000e: 95020dbf 10024c20 mov vpm, ra0; mov r0, unif
00000010: 01827c00 10020c27 fadd vpm, unif, r0
00000012: 15827d80 10020c27 mov vpm, unif
00000014: 009e7000 300009e7 nop; nop; thrend
00000016: 009e7000 100009e7 nop
00000018: 009e7000 100009e7 nop
('shader code' 185092a0 72)
00000000: 15827d80 10120027 mov ra0.16a, unif
00000002: 15827d80 10220027 mov ra0.16b, unif
00000004: 15827d80 10021c67 mov vw_setup, unif
00000006: 95020dbf 10024c20 mov vpm, ra0; mov r0, unif
00000008: 01827c00 10020c27 fadd vpm, unif, r0
0000000a: 15827d80 10020c27 mov vpm, unif
0000000c: 009e7000 300009e7 nop; nop; thrend
0000000e: 009e7000 100009e7 nop
00000010: 009e7000 100009e7 nop
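The listings above are regular enough to parse for further analysis. The sketch below reads one fragment into (offset, first word, second word, disassembly) tuples. It assumes that the last number in the `('shader code' ...)` header is the fragment size in bytes and that offsets count 32-bit words; both readings are inferred from the listings (72 bytes for 9 instructions of 8 bytes each, offsets advancing by 2) and are not documented facts. The function name is hypothetical.

```python
import re

HEADER = re.compile(r"^\('shader code' ([0-9a-f]+) (\d+)\)$")
INSTR = re.compile(r"^([0-9a-f]{8}): ([0-9a-f]{8}) ([0-9a-f]{8}) (.*)$")


def parse_dump(text):
    """Return (byte_size, instructions) for one dumped fragment."""
    size, instrs = None, []
    for line in text.splitlines():
        line = line.strip()
        header = HEADER.match(line)
        if header:
            size = int(header.group(2))  # assumed to be the size in bytes
            continue
        m = INSTR.match(line)
        if m:
            offset, first, second = (int(g, 16) for g in m.groups()[:3])
            instrs.append((offset, first, second, m.group(4)))
    return size, instrs


coordinate_shader = """\
('shader code' 185092a0 72)
00000000: 15827d80 10120027 mov ra0.16a, unif
00000002: 15827d80 10220027 mov ra0.16b, unif
00000004: 15827d80 10021c67 mov vw_setup, unif
00000006: 95020dbf 10024c20 mov vpm, ra0; mov r0, unif
00000008: 01827c00 10020c27 fadd vpm, unif, r0
0000000a: 15827d80 10020c27 mov vpm, unif
0000000c: 009e7000 300009e7 nop; nop; thrend
0000000e: 009e7000 100009e7 nop
00000010: 009e7000 100009e7 nop
"""

size, instrs = parse_dump(coordinate_shader)
print(size, len(instrs) * 8)  # both values agree if the size reading is right
```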
Under Raspbian, /opt/vc/bin/vcdbg and /opt/vc/bin/vcgencmd may be used to poke about on the VideoCore side. See https://github.com/nezticle/RaspberryPi-BuildRoot/wiki/VideoCore-Tools for more information.
- Run an OpenGL ES program, and whilst it is running (or ideally paused so the shaders are static):
$ sudo vcdbg reloc
This gives, for example (with the irrelevant entries removed):
[ 23] 0x1c509340: used 160 (refcount 3 lock count 0, size 116, align 4, data 0x1c509360, d0rual) 'EGL_SERVER_SURFACE_T'
[ 24] 0x1c5093e0: used 7.9M (refcount 512 lock count 511, size 8294400, align 4096, data 0x1c50a000, d1Rual) 'KHRN_IMAGE_T.storage'
[ 40] 0x1ccf33e0: used 7.9M (refcount 1 lock count 0, size 8294400, align 4096, data 0x1ccf4000, d1Rual) 'KHRN_IMAGE_T.storage'
[ 39] 0x1d4dd3e0: used 7.9M (refcount 1 lock count 0, size 8294400, align 4096, data 0x1d4de000, d1Rual) 'KHRN_IMAGE_T.storage'
[ 27] 0x1dcc73e0: used 160 (refcount 1 lock count 0, size 118, align 1, data 0x1dcc7400, d0rual) 'mem_strdup'
[ 42] 0x1dcc7480: used 1.2K (refcount 2 lock count 0, size 1200, align 4, data 0x1dcc74a0, d0rual) 'GL20_PROGRAM_T'
[ 26] 0x1dcc7960: used 64 (refcount 1 lock count 0, size 24, align 4, data 0x1dcc7980, d0rual) 'GL20_PROGRAM_T.uniform_data'
[ 25] 0x1dcc79a0: used 640 (refcount 2 lock count 0, size 588, align 4, data 0x1dcc79c0, d0rual) 'GLXX_BUFFER_T'
[ 37] 0x1dcc7c20: used 96 (refcount 1 lock count 0, size 64, align 4, data 0x1dcc7c40, D1rual) 'GLXX_BUFFER_INNER_T.storage'
[ 16] 0x1dcc7c80: used 96 (refcount 1 lock count 0, size 48, align 8, data 0x1dcc7ca0, d1rual) 'shader code'
[ 33] 0x1dcc7ce0: used 256 (refcount 1 lock count 0, size 216, align 8, data 0x1dcc7d00, d1rual) 'shader code'
[ 30] 0x1dcc7de0: used 128 (refcount 1 lock count 0, size 96, align 4, data 0x1dcc7e00, d0rual) 'uniform map'
[ 15] 0x1dcc7e60: used 256 (refcount 1 lock count 0, size 208, align 8, data 0x1dcc7e80, d1rual) 'shader code'
[ 28] 0x1dd381e0: used 3.0K (refcount 1 lock count 0, size 3072, align 4, data 0x1dd38200, d0RUal) 'GLSL_COPY_CONTEXT_T.mh_blob'
[ 29] 0x1dd38e00: used 128 (refcount 1 lock count 0, size 96, align 4, data 0x1dd38e20, d0rual) 'uniform map'
The fragments can then be saved via:
$ sudo vcdbg save shader_code_1 0x1dcc7c80 96
$ sudo vcdbg save shader_code_2 0x1dcc7ce0 256
$ sudo vcdbg save shader_code_3 0x1dcc7e60 256
$ sudo vcdbg save uniform_map_1 0x1dcc7de0 128
$ sudo vcdbg save uniform_map_2 0x1dd38e00 128
$ sudo vcdbg save GL20_PROGRAM_T 0x1dcc7480 1200
$ sudo vcdbg save GL20_PROGRAM_T.uniform_data 0x1dcc7960 64
$ sudo vcdbg save GLXX_BUFFER_T 0x1dcc79a0 640
$ sudo vcdbg save GLXX_BUFFER_INNER_T.storage 0x1dcc7c20 96
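Writing the save commands by hand gets tedious when there are many blocks, so they can be generated from the reloc listing. The sketch below only builds command strings in the `vcdbg save <file> <address> <length>` form used above and does not run anything. It assumes the length is the `used` figure when that is a plain number and falls back to `size` when it is abbreviated (for example 1.2K), which matches the commands shown here. The function name and the numbered file-name scheme are hypothetical.

```python
import re

# One reloc entry: [index] address: used N (... size M, ...) 'name'
ENTRY = re.compile(
    r"\[\s*(\d+)\]\s+(0x[0-9a-f]+): used (\S+) \(.*?size (\d+),.*?\) '([^']+)'"
)


def save_commands(reloc_output, wanted):
    """Build 'vcdbg save' command lines for entries whose name is in wanted."""
    commands, counts = [], {}
    for _index, addr, used, size, name in ENTRY.findall(reloc_output):
        if name not in wanted:
            continue
        counts[name] = counts.get(name, 0) + 1
        length = used if used.isdigit() else size  # "1.2K" is not exact
        filename = "%s_%d" % (name.replace(" ", "_"), counts[name])
        commands.append("sudo vcdbg save %s %s %s" % (filename, addr, length))
    return commands


reloc = """\
[ 42] 0x1dcc7480: used 1.2K (refcount 2 lock count 0, size 1200, align 4, data 0x1dcc74a0, d0rual) 'GL20_PROGRAM_T'
[ 16] 0x1dcc7c80: used 96 (refcount 1 lock count 0, size 48, align 8, data 0x1dcc7ca0, d1rual) 'shader code'
[ 33] 0x1dcc7ce0: used 256 (refcount 1 lock count 0, size 216, align 8, data 0x1dcc7d00, d1rual) 'shader code'
[ 15] 0x1dcc7e60: used 256 (refcount 1 lock count 0, size 208, align 8, data 0x1dcc7e80, d1rual) 'shader code'
"""

for command in save_commands(reloc, {"shader code"}):
    print(command)
```

In practice the listing would come from the output of `sudo vcdbg reloc` captured to a file or a pipe.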
