CODE GPU WITH CUDA
SIMT
NVIDIA GPU ARCHITECTURE
Created by Marina Kolpakova (Itseez) for cuda.geek
OUTLINE
Hardware revisions
SIMT architecture
Warp scheduling
Divergence & convergence
Predicated execution
Conditional execution
OUT OF SCOPE
Computer graphics capabilities
HARDWARE REVISIONS
SM (shader model) – a particular hardware implementation.
Generation   SM      GPU models
Tesla        sm_10   G80 G92(b) G94(b)
             sm_11   G86 G84 G98 G96(b) G94(b) G92(b)
             sm_12   GT218 GT216 GT215
             sm_13   GT200 GT200b
Fermi        sm_20   GF100 GF110
             sm_21   GF104 GF114 GF116 GF108 GF106
Kepler       sm_30   GK104 GK106 GK107
             sm_32   GK20A
             sm_35   GK110 GK208
             sm_37   GK210
Maxwell      sm_50   GM107 GM108
             sm_52   GM204
             sm_53   GM20B
LATENCY VS THROUGHPUT ARCHITECTURES
Modern CPUs and GPUs are both multi-core systems.
CPUs are latency-oriented:
Pipelining, out-of-order execution, superscalar issue
Caching, on-die memory controllers
Speculative execution, branch prediction
Compute cores occupy only a small part of the die
GPUs are throughput-oriented:
100s of simple compute cores
Zero-cost scheduling of 1000s of threads (see the launch sketch after this list)
Compute cores occupy most of the die
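As a minimal illustration (a hypothetical sketch, not from the deck; kernel name and launch sizes are assumptions), a single CUDA launch creates thousands of threads and leaves all of their scheduling to the hardware:

// Hypothetical sketch: one launch creates ~1M threads; the hardware,
// not the OS, schedules them onto the GPU's simple compute cores.
__global__ void scale(float* data, float k, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;  // global thread index
    if (i < n) data[i] *= k;                        // one element per thread
}

// Usage: scale<<<4096, 256>>>(d_data, 2.0f, n);  // 4096 blocks x 256 threads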
SIMD – SIMT – SMT
Single instruction, multiple threads
SIMD: elements of short vectors are processed in parallel. The problem is represented as short vectors and processed vector by vector. Hardware support for wide arithmetic.
SMT: instructions from several threads are run in parallel. The problem is represented as a set of independent tasks that are assigned to different threads. Hardware support for multi-threading.
SIMT = vector processing + light-weight threading:
A warp is the unit of execution: it performs the same instruction for all lanes each cycle and is 32 lanes wide
Thread scheduling and fast context switching between warps minimize stalls (contrasted with SIMD in the sketch below)
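The contrast can be sketched in code (illustrative only; the CPU loop is pseudocode and the kernel name is an assumption):

// SIMD (CPU pseudocode): one instruction processes a short vector.
//   for (i = 0; i < n; i += 4)
//       c[i..i+3] = a[i..i+3] + b[i..i+3];   // 4-wide vector add
//
// SIMT (CUDA): scalar per-thread code; each warp executes 32 lanes
// of it in lockstep, one instruction for the whole warp per cycle.
__global__ void vecAdd(const float* a, const float* b, float* c, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;  // this lane's element
    if (i < n) c[i] = a[i] + b[i];
}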
SIMT
DEPTH OF MULTI-THREADING × WIDTH OF SIMD
1. SIMT is an abstraction over vector hardware:
Threads are grouped into warps (32 threads for NVIDIA)
A thread within a warp is usually called a lane
Vector register file; registers are accessed line by line
A lane reads the laneId-th element of a vector register (see the sketch after this list)
A single program counter (PC) serves the whole warp
Only a couple of special registers, such as the PC, are scalar
2. SIMT hardware is responsible for warp scheduling:
Static for all latest hardware revisions
Zero overhead on context switching
Score-boarding for long-latency operations
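A small sketch of the warp/lane view from CUDA C (the kernel name is an assumption; warpSize is the built-in variable, 32 on NVIDIA hardware):

// Sketch: each thread computes its lane and warp indices. Each lane
// holds its own element of every vector register, so laneId below
// takes 32 different values across one warp's copy of the register.
__global__ void laneInfo(int* out)
{
    int tid    = blockIdx.x * blockDim.x + threadIdx.x;
    int laneId = threadIdx.x % warpSize;   // position within the warp
    int warpId = threadIdx.x / warpSize;   // warp index within the block
    out[tid] = warpId * 100 + laneId;
}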
SASS ISA
SIMT is RISC-like:
Memory instructions are separated from arithmetic ones
Arithmetic is performed only on registers and immediates
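For example, a one-line CUDA statement decomposes into separate memory and arithmetic SASS instructions (a sketch; the mnemonics in the comments are illustrative of Kepler-class SASS, not compiler output from the deck):

// Sketch: y[i] = a * x[i] + y[i] compiles roughly to
//   LDG  R2, [Rx]        -- load x[i] from global memory
//   LDG  R4, [Ry]        -- load y[i] from global memory
//   FFMA R4, R0, R2, R4  -- fused multiply-add on registers only
//   STG  [Ry], R4        -- store the result back to memory
__global__ void axpy(float a, const float* x, float* y, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) y[i] = a * x[i] + y[i];
}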
SIMT PIPELINE
The warp scheduler manages warps and selects those ready to execute
A fetch/decode unit is associated with each warp scheduler
Execution units are SCs (CUDA cores), SFUs (special function units), and LD/ST (load/store) units
Area- and power-efficiency thanks to regularity
VECTOR REGISTER FILE
~Zero-cost warp switching requires a big vector register file (RF)
While a warp is resident on an SM, it occupies a portion of the RF
The GPU's RF is 32-bit; 64-bit values are stored in a register pair (see the sketch below)
Fast switching comes at the cost of registers wasted on items duplicated across warps
Narrow data types are as costly as wide data types
The size of the RF depends on the architecture. Fermi: 128 KB per SM, Kepler: 256 KB per SM, Maxwell: 64 KB per scheduler.
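A sketch of the cost of wide types (the kernel is hypothetical; the nvcc flag is the standard way to print per-thread register usage):

// Sketch: every double below occupies a 32-bit register pair.
// Per-thread register usage can be inspected with:
//   nvcc -arch=sm_35 -Xptxas -v kernel.cu
__global__ void squares(double* out, const double* in, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        double v = in[i];       // held in two 32-bit registers
        out[i]   = v * v + 1.0; // arithmetic operates on the pair
    }
}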
DYNAMIC VS STATIC SCHEDULING
Static scheduling
Instructions are fetched, executed & completed in compiler-generated order
In-order execution
If one instruction stalls, all following instructions stall too (see the sketch after this list)
Dynamic scheduling
Instructions are fetched in compiler-generated order
Instructions are executed out of order
A special unit tracks dependencies and reorders instructions
Independent instructions behind a stalled instruction can pass it
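On statically scheduled hardware it is the compiler that arranges independent instructions to cover latency; a sketch (illustrative kernel, not from the deck):

// Sketch: the two loads are independent, so under static in-order
// scheduling the second issues while the first is still in flight
// (tracked by the scoreboard); the add is the first instruction
// that must wait for both results.
__global__ void twoLoads(const float* a, const float* b, float* c, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        float x = a[i];   // long-latency load
        float y = b[i];   // independent load, issues without waiting
        c[i] = x + y;     // first dependency: may stall here
    }
}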
WARP SCHEDULING
The GigaThread engine subdivides work between SMs
Work for an SM is sent to its warp schedulers
Once assigned, a warp cannot migrate between schedulers
A warp has its own lines in the register file, its own PC, and an activity mask
A warp can be in one of the following states:
Executed - performing an operation
Ready - waiting to be executed
Wait - waiting for resources
Resident - waiting for completion of other warps within the same block
WARP SCHEDULING
Depending on the generation, scheduling is dynamic (Fermi) or static (Kepler, Maxwell)
WARP SCHEDULING (CONT)
Modern warp schedulers support dual issue (sm_21+): an instruction pair is decoded for an active warp per clock
An SM has 2 or 4 warp schedulers, depending on the architecture
Warps belong to blocks; hardware tracks this relation as well
DIVERGENCE & (RE)CONVERGENCE
Divergence: not all lanes in a warp take the same code path
Convergence is handled via a convergence stack
A convergence stack entry includes:
the convergence PC
the next-path PC
a lane mask (marking the active lanes on that path)
The SSY instruction pushes an entry onto the convergence stack; it is issued before potentially divergent instructions
<INSTR>.S marks the convergence point – the instruction after which all lanes in a warp take the same code path again
DIVERGENT CODE EXAMPLE
(void)atomicAdd(&smem[0], src[threadIdx.x]);

/*0050*/  SSY 0x80;
/*0058*/  LDSLK P0, R3, [RZ];
/*0060*/  @P0 IADD R3, R3, R0;
/*0068*/  @P0 STSUL [RZ], R3;
/*0070*/  @!P0 BRA 0x58;
/*0078*/  NOP.S;
Assume warp size == 4. LDSLK loads from shared memory and tries to take a lock, setting predicate P0 on success; lanes that fail (@!P0) branch back to 0x58 and retry, so the lanes serialize through the locked update (STSUL stores and unlocks) until all reach the NOP.S convergence point.
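At the source level, the same divergence mechanism drives an ordinary branch; a sketch (illustrative kernel, not from the deck):

// Sketch: lanes of one warp take different paths. The hardware runs
// each path with the corresponding lane mask, then reconverges at
// the instruction after the if/else (the .S convergence point).
__global__ void diverge(int* out)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (threadIdx.x % 2 == 0)
        out[i] = 1;       // even lanes active on this path
    else
        out[i] = 2;       // odd lanes active on this path
    // all lanes of the warp are active again from here on
}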
PREDICATED & CONDITIONAL EXECUTION
Predicated execution
Frequently used for if-then statements, rarely for if-then-else; the decision is made by a compiler heuristic (see the sketch below)
Reduces divergence overhead
Conditional execution
A compare instruction sets the condition code (CC) register
The CC is a 4-bit state vector (sign, carry, zero, overflow)
No write-back (WB) stage for CC-marked registers
Used on Maxwell to skip unneeded computations in arithmetic operations that are implemented in hardware with multiple instructions
IMAD R8.CC, R0, 0x4, R3;
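A source-level sketch of code that typically compiles to predicated instructions rather than a real branch (illustrative kernel; "typically" because the choice is the compiler heuristic's):

// Sketch: a short if-then body is usually emitted as predicated
// instructions (e.g. "@P0 FADD ..."), so no branch or convergence
// stack entry is needed; inactive lanes simply idle for one slot.
__global__ void clampNeg(float* x, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n && x[i] < 0.0f)   // comparison sets a predicate
        x[i] = 0.0f;            // executed under a @P guard
}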
FINAL WORDS
SIMT is a RISC-like, throughput-oriented architecture
SIMT combines vector processing and light-weight threading
SIMT instructions are executed per warp
A warp has its own PC and activity mask
Branching is handled via divergence, predicated execution, or conditional execution
THE END
by cuda.geek / 2013–2015
