A ScyllaDB Community
GPUs, and How (not) to Program Them
Manya Bansal
Ph.D. Student
Source: Dally, B. (2023, August 29). Hardware for Deep Learning. Hot Chips 35 Symposium.

GPU    Year   Peak TOPS   Features added
K20X   2012   3.04
M40    2015   6.84        FP16, HDP4
V100   2017   125         FP16 Tensor Cores
A100   2020   1238        Structured Sparsity
H100   2022   4000        FP8 Tensor Cores, TMA
GPUs fundamentally shift the CPU
programming model
CPU Execution Model

ADD R0, R0, #1
CMP R0, R1
BNE loop
SUB R2, R1, #2
MOV R7, #1
MOV R0, #0
SWI 0

The program sits under a stack of abstractions: the Compiler and Operating System in software, plus OoO Execution and Caches in hardware.

Can I get an estimate of how much performance was added?
Massively Parallel Architecture

[Diagram: a GPU is a grid of SMs; each SM hosts Blocks; each Block is a group of Warps]
Every lane of a warp executes the same instruction stream:

ADD R0, R0, #1
CMP R0, R1
BNE loop
SUB R2, R1, #2
MOV R7, #1
MOV R0, #0
SWI 0

[Diagram: inside an SM — Warp Scheduler, Register File, Instruction Cache, FP32 units, Tensor Cores, LD/ST units]
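To make the hierarchy concrete, the CUDA runtime can report its dimensions. A minimal host-side sketch (not from the deck; cudaGetDeviceProperties and its fields are the standard runtime API):

#include <cstdio>
#include <cuda_runtime.h>

int main() {
  cudaDeviceProp prop;
  cudaGetDeviceProperties(&prop, /*device=*/0);
  printf("SMs: %d\n", prop.multiProcessorCount);
  printf("warp size: %d lanes\n", prop.warpSize);
  printf("max threads per block: %d\n", prop.maxThreadsPerBlock);
  printf("shared memory per block: %zu bytes\n", (size_t)prop.sharedMemPerBlock);
  return 0;
}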
Deep Compute Hierarchy

GPU → SM → Block → Warp → lane: the compute hierarchy is deep.
Deep Memory Hierarchy

The memory hierarchy mirrors the compute hierarchy: Global Memory serves the whole GPU, Shared Memory serves a block on its SM, and Local Memory is private to each thread.
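All three levels are visible directly in CUDA source. A small illustrative kernel (my sketch, assuming a block of 32 threads) that touches each level:

__global__ void spaces(const float *g_in, float *g_out) {
  __shared__ float s[32];           // shared memory: one copy per block
  float r = g_in[threadIdx.x];      // r lives in a register, private to this thread
  s[threadIdx.x] = r;               // stage the value through block-shared SRAM
  __syncthreads();
  g_out[threadIdx.x] = s[(threadIdx.x + 1) % 32];  // global memory: visible to all
}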
Approximate bandwidths, with the CPU at the far end of the hierarchy:

Registers
Shared Memory / L1 Cache   33 TB/s
L2 Cache                   12 TB/s
Global Memory               3 TB/s
CPU
Software has different responsibilities

On a CPU, the Compiler, Operating System, OoO Execution, and Caches carry most of the burden. On a GPU, much of that burden moves into the Application itself: mapping the program to compute, hiding latency through pipelining and multithreading, and managing SRAM and synchronization by hand. Only the compiler and caches remain underneath.
Programming Languages for GPUs

GPUs can be programmed at different levels of abstraction, from high-level libraries down to PTX.

Our Focus: CUDA
Basically, C++
A Simple Program

[Diagram: each element of a passes through +1 into b]

__global__ void add_1(const float *a, float *b, int N){
  // The loops below are implicit: the hardware runs this body once for
  // every (blockIdx.x, threadIdx.x) pair in the launch.
  //   for(blockIdx.x = 0; blockIdx.x < gridDim.x; blockIdx.x++)
  //     for(threadIdx.x = 0; threadIdx.x < blockDim.x; threadIdx.x++)
  int offset = blockIdx.x * blockDim.x + threadIdx.x;
  a += offset; // Move to the thread's position
  b += offset; // Move to the thread's position
  if (offset < N){
    b[0] = a[0] + 1;
  }
}
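The deck never shows the host side, so here is a minimal launch sketch (the grid/block sizes are my assumptions, not from the talk):

int main() {
  const int N = 1 << 20;
  float *a, *b;
  cudaMalloc(&a, N * sizeof(float));
  cudaMalloc(&b, N * sizeof(float));
  int threads = 256;
  int blocks = (N + threads - 1) / threads; // enough blocks to cover all N elements
  add_1<<<blocks, threads>>>(a, b, N);
  cudaDeviceSynchronize();
  cudaFree(a);
  cudaFree(b);
}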
A Simple Program

Now staged through shared memory with a circular shift: each thread loads its block-neighbor's element into shared memory, the block synchronizes, and each thread reads its own slot.

[Diagram: a → SHARED (circular shift) → b]

__global__ void add_1(const float* a, float* b, int N){
  // Implicit grid/block loops, as before.
  extern __shared__ float smem[];
  int idx = blockIdx.x * blockDim.x + threadIdx.x;
  int tid = threadIdx.x;
  // Load with circular shift
  if (idx < N) {
    int ld_idx = blockIdx.x * blockDim.x
                 + ((tid + 1) % blockDim.x);
    if (ld_idx < N)
      smem[(tid + 1) % blockDim.x] = a[ld_idx];
    __syncthreads(); // note: a barrier inside a divergent branch (more on this below)
    b[idx] = smem[tid] + 1.0f;
  }
}
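Because smem is declared extern __shared__, its size must be supplied as the third launch parameter. A sketch, reusing the assumed launch configuration from above:

size_t smem_bytes = threads * sizeof(float); // one float slot per thread in the block
add_1<<<blocks, threads, smem_bytes>>>(a, b, N);
cudaDeviceSynchronize();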
CUDA programs can get complex
A nonsensical program

The shared-memory kernel above is already on thin ice: its __syncthreads() sits inside the if (idx < N) branch, a barrier that threads outside the branch never reach.
A more nonsensical program

__global__ void foo(const float* a, float* b, int N){
  // Implicit grid/block loops, as before.
  if (threadIdx.x == 0){
    __syncthreads(); // only thread 0 ever arrives at this barrier
  }
  // What now?
}
CUDA programs the machine from the perspective of a single lane… until lanes need to collaborate.
Our first program and our second program, rewritten as __device__ functions, present identical interfaces:

__device__ void add_1(const float *a, float *b, int N); // #1: the per-thread version
__device__ void add_1(const float *a, float *b, int N); // #2: the shared-memory version
Spot the difference

__device__ void add_1(const float *a, float *b, int N); // #1
__device__ void add_1(const float *a, float *b, int N); // #2

Function calls are also per-lane! Behind #1, only per-thread state:

a += offset;
b += offset;
if (offset < N){
  b[0] = a[0] + 1;

Behind #2, shared memory and a block-wide barrier:

if (ld_idx < N)
  smem[(tid + 1) % blockDim.x] = ...
__syncthreads();
b[idx] = smem[tid] + 1.0f;

You cannot trust the implementation from looking at the interface.
Even ; is broken

__global__ void foo(const float* a, float* b, int N){
  if (threadIdx.x == 0){
    __syncthreads();
  }
  // What now?
}

The compute hierarchy is implicit in the program.
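For contrast, a sketch of the repaired pattern (my fix, not from the deck): every thread of the block must reach the same barrier, so the barrier moves out of the divergent branch.

__global__ void foo_fixed(const float* a, float* b, int N){
  if (threadIdx.x == 0){
    // work that only thread 0 should do goes here
  }
  __syncthreads(); // all threads of the block arrive together
}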
Anytime a shared resource is
touched, the abstraction breaks
Missing a mechanism for shared memory management

Shared memory and the L1 cache share the same SRAM, and dynamic shared memory arrives as one raw, untyped allocation. There is no malloc/free! Buffers must be carved out by hand:

__device__ void foo_shared()
{
  extern __shared__ char smem[];
  float* a_smem = (float*)smem;
  float* b_smem = (float*)(smem + A_SIZE * sizeof(float)); // hand-computed byte offset
  …
  foo_local(a_smem, b_smem);
}
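One can imagine what even a minimal mechanism could look like: a hand-rolled bump allocator over the dynamic shared memory region. This is entirely my illustration (bump, A_SIZE, and B_SIZE are hypothetical names, not an existing CUDA API):

__device__ char* bump(char** cursor, size_t bytes)
{
  char* p = *cursor;
  *cursor += (bytes + 15) & ~size_t(15); // advance, keeping 16-byte alignment
  return p;
}

__device__ void foo_shared()
{
  extern __shared__ char smem[];
  char* cursor = smem;
  float* a_smem = (float*)bump(&cursor, A_SIZE * sizeof(float));
  float* b_smem = (float*)bump(&cursor, B_SIZE * sizeof(float));
  foo_local(a_smem, b_smem); // still no free(): everything dies with the block
}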
Memory hierarchy is implicit in the program

__device__ void foo_local(float * a, float * b)
{
  a[0] = threadIdx.x + b[0];
}

Nothing in foo_local's signature says whether a and b live in shared or global memory; foo_shared happens to pass shared-memory pointers, but the types cannot express that.
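The address space can only be recovered at runtime. A hedged sketch using CUDA's address-space predicate intrinsics (assuming __isShared and __isGlobal are available on the target architecture):

__device__ void foo_local(float * a, float * b)
{
  // Runtime check, because the type system carries no address-space information:
  if (__isShared(a)) {
    // a was carved out of the block's shared memory
  } else if (__isGlobal(a)) {
    // a points into global memory
  }
  a[0] = threadIdx.x + b[0];
}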
CUDA programs the machine from the perspective of a single lane… until lanes need to collaborate. No modularity.
Lack of composition has fragmented the ecosystem
A low-level, safe substrate makes it easier to build productive tools
Thank you!
Manya Bansal
manya227@mit.edu
@314Bansal
https://manya-bansal.github.io/
