A ScyllaDB Community
GPUs, and How (not) to Program Them
Manya Bansal
Ph.D. Student
Source: Dally, B. (2023, August 29). Hardware for Deep Learning. Hot Chips 35 Symposium.

GPU    Year   Peak TOPS   Features added
K20X   2012   3.04
M40    2015   6.84        FP16, HDP4
V100   2017   125         FP16 Tensor Cores
A100   2020   1238        Structured Sparsity
H100   2022   4000        FP8 Tensor Cores, TMA
GPUs fundamentally shift the CPU
programming model
CPU Execution Model

ADD R0, R0, #1
CMP R0, R1
BNE loop
SUB R2, R1, #2
MOV R7, #1
MOV R0, #0
SWI 0

The program sits under a stack of abstractions: the Compiler and Operating System in software, plus OoO Execution and Caches in hardware.

Can I get an estimate of how much performance was added?
Massively Parallel Architecture

[Diagram: a GPU is a grid of SMs; each SM hosts Blocks; each Block is a group of Warps]
Every lane of a warp executes the same instruction stream:

ADD R0, R0, #1
CMP R0, R1
BNE loop
SUB R2, R1, #2
MOV R7, #1
MOV R0, #0
SWI 0

[Diagram: inside an SM — Warp Scheduler, Register File, Instruction Cache, FP32 units, Tensor Cores, LD/ST units]
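To make the hierarchy concrete, the CUDA runtime can report its dimensions. A minimal host-side sketch (not from the deck; cudaGetDeviceProperties and its fields are the standard runtime API):

#include <cstdio>
#include <cuda_runtime.h>

int main() {
  cudaDeviceProp prop;
  cudaGetDeviceProperties(&prop, /*device=*/0);
  printf("SMs: %d\n", prop.multiProcessorCount);
  printf("warp size: %d lanes\n", prop.warpSize);
  printf("max threads per block: %d\n", prop.maxThreadsPerBlock);
  printf("shared memory per block: %zu bytes\n", (size_t)prop.sharedMemPerBlock);
  return 0;
}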
Deep Compute Hierarchy

GPU → SM → Block → Warp → lane: the compute hierarchy is deep.
Deep Memory Hierarchy

The memory hierarchy mirrors the compute hierarchy: Global Memory serves the whole GPU, Shared Memory serves a block on its SM, and Local Memory is private to each thread.
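All three levels are visible directly in CUDA source. A small illustrative kernel (my sketch, assuming a block of 32 threads) that touches each level:

__global__ void spaces(const float *g_in, float *g_out) {
  __shared__ float s[32];           // shared memory: one copy per block
  float r = g_in[threadIdx.x];      // r lives in a register, private to this thread
  s[threadIdx.x] = r;               // stage the value through block-shared SRAM
  __syncthreads();
  g_out[threadIdx.x] = s[(threadIdx.x + 1) % 32];  // global memory: visible to all
}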
Approximate bandwidths, with the CPU at the far end of the hierarchy:

Registers
Shared Memory / L1 Cache   33 TB/s
L2 Cache                   12 TB/s
Global Memory               3 TB/s
CPU
Software has different responsibilities

On a CPU, the Compiler, Operating System, OoO Execution, and Caches carry most of the burden. On a GPU, much of that burden moves into the Application itself: mapping the program to compute, hiding latency through pipelining and multithreading, and managing SRAM and synchronization by hand. Only the compiler and caches remain underneath.
Programming Languages for GPUs

GPUs can be programmed at different levels of abstraction, from high-level libraries down to PTX.

Our Focus: CUDA
Basically, C++
A Simple Program

[Diagram: each element of a passes through +1 into b]

__global__ void add_1(const float *a, float *b, int N){
  // The loops below are implicit: the hardware runs this body once for
  // every (blockIdx.x, threadIdx.x) pair in the launch.
  //   for(blockIdx.x = 0; blockIdx.x < gridDim.x; blockIdx.x++)
  //     for(threadIdx.x = 0; threadIdx.x < blockDim.x; threadIdx.x++)
  int offset = blockIdx.x * blockDim.x + threadIdx.x;
  a += offset; // Move to the thread's position
  b += offset; // Move to the thread's position
  if (offset < N){
    b[0] = a[0] + 1;
  }
}
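The deck never shows the host side, so here is a minimal launch sketch (the grid/block sizes are my assumptions, not from the talk):

int main() {
  const int N = 1 << 20;
  float *a, *b;
  cudaMalloc(&a, N * sizeof(float));
  cudaMalloc(&b, N * sizeof(float));
  int threads = 256;
  int blocks = (N + threads - 1) / threads; // enough blocks to cover all N elements
  add_1<<<blocks, threads>>>(a, b, N);
  cudaDeviceSynchronize();
  cudaFree(a);
  cudaFree(b);
}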
A Simple Program

Now staged through shared memory with a circular shift: each thread loads its block-neighbor's element into shared memory, the block synchronizes, and each thread reads its own slot.

[Diagram: a → SHARED (circular shift) → b]

__global__ void add_1(const float* a, float* b, int N){
  // Implicit grid/block loops, as before.
  extern __shared__ float smem[];
  int idx = blockIdx.x * blockDim.x + threadIdx.x;
  int tid = threadIdx.x;
  // Load with circular shift
  if (idx < N) {
    int ld_idx = blockIdx.x * blockDim.x
                 + ((tid + 1) % blockDim.x);
    if (ld_idx < N)
      smem[(tid + 1) % blockDim.x] = a[ld_idx];
    __syncthreads(); // note: a barrier inside a divergent branch (more on this below)
    b[idx] = smem[tid] + 1.0f;
  }
}
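Because smem is declared extern __shared__, its size must be supplied as the third launch parameter. A sketch, reusing the assumed launch configuration from above:

size_t smem_bytes = threads * sizeof(float); // one float slot per thread in the block
add_1<<<blocks, threads, smem_bytes>>>(a, b, N);
cudaDeviceSynchronize();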
CUDA programs can get complex
A nonsensical program

The shared-memory kernel above is already on thin ice: its __syncthreads() sits inside the if (idx < N) branch, a barrier that threads outside the branch never reach.
A more nonsensical program

__global__ void foo(const float* a, float* b, int N){
  // Implicit grid/block loops, as before.
  if (threadIdx.x == 0){
    __syncthreads(); // only thread 0 ever arrives at this barrier
  }
  // What now?
}
CUDA programs the machine from the perspective of a single lane… until lanes need to collaborate.
Our first program and our second program, rewritten as __device__ functions, present identical interfaces:

__device__ void add_1(const float *a, float *b, int N); // #1: the per-thread version
__device__ void add_1(const float *a, float *b, int N); // #2: the shared-memory version
Spot the difference

__device__ void add_1(const float *a, float *b, int N); // #1
__device__ void add_1(const float *a, float *b, int N); // #2

Function calls are also per-lane! Behind #1, only per-thread state:

a += offset;
b += offset;
if (offset < N){
  b[0] = a[0] + 1;

Behind #2, shared memory and a block-wide barrier:

if (ld_idx < N)
  smem[(tid + 1) % blockDim.x] = ...
__syncthreads();
b[idx] = smem[tid] + 1.0f;

You cannot trust the implementation from looking at the interface.
Even ; is broken

__global__ void foo(const float* a, float* b, int N){
  if (threadIdx.x == 0){
    __syncthreads();
  }
  // What now?
}

The compute hierarchy is implicit in the program.
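For contrast, a sketch of the repaired pattern (my fix, not from the deck): every thread of the block must reach the same barrier, so the barrier moves out of the divergent branch.

__global__ void foo_fixed(const float* a, float* b, int N){
  if (threadIdx.x == 0){
    // work that only thread 0 should do goes here
  }
  __syncthreads(); // all threads of the block arrive together
}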
Anytime a shared resource is
touched, the abstraction breaks
Missing a mechanism for shared memory management

Shared memory and the L1 cache share the same SRAM, and dynamic shared memory arrives as one raw, untyped allocation. There is no malloc/free! Buffers must be carved out by hand:

__device__ void foo_shared()
{
  extern __shared__ char smem[];
  float* a_smem = (float*)smem;
  float* b_smem = (float*)(smem + A_SIZE * sizeof(float)); // hand-computed byte offset
  …
  foo_local(a_smem, b_smem);
}
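One can imagine what even a minimal mechanism could look like: a hand-rolled bump allocator over the dynamic shared memory region. This is entirely my illustration (bump, A_SIZE, and B_SIZE are hypothetical names, not an existing CUDA API):

__device__ char* bump(char** cursor, size_t bytes)
{
  char* p = *cursor;
  *cursor += (bytes + 15) & ~size_t(15); // advance, keeping 16-byte alignment
  return p;
}

__device__ void foo_shared()
{
  extern __shared__ char smem[];
  char* cursor = smem;
  float* a_smem = (float*)bump(&cursor, A_SIZE * sizeof(float));
  float* b_smem = (float*)bump(&cursor, B_SIZE * sizeof(float));
  foo_local(a_smem, b_smem); // still no free(): everything dies with the block
}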
Memory hierarchy is implicit in the program

__device__ void foo_local(float * a, float * b)
{
  a[0] = threadIdx.x + b[0];
}

Nothing in foo_local's signature says whether a and b live in shared or global memory; foo_shared happens to pass shared-memory pointers, but the types cannot express that.
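The address space can only be recovered at runtime. A hedged sketch using CUDA's address-space predicate intrinsics (assuming __isShared and __isGlobal are available on the target architecture):

__device__ void foo_local(float * a, float * b)
{
  // Runtime check, because the type system carries no address-space information:
  if (__isShared(a)) {
    // a was carved out of the block's shared memory
  } else if (__isGlobal(a)) {
    // a points into global memory
  }
  a[0] = threadIdx.x + b[0];
}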
CUDA programs the machine from the perspective of a single lane… until lanes need to collaborate. No modularity.
Lack of composition has fragmented the ecosystem
A low-level, safe substrate makes it easier to build productive tools
Thank you!
Manya Bansal
manya227@mit.edu
@314Bansal
https://manya-bansal.github.io/
