GPUs are specialized for running an enormous number of small tasks in parallel, while CPUs are optimized for running a few large tasks quickly in sequence. A typical CUDA program follows four steps: 1) allocate memory on the GPU, 2) copy data from the CPU to the GPU, 3) launch kernels on the GPU, and 4) copy the results back to the CPU. Accordingly, GPU performance is measured in terms of throughput (how much work is completed per unit of time) rather than the latency of any individual task.
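
A minimal sketch of this four-step flow, using a hypothetical `square` kernel that squares each element of an array (the kernel and array size are illustrative, not from the original):

```cuda
#include <cuda_runtime.h>
#include <stdio.h>
#include <stdlib.h>

// Hypothetical kernel: each thread squares one element of the array.
__global__ void square(float *data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] = data[i] * data[i];
}

int main(void) {
    const int n = 1 << 20;              // 1M elements (illustrative size)
    const size_t bytes = n * sizeof(float);

    float *h_data = (float *)malloc(bytes);
    for (int i = 0; i < n; ++i) h_data[i] = (float)i;

    // 1) Allocate memory on the GPU.
    float *d_data;
    cudaMalloc(&d_data, bytes);

    // 2) Copy data from the CPU (host) to the GPU (device).
    cudaMemcpy(d_data, h_data, bytes, cudaMemcpyHostToDevice);

    // 3) Launch the kernel: one thread per element, many small tasks in parallel.
    int threadsPerBlock = 256;
    int blocks = (n + threadsPerBlock - 1) / threadsPerBlock;
    square<<<blocks, threadsPerBlock>>>(d_data, n);

    // 4) Copy the results back to the CPU.
    cudaMemcpy(h_data, d_data, bytes, cudaMemcpyDeviceToHost);

    printf("h_data[2] = %f\n", h_data[2]);  // expect 4.0

    cudaFree(d_data);
    free(h_data);
    return 0;
}
```

Compiled with `nvcc`, this launches roughly a million threads at once; the throughput framing above is why that is the natural shape for GPU work, even though any single thread is slower than its CPU counterpart.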