Intro to GPGPU Programming with CUDA
Rob Gillen
rob.gillenfamily.net | @argodev
Welcome!
Goals:
- Overview of GPGPU with CUDA
- "Vision casting" for how you can use GPUs to improve your application
- Introduction to CUDA C
Outline:
- Why GPGPUs?
- Applications
- Tooling
- Hands-On: Matrix Multiplication
Context Setting
Level of the talk: Introductory/Overview
Perspective of the speaker:
- 12+ years as a professional developer
- 4+ years at Oak Ridge National Laboratory
Disclaimer: Many (most) of these slides are courtesy of NVIDIA Corporation, although they bear no responsibility for inaccuracies I introduce during this presentation.
Why Use GPUs?
Motivation
CPU vs. GPU
The GPU devotes more transistors to data processing: specialized (purpose-designed) silicon.
NVIDIA Fermi
~1.5 TFLOPS (SP) / ~800 GFLOPS (DP)
230 GB/s DRAM bandwidth
Motivation
Floating-point operations per second (FLOPS) and memory bandwidth for the CPU and GPU
Example: Sparse Matrix-Vector
CPU results from "Optimization of Sparse Matrix-Vector Multiplication on Emerging Multicore Platforms", Williams et al., Supercomputing 2007
Rayleigh-Bénard Results
- Double precision
- 384 x 384 x 192 grid (max that fits in 4 GB)
- Vertical slice of temperature at y=0
- Transition from stratified (left) to turbulent (right)
- Regime depends on Rayleigh number: Ra = gαΔT/κν
- 8.5x speedup versus Fortran code running on an 8-core 2.5 GHz Xeon
G80 Characteristics
- 367 GFLOPS peak performance (25-50 times that of current high-end microprocessors)
- 265 GFLOPS sustained for applications such as VMD
- Massively parallel: 128 cores, 90 W
- Massively threaded: sustains thousands of threads per application
- 30-100x speedup over high-end microprocessors on scientific and media applications: medical imaging, molecular dynamics
Supercomputer Comparison
Applications
Exciting applications in the future mass-computing market have traditionally been considered "supercomputing applications": molecular dynamics simulation, video and audio coding and manipulation, 3D imaging and visualization, consumer game physics, and virtual reality products.
These "super-apps" represent and model a physical, concurrent world.
Various granularities of parallelism exist, but:
- the programming model must not hinder parallel implementation
- data delivery needs careful management
*Not* for All Applications
- SPMD (Single Program, Multiple Data) workloads are the best fit (data parallel)
- Operations need to be of sufficient size to overcome overhead
- Think millions of operations
Raytracing
NVIRT: CUDA Ray Tracing API
Tooling
- VS 2010 C++ (Express is OK... sort of)
- NVIDIA CUDA-capable GPU
- NVIDIA CUDA Toolkit (v4+)
- NVIDIA CUDA Tools (v4+)
- GPU Computing SDK
- NVIDIA Parallel Nsight
Parallel Debugging
Parallel Analysis
VS Project Templates
VS Project Templates
Outline of CUDA Basics
- Basic memory management
- Basic kernels and execution on the GPU
- Development resources
See the Programming Guide for the full API.
See the Getting Started Guide for installation and compilation instructions.
Both guides are included in the toolkit.
Memory Spaces
- CPU and GPU have separate memory spaces
- Data is moved across the PCIe bus
- Use functions to allocate/set/copy memory on the GPU; very similar to the corresponding C functions
- Pointers are just addresses: you can't tell from the pointer value whether the address is on the CPU or the GPU
- Must exercise care when dereferencing: dereferencing a CPU pointer on the GPU will likely crash, and the converse is also true
GPU Memory Allocation / Release
Host (CPU) manages device (GPU) memory:
- cudaMalloc(void** pointer, size_t nbytes)
- cudaMemset(void* pointer, int value, size_t count)
- cudaFree(void* pointer)

    int n = 1024;
    int nbytes = 1024 * sizeof(int);
    int* d_a = 0;
    cudaMalloc((void**)&d_a, nbytes);
    cudaMemset(d_a, 0, nbytes);
    cudaFree(d_a);

Data Copies
cudaMemcpy(void* dst, void* src, size_t nbytes, enum cudaMemcpyKind direction);
- Returns after the copy is complete
- Blocks the CPU thread until all bytes have been copied
- Doesn't start copying until previous CUDA calls complete
enum cudaMemcpyKind:
- cudaMemcpyHostToDevice
- cudaMemcpyDeviceToHost
- cudaMemcpyDeviceToDevice
Non-blocking memcopies are also provided.
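A minimal sketch tying the allocation and copy calls together (illustrative host code, not from the deck; error checking omitted):

    // Copy an array of floats to the device and back.
    int n = 1024;
    size_t nbytes = n * sizeof(float);

    float* h_a = (float*)malloc(nbytes);   // host buffer
    float* d_a = 0;                        // device buffer
    cudaMalloc((void**)&d_a, nbytes);

    cudaMemcpy(d_a, h_a, nbytes, cudaMemcpyHostToDevice);  // host -> device
    // ... launch kernels that read/write d_a ...
    cudaMemcpy(h_a, d_a, nbytes, cudaMemcpyDeviceToHost);  // device -> host

    cudaFree(d_a);
    free(h_a);
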
Demo: Code Walkthrough 1
CUDA Programming Model
- Parallel code (a kernel) is launched and executed on a device by many threads
- Threads are grouped into thread blocks
- Parallel code is written for a single thread; each thread is free to execute a unique code path
- Built-in thread and block ID variables
Thread Hierarchy
- Threads launched for a parallel section are partitioned into thread blocks
- Grid == all blocks for a given launch
- A thread block is a group of threads that can synchronize their execution and communicate via shared memory
Threads -> Thread Blocks -> Grid
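As a rough illustration of block-level cooperation (a sketch added here, not from the deck; assumes the kernel is launched with 64 threads per block), threads can stage data in shared memory and synchronize before using it:

    // Reverse each 64-element chunk of d_in within its thread block (sketch).
    __global__ void reverseChunk(int* d_out, const int* d_in)
    {
        __shared__ int tile[64];                 // visible to all threads in the block
        int i = blockIdx.x * blockDim.x + threadIdx.x;

        tile[threadIdx.x] = d_in[i];             // each thread stages one element
        __syncthreads();                         // wait until the whole tile is loaded

        d_out[i] = tile[blockDim.x - 1 - threadIdx.x];
    }
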
Block IDs and Threads
- Threads: 3D IDs, unique within a block
- Blocks: 2D IDs, unique within a grid
- Dimensions are set at launch time and can be unique for each grid
- Built-in variables: threadIdx, blockIdx, blockDim, gridDim
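For example, a kernel working on a 2D array typically combines these built-ins to compute a global (row, column) position (an illustrative sketch; the kernel name and parameters are hypothetical):

    __global__ void scale2D(float* data, int width, int height, float factor)
    {
        int col = blockIdx.x * blockDim.x + threadIdx.x;
        int row = blockIdx.y * blockDim.y + threadIdx.y;

        if (row < height && col < width)      // guard against partial blocks
            data[row * width + col] *= factor;
    }
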
Code Executed on the GPU
- C functions with some restrictions:
  - must return void
  - can only dereference GPU pointers
  - no static variables
  - some additional restrictions for older GPUs
- Must be declared with a qualifier:
  - __global__ : launched by the CPU, cannot be called from the GPU, must return void
  - __device__ : called from other GPU functions, cannot be launched by the CPU
  - __host__ : can be executed only by the CPU
  - __host__ and __device__ qualifiers can be combined
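A small sketch of how the qualifiers combine (the helper names are hypothetical):

    // Usable from both host and device code.
    __host__ __device__ float square(float x) { return x * x; }

    // Device-only helper: callable from kernels and other __device__ functions.
    __device__ float lengthSquared(float x, float y) { return square(x) + square(y); }

    // Entry point launched from the CPU with <<<grid, block>>>; must return void.
    __global__ void computeLengths(const float* xs, const float* ys, float* out)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        out[i] = lengthSquared(xs[i], ys[i]);
    }
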
Demo: Code Walkthrough 2
Launching Kernels on the GPU
Launch parameters:
- Grid dimensions (up to 2D), dim3 type
- Thread-block dimensions (up to 3D), dim3 type
- Shared memory: number of bytes per block, for extern smem variables declared without size (optional, 0 by default)
- Stream ID (optional, 0 by default)

    dim3 grid(16, 16);
    dim3 block(16, 16);
    kernel<<<grid, block, 0, 0>>>(...);
    kernel<<<32, 512>>>(...);

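To illustrate the third launch parameter (a sketch, not from the deck), a kernel can declare an unsized extern __shared__ array and receive its size in bytes at launch time:

    __global__ void useDynamicSmem(const float* d_in, float* d_out)
    {
        extern __shared__ float buffer[];    // size supplied by the launch
        int i = blockIdx.x * blockDim.x + threadIdx.x;

        buffer[threadIdx.x] = d_in[i];
        __syncthreads();
        d_out[i] = buffer[threadIdx.x] * 2.0f;
    }

    // Host side: one float of shared memory per thread in the block.
    // useDynamicSmem<<<grid, block, block.x * sizeof(float), 0>>>(d_in, d_out);
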
Kernel Variations and Output

    __global__ void kernel(int* a)
    {
        int idx = blockIdx.x * blockDim.x + threadIdx.x;
        a[idx] = 7;
    }
    // Output: 7777777777777777

    __global__ void kernel(int* a)
    {
        int idx = blockIdx.x * blockDim.x + threadIdx.x;
        a[idx] = blockIdx.x;
    }
    // Output: 0000111122223333

    __global__ void kernel(int* a)
    {
        int idx = blockIdx.x * blockDim.x + threadIdx.x;
        a[idx] = threadIdx.x;
    }
    // Output: 0123012301230123

Code Walkthrough 3
- Build on Walkthrough 2
- Write a kernel to increment n x m integers
- Copy the result back to the CPU
- Print the values
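A minimal sketch of what such an increment kernel and its launch might look like (hypothetical; the actual walkthrough code may differ):

    __global__ void incrementAll(int* d_a, int n, int m)
    {
        int idx = blockIdx.x * blockDim.x + threadIdx.x;
        if (idx < n * m)          // guard: total threads may exceed n*m
            d_a[idx] += 1;
    }

    // Host side: one thread per element, rounded up to whole blocks.
    // int total = n * m;
    // int threads = 256;
    // int blocks = (total + threads - 1) / threads;
    // incrementAll<<<blocks, threads>>>(d_a, n, m);
    // cudaMemcpy(h_a, d_a, total * sizeof(int), cudaMemcpyDeviceToHost);
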
Demo: Code Walkthrough 3
Blocks Must Be Independent
- Any possible interleaving of blocks should be valid
- Presumed to run to completion without pre-emption
- Can run in any order
- Can run concurrently OR sequentially
- Blocks may coordinate but not synchronize
  - Shared queue pointer: OK
  - Shared lock: BAD ... can easily deadlock
- The independence requirement gives scalability
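A rough sketch of the "shared queue pointer" style of coordination (an illustration, not the deck's code): blocks claim work items with an atomic counter rather than waiting on one another. The counter d_next is assumed to be zero-initialized on the host (e.g. with cudaMemset).

    __global__ void drainQueue(const int* d_work, int workCount,
                               int* d_next, int* d_result)
    {
        // Each thread claims items until the queue is exhausted.
        while (true) {
            int item = atomicAdd(d_next, 1);   // safe across all blocks
            if (item >= workCount)
                break;
            d_result[item] = d_work[item] * 2; // stand-in for real processing
        }
    }
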
Transparent Scalability
- Hardware is free to assign blocks to any processor at any time
- A kernel scales across any number of parallel processors
[Figure: a kernel grid of eight blocks (Block 0 through Block 7) mapped onto two different devices over time; each block can execute in any order relative to other blocks.]
Extended Example: Matrix Multiplication
A Simple Running Example: Matrix Multiplication
A simple matrix multiplication example that illustrates the basic features of memory and thread management in CUDA programs:
- Leave shared memory usage until later
- Local, register usage
- Thread ID usage
- Memory data transfer API between host and device
- Assume square matrices for simplicity
Programming Model: Square Matrix Multiplication Example
- P = M * N of size WIDTH x WIDTH
- Without tiling:
  - One thread calculates one element of P
  - M and N are loaded WIDTH times from global memory
Memory Layout of a Matrix in C
A WIDTH x WIDTH matrix M is stored in row-major order: all of row 0 first, then row 1, and so on, so element (i, j) lives at linear offset i * WIDTH + j.
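In code, that row-major convention is just the index arithmetic used throughout this example (a trivial sketch; the helper is hypothetical):

    // Element (row, col) of a Width x Width matrix stored in row-major order.
    __host__ __device__ float getElement(const float* M, int Width, int row, int col)
    {
        return M[row * Width + col];
    }
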
Simple Matrix Multiplication (CPU)

    void MatrixMulOnHost(float* M, float* N, float* P, int Width)
    {
        for (int i = 0; i < Width; ++i) {
            for (int j = 0; j < Width; ++j) {
                float sum = 0;
                for (int k = 0; k < Width; ++k) {
                    float a = M[i * Width + k];
                    float b = N[k * Width + j];
                    sum += a * b;
                }
                P[i * Width + j] = sum;
            }
        }
    }

Simple Matrix Multiplication (GPU)

    void MatrixMulOnDevice(float* M, float* N, float* P, int Width)
    {
        int size = Width * Width * sizeof(float);
        float *Md, *Nd, *Pd;
        ...

        // 1. Allocate and load M, N to device memory
        cudaMalloc(&Md, size);
        cudaMemcpy(Md, M, size, cudaMemcpyHostToDevice);
        cudaMalloc(&Nd, size);
        cudaMemcpy(Nd, N, size, cudaMemcpyHostToDevice);

        // Allocate P on the device
        cudaMalloc(&Pd, size);

Simple Matrix Multiplication (GPU)

        // 2. Kernel invocation code - to be shown later
        ...

        // 3. Read P from the device
        cudaMemcpy(P, Pd, size, cudaMemcpyDeviceToHost);

        // Free device matrices
        cudaFree(Md);
        cudaFree(Nd);
        cudaFree(Pd);
    }

Kernel Function

    // Matrix multiplication kernel - per thread code
    __global__ void MatrixMulKernel(float* Md, float* Nd, float* Pd, int Width)
    {
        // Pvalue is used to store the element of the matrix
        // that is computed by the thread
        float Pvalue = 0;

Kernel Function (contd.)

        for (int k = 0; k < Width; ++k) {
            float Melement = Md[threadIdx.y * Width + k];
            float Nelement = Nd[k * Width + threadIdx.x];
            Pvalue += Melement * Nelement;
        }
        Pd[threadIdx.y * Width + threadIdx.x] = Pvalue;
    }

Kernel Function (full)

    // Matrix multiplication kernel - per thread code
    __global__ void MatrixMulKernel(float* Md, float* Nd, float* Pd, int Width)
    {
        // Pvalue is used to store the element of the matrix
        // that is computed by the thread
        float Pvalue = 0;
        for (int k = 0; k < Width; ++k) {
            float Melement = Md[threadIdx.y * Width + k];
            float Nelement = Nd[k * Width + threadIdx.x];
            Pvalue += Melement * Nelement;
        }
        Pd[threadIdx.y * Width + threadIdx.x] = Pvalue;
    }

Kernel Invocation (Host Side)

        // Setup the execution configuration
        dim3 dimGrid(1, 1);
        dim3 dimBlock(Width, Width);

        // Launch the device computation threads!
        MatrixMulKernel<<<dimGrid, dimBlock>>>(Md, Nd, Pd, Width);

Only One Thread Block Used
- One block of threads computes matrix Pd; each thread computes one element of Pd
- Each thread:
  - loads a row of matrix Md
  - loads a column of matrix Nd
  - performs one multiply and one addition for each pair of Md and Nd elements
- Compute to off-chip memory access ratio is close to 1:1 (not very high)
- The size of the matrix is limited by the number of threads allowed in a thread block
Handling Arbitrary-Sized Square Matrices
- Have each 2D thread block compute a (TILE_WIDTH)^2 sub-matrix (tile) of the result matrix
- Each block has (TILE_WIDTH)^2 threads
- Generate a 2D grid of (WIDTH/TILE_WIDTH)^2 blocks
Note: you still need to put a loop around the kernel call for cases where WIDTH/TILE_WIDTH is greater than the max grid size (64K)!
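A rough sketch of how the kernel indexing changes once block IDs are involved (illustrative only; assumes TILE_WIDTH divides Width evenly and still uses only global memory):

    #define TILE_WIDTH 16

    __global__ void MatrixMulKernelTiledIndex(float* Md, float* Nd, float* Pd, int Width)
    {
        // Each block computes one TILE_WIDTH x TILE_WIDTH tile of Pd.
        int row = blockIdx.y * TILE_WIDTH + threadIdx.y;
        int col = blockIdx.x * TILE_WIDTH + threadIdx.x;

        float Pvalue = 0;
        for (int k = 0; k < Width; ++k)
            Pvalue += Md[row * Width + k] * Nd[k * Width + col];
        Pd[row * Width + col] = Pvalue;
    }

    // Host-side launch (sketch):
    // dim3 dimBlock(TILE_WIDTH, TILE_WIDTH);
    // dim3 dimGrid(Width / TILE_WIDTH, Width / TILE_WIDTH);
    // MatrixMulKernelTiledIndex<<<dimGrid, dimBlock>>>(Md, Nd, Pd, Width);
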
Small Example
[Figure: a 4 x 4 Pd = Md * Nd computed with TILE_WIDTH = 2. The grid contains four blocks, Block(0,0), Block(1,0), Block(0,1), and Block(1,1), each computing one 2 x 2 tile of Pd.]
Cleanup Topics
- Memory management
  - Pinned memory (zero-transfer)
  - Portable pinned memory
- Multi-GPU
- Wrappers (Python, Java, .NET)
- Kernels
- Atomics
- Thread synchronization (staged reductions)
- NVCC
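As a taste of the first cleanup topic, pinned (page-locked) host memory is allocated with cudaMallocHost and enables faster, asynchronous transfers. A minimal sketch (illustrative values; error checking omitted):

    float* h_pinned = 0;
    float* d_buf = 0;
    size_t nbytes = 1024 * sizeof(float);

    cudaMallocHost((void**)&h_pinned, nbytes);   // page-locked host allocation
    cudaMalloc((void**)&d_buf, nbytes);

    cudaStream_t stream;
    cudaStreamCreate(&stream);

    // Asynchronous copy: returns immediately, can overlap with CPU work.
    cudaMemcpyAsync(d_buf, h_pinned, nbytes, cudaMemcpyHostToDevice, stream);
    // ... independent CPU work here ...
    cudaStreamSynchronize(stream);               // wait for the copy to finish

    cudaStreamDestroy(stream);
    cudaFree(d_buf);
    cudaFreeHost(h_pinned);
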
Questions?
rob@gillenfamily.net
@argodev
http://rob.gillenfamily.net

Editor's Notes

  • #10 Sparse linear algebra is interesting both because many science and engineering codes rely on it, and also because it was traditionally assumed to be something that GPUs would not be good at (because of irregular data access patterns). We have shown that in fact GPUs are extremely good at sparse matrix-vector multiply (SpMV), which is the basic building block of sparse linear algebra. The code and an accompanying white paper are available on the CUDA forums and also posted on research.nvidia.com. This is compared to an extremely well-studied, well-optimized SpMV implementation from a widely respected paper at Supercomputing 2007. That paper only reported double-precision results for CPUs; our single-precision results are even more impressive in comparison.
  • #11 Compared to highly optimized Fortran code from an oceanography researcher at UCLA.
  • #16 The current implementation uses a short-stack approach. Top elements of the stack are cached in registers.
  • #17 RTAPI enables implementation of many different ray-tracing flavors. Left to right, top to bottom: procedural materials, ambient occlusion, Whitted ray tracer (thin-shell glass and metallic spheres), path tracer (Cornell box), refractions, Cook-style distribution ray tracing. Could also do non-rendering work, e.g. GIS (line of sight, say) or physics (collision/proximity detection).