Intro to GPGPU Programming with CUDA
Rob Gillen
rob.gillenfamily.net | @argodev
Welcome!
Goals:
- Overview of GPGPU with CUDA
- "Vision casting" for how you can use GPUs to improve your application
- Introduction to CUDA C
Outline:
- Why GPGPUs?
- Applications
- Tooling
- Hands-On: Matrix Multiplication
Context Setting
Level of the talk: Introductory/Overview
Perspective of the speaker:
- 12+ years as a professional developer
- 4+ years at Oak Ridge National Laboratory
Disclaimer: Many (most) of these slides are courtesy of NVIDIA Corporation, although they bear no responsibility for inaccuracies I introduce during this presentation.
Why Use GPUs?
Motivation
CPU vs. GPU
The GPU devotes more transistors to data processing: specialized (purpose-designed) silicon.
NVIDIA Fermi
~1.5 TFLOPS (SP) / ~800 GFLOPS (DP)
230 GB/s DRAM bandwidth
Motivation
Floating-point operations per second (FLOPS) and memory bandwidth for the CPU and GPU
Example: Sparse Matrix-Vector
CPU results from "Optimization of Sparse Matrix-Vector Multiplication on Emerging Multicore Platforms", Williams et al., Supercomputing 2007
Rayleigh-Bénard Results
- Double precision
- 384 x 384 x 192 grid (max that fits in 4 GB)
- Vertical slice of temperature at y=0
- Transition from stratified (left) to turbulent (right)
- Regime depends on Rayleigh number: Ra = gαΔT/κν
- 8.5x speedup versus Fortran code running on an 8-core 2.5 GHz Xeon
G80 Characteristics
- 367 GFLOPS peak performance (25-50 times that of current high-end microprocessors)
- 265 GFLOPS sustained for applications such as VMD
- Massively parallel: 128 cores, 90 W
- Massively threaded: sustains thousands of threads per application
- 30-100x speedup over high-end microprocessors on scientific and media applications: medical imaging, molecular dynamics
Supercomputer Comparison
Applications
Exciting applications in the future mass-computing market have traditionally been considered "supercomputing applications": molecular dynamics simulation, video and audio coding and manipulation, 3D imaging and visualization, consumer game physics, and virtual reality products.
These "super-apps" represent and model a physical, concurrent world.
Various granularities of parallelism exist, but:
- the programming model must not hinder parallel implementation
- data delivery needs careful management
*Not* for All Applications
- SPMD (Single Program, Multiple Data) workloads are the best fit (data parallel)
- Operations need to be of sufficient size to overcome overhead
- Think millions of operations
Raytracing
NVIRT: CUDA Ray Tracing API
Tooling
- VS 2010 C++ (Express is OK... sort of)
- NVIDIA CUDA-capable GPU
- NVIDIA CUDA Toolkit (v4+)
- NVIDIA CUDA Tools (v4+)
- GPU Computing SDK
- NVIDIA Parallel Nsight
Parallel Debugging
Parallel Analysis
VS Project Templates
VS Project Templates
Outline of CUDA Basics
- Basic memory management
- Basic kernels and execution on the GPU
- Development resources
See the Programming Guide for the full API.
See the Getting Started Guide for installation and compilation instructions.
Both guides are included in the toolkit.
Memory Spaces
- CPU and GPU have separate memory spaces
- Data is moved across the PCIe bus
- Use functions to allocate/set/copy memory on the GPU; very similar to the corresponding C functions
- Pointers are just addresses: you can't tell from the pointer value whether the address is on the CPU or the GPU
- Must exercise care when dereferencing: dereferencing a CPU pointer on the GPU will likely crash, and the converse is also true
GPU Memory Allocation / Release
Host (CPU) manages device (GPU) memory:
- cudaMalloc(void** pointer, size_t nbytes)
- cudaMemset(void* pointer, int value, size_t count)
- cudaFree(void* pointer)

    int n = 1024;
    int nbytes = 1024 * sizeof(int);
    int* d_a = 0;
    cudaMalloc((void**)&d_a, nbytes);
    cudaMemset(d_a, 0, nbytes);
    cudaFree(d_a);

Data Copies
cudaMemcpy(void* dst, void* src, size_t nbytes, enum cudaMemcpyKind direction);
- Returns after the copy is complete
- Blocks the CPU thread until all bytes have been copied
- Doesn't start copying until previous CUDA calls complete
enum cudaMemcpyKind:
- cudaMemcpyHostToDevice
- cudaMemcpyDeviceToHost
- cudaMemcpyDeviceToDevice
Non-blocking memcopies are also provided.
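A minimal sketch tying the allocation and copy calls together (illustrative host code, not from the deck; error checking omitted):

    // Copy an array of floats to the device and back.
    int n = 1024;
    size_t nbytes = n * sizeof(float);

    float* h_a = (float*)malloc(nbytes);   // host buffer
    float* d_a = 0;                        // device buffer
    cudaMalloc((void**)&d_a, nbytes);

    cudaMemcpy(d_a, h_a, nbytes, cudaMemcpyHostToDevice);  // host -> device
    // ... launch kernels that read/write d_a ...
    cudaMemcpy(h_a, d_a, nbytes, cudaMemcpyDeviceToHost);  // device -> host

    cudaFree(d_a);
    free(h_a);
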
Demo: Code Walkthrough 1
CUDA Programming Model
- Parallel code (a kernel) is launched and executed on a device by many threads
- Threads are grouped into thread blocks
- Parallel code is written for a single thread; each thread is free to execute a unique code path
- Built-in thread and block ID variables
Thread Hierarchy
- Threads launched for a parallel section are partitioned into thread blocks
- Grid == all blocks for a given launch
- A thread block is a group of threads that can synchronize their execution and communicate via shared memory
Threads -> Thread Blocks -> Grid
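As a rough illustration of block-level cooperation (a sketch added here, not from the deck; assumes the kernel is launched with 64 threads per block), threads can stage data in shared memory and synchronize before using it:

    // Reverse each 64-element chunk of d_in within its thread block (sketch).
    __global__ void reverseChunk(int* d_out, const int* d_in)
    {
        __shared__ int tile[64];                 // visible to all threads in the block
        int i = blockIdx.x * blockDim.x + threadIdx.x;

        tile[threadIdx.x] = d_in[i];             // each thread stages one element
        __syncthreads();                         // wait until the whole tile is loaded

        d_out[i] = tile[blockDim.x - 1 - threadIdx.x];
    }
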
Block IDs and Threads
- Threads: 3D IDs, unique within a block
- Blocks: 2D IDs, unique within a grid
- Dimensions are set at launch time and can be unique for each grid
- Built-in variables: threadIdx, blockIdx, blockDim, gridDim
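For example, a kernel working on a 2D array typically combines these built-ins to compute a global (row, column) position (an illustrative sketch; the kernel name and parameters are hypothetical):

    __global__ void scale2D(float* data, int width, int height, float factor)
    {
        int col = blockIdx.x * blockDim.x + threadIdx.x;
        int row = blockIdx.y * blockDim.y + threadIdx.y;

        if (row < height && col < width)      // guard against partial blocks
            data[row * width + col] *= factor;
    }
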
Code Executed on the GPU
- C functions with some restrictions:
  - must return void
  - can only dereference GPU pointers
  - no static variables
  - some additional restrictions for older GPUs
- Must be declared with a qualifier:
  - __global__ : launched by the CPU, cannot be called from the GPU, must return void
  - __device__ : called from other GPU functions, cannot be launched by the CPU
  - __host__ : can be executed only by the CPU
  - __host__ and __device__ qualifiers can be combined
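A small sketch of how the qualifiers combine (the helper names are hypothetical):

    // Usable from both host and device code.
    __host__ __device__ float square(float x) { return x * x; }

    // Device-only helper: callable from kernels and other __device__ functions.
    __device__ float lengthSquared(float x, float y) { return square(x) + square(y); }

    // Entry point launched from the CPU with <<<grid, block>>>; must return void.
    __global__ void computeLengths(const float* xs, const float* ys, float* out)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        out[i] = lengthSquared(xs[i], ys[i]);
    }
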
Demo: Code Walkthrough 2
Launching Kernels on the GPU
Launch parameters:
- Grid dimensions (up to 2D), dim3 type
- Thread-block dimensions (up to 3D), dim3 type
- Shared memory: number of bytes per block, for extern smem variables declared without size (optional, 0 by default)
- Stream ID (optional, 0 by default)

    dim3 grid(16, 16);
    dim3 block(16, 16);
    kernel<<<grid, block, 0, 0>>>(...);
    kernel<<<32, 512>>>(...);

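To illustrate the third launch parameter (a sketch, not from the deck), a kernel can declare an unsized extern __shared__ array and receive its size in bytes at launch time:

    __global__ void useDynamicSmem(const float* d_in, float* d_out)
    {
        extern __shared__ float buffer[];    // size supplied by the launch
        int i = blockIdx.x * blockDim.x + threadIdx.x;

        buffer[threadIdx.x] = d_in[i];
        __syncthreads();
        d_out[i] = buffer[threadIdx.x] * 2.0f;
    }

    // Host side: one float of shared memory per thread in the block.
    // useDynamicSmem<<<grid, block, block.x * sizeof(float), 0>>>(d_in, d_out);
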
Kernel Variations and Output

    __global__ void kernel(int* a)
    {
        int idx = blockIdx.x * blockDim.x + threadIdx.x;
        a[idx] = 7;
    }
    // Output: 7777777777777777

    __global__ void kernel(int* a)
    {
        int idx = blockIdx.x * blockDim.x + threadIdx.x;
        a[idx] = blockIdx.x;
    }
    // Output: 0000111122223333

    __global__ void kernel(int* a)
    {
        int idx = blockIdx.x * blockDim.x + threadIdx.x;
        a[idx] = threadIdx.x;
    }
    // Output: 0123012301230123

Code Walkthrough 3
- Build on Walkthrough 2
- Write a kernel to increment n x m integers
- Copy the result back to the CPU
- Print the values
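A minimal sketch of what such an increment kernel and its launch might look like (hypothetical; the actual walkthrough code may differ):

    __global__ void incrementAll(int* d_a, int n, int m)
    {
        int idx = blockIdx.x * blockDim.x + threadIdx.x;
        if (idx < n * m)          // guard: total threads may exceed n*m
            d_a[idx] += 1;
    }

    // Host side: one thread per element, rounded up to whole blocks.
    // int total = n * m;
    // int threads = 256;
    // int blocks = (total + threads - 1) / threads;
    // incrementAll<<<blocks, threads>>>(d_a, n, m);
    // cudaMemcpy(h_a, d_a, total * sizeof(int), cudaMemcpyDeviceToHost);
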
Demo: Code Walkthrough 3
Blocks Must Be Independent
- Any possible interleaving of blocks should be valid
- Presumed to run to completion without pre-emption
- Can run in any order
- Can run concurrently OR sequentially
- Blocks may coordinate but not synchronize
  - Shared queue pointer: OK
  - Shared lock: BAD ... can easily deadlock
- The independence requirement gives scalability
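A rough sketch of the "shared queue pointer" style of coordination (an illustration, not the deck's code): blocks claim work items with an atomic counter rather than waiting on one another. The counter d_next is assumed to be zero-initialized on the host (e.g. with cudaMemset).

    __global__ void drainQueue(const int* d_work, int workCount,
                               int* d_next, int* d_result)
    {
        // Each thread claims items until the queue is exhausted.
        while (true) {
            int item = atomicAdd(d_next, 1);   // safe across all blocks
            if (item >= workCount)
                break;
            d_result[item] = d_work[item] * 2; // stand-in for real processing
        }
    }
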
Transparent Scalability
- Hardware is free to assign blocks to any processor at any time
- A kernel scales across any number of parallel processors
[Figure: a kernel grid of eight blocks (Block 0 through Block 7) mapped onto two different devices over time; each block can execute in any order relative to other blocks.]
Extended Example: Matrix Multiplication
A Simple Running Example: Matrix Multiplication
A simple matrix multiplication example that illustrates the basic features of memory and thread management in CUDA programs:
- Leave shared memory usage until later
- Local, register usage
- Thread ID usage
- Memory data transfer API between host and device
- Assume square matrices for simplicity
Programming Model: Square Matrix Multiplication Example
- P = M * N of size WIDTH x WIDTH
- Without tiling:
  - One thread calculates one element of P
  - M and N are loaded WIDTH times from global memory
Memory Layout of a Matrix in C
A WIDTH x WIDTH matrix M is stored in row-major order: all of row 0 first, then row 1, and so on, so element (i, j) lives at linear offset i * WIDTH + j.
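In code, that row-major convention is just the index arithmetic used throughout this example (a trivial sketch; the helper is hypothetical):

    // Element (row, col) of a Width x Width matrix stored in row-major order.
    __host__ __device__ float getElement(const float* M, int Width, int row, int col)
    {
        return M[row * Width + col];
    }
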
Simple Matrix Multiplication (CPU)

    void MatrixMulOnHost(float* M, float* N, float* P, int Width)
    {
        for (int i = 0; i < Width; ++i) {
            for (int j = 0; j < Width; ++j) {
                float sum = 0;
                for (int k = 0; k < Width; ++k) {
                    float a = M[i * Width + k];
                    float b = N[k * Width + j];
                    sum += a * b;
                }
                P[i * Width + j] = sum;
            }
        }
    }

Simple Matrix Multiplication (GPU)

    void MatrixMulOnDevice(float* M, float* N, float* P, int Width)
    {
        int size = Width * Width * sizeof(float);
        float *Md, *Nd, *Pd;
        ...

        // 1. Allocate and load M, N to device memory
        cudaMalloc(&Md, size);
        cudaMemcpy(Md, M, size, cudaMemcpyHostToDevice);
        cudaMalloc(&Nd, size);
        cudaMemcpy(Nd, N, size, cudaMemcpyHostToDevice);

        // Allocate P on the device
        cudaMalloc(&Pd, size);

Simple Matrix Multiplication (GPU)

        // 2. Kernel invocation code - to be shown later
        ...

        // 3. Read P from the device
        cudaMemcpy(P, Pd, size, cudaMemcpyDeviceToHost);

        // Free device matrices
        cudaFree(Md);
        cudaFree(Nd);
        cudaFree(Pd);
    }

Kernel Function

    // Matrix multiplication kernel - per thread code
    __global__ void MatrixMulKernel(float* Md, float* Nd, float* Pd, int Width)
    {
        // Pvalue is used to store the element of the matrix
        // that is computed by the thread
        float Pvalue = 0;

Kernel Function (contd.)

        for (int k = 0; k < Width; ++k) {
            float Melement = Md[threadIdx.y * Width + k];
            float Nelement = Nd[k * Width + threadIdx.x];
            Pvalue += Melement * Nelement;
        }
        Pd[threadIdx.y * Width + threadIdx.x] = Pvalue;
    }

Kernel Function (full)

    // Matrix multiplication kernel - per thread code
    __global__ void MatrixMulKernel(float* Md, float* Nd, float* Pd, int Width)
    {
        // Pvalue is used to store the element of the matrix
        // that is computed by the thread
        float Pvalue = 0;
        for (int k = 0; k < Width; ++k) {
            float Melement = Md[threadIdx.y * Width + k];
            float Nelement = Nd[k * Width + threadIdx.x];
            Pvalue += Melement * Nelement;
        }
        Pd[threadIdx.y * Width + threadIdx.x] = Pvalue;
    }

Kernel Invocation (Host Side)

        // Setup the execution configuration
        dim3 dimGrid(1, 1);
        dim3 dimBlock(Width, Width);

        // Launch the device computation threads!
        MatrixMulKernel<<<dimGrid, dimBlock>>>(Md, Nd, Pd, Width);

Only One Thread Block Used
- One block of threads computes matrix Pd; each thread computes one element of Pd
- Each thread:
  - loads a row of matrix Md
  - loads a column of matrix Nd
  - performs one multiply and one addition for each pair of Md and Nd elements
- Compute to off-chip memory access ratio is close to 1:1 (not very high)
- The size of the matrix is limited by the number of threads allowed in a thread block
Handling Arbitrary-Sized Square Matrices
- Have each 2D thread block compute a (TILE_WIDTH)^2 sub-matrix (tile) of the result matrix
- Each block has (TILE_WIDTH)^2 threads
- Generate a 2D grid of (WIDTH/TILE_WIDTH)^2 blocks
Note: you still need to put a loop around the kernel call for cases where WIDTH/TILE_WIDTH is greater than the max grid size (64K)!
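A rough sketch of how the kernel indexing changes once block IDs are involved (illustrative only; assumes TILE_WIDTH divides Width evenly and still uses only global memory):

    #define TILE_WIDTH 16

    __global__ void MatrixMulKernelTiledIndex(float* Md, float* Nd, float* Pd, int Width)
    {
        // Each block computes one TILE_WIDTH x TILE_WIDTH tile of Pd.
        int row = blockIdx.y * TILE_WIDTH + threadIdx.y;
        int col = blockIdx.x * TILE_WIDTH + threadIdx.x;

        float Pvalue = 0;
        for (int k = 0; k < Width; ++k)
            Pvalue += Md[row * Width + k] * Nd[k * Width + col];
        Pd[row * Width + col] = Pvalue;
    }

    // Host-side launch (sketch):
    // dim3 dimBlock(TILE_WIDTH, TILE_WIDTH);
    // dim3 dimGrid(Width / TILE_WIDTH, Width / TILE_WIDTH);
    // MatrixMulKernelTiledIndex<<<dimGrid, dimBlock>>>(Md, Nd, Pd, Width);
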
Small Example
[Figure: a 4 x 4 Pd = Md * Nd computed with TILE_WIDTH = 2. The grid contains four blocks, Block(0,0), Block(1,0), Block(0,1), and Block(1,1), each computing one 2 x 2 tile of Pd.]
Cleanup Topics
- Memory management
  - Pinned memory (zero-transfer)
  - Portable pinned memory
- Multi-GPU
- Wrappers (Python, Java, .NET)
- Kernels
- Atomics
- Thread synchronization (staged reductions)
- NVCC
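As a taste of the first cleanup topic, pinned (page-locked) host memory is allocated with cudaMallocHost and enables faster, asynchronous transfers. A minimal sketch (illustrative values; error checking omitted):

    float* h_pinned = 0;
    float* d_buf = 0;
    size_t nbytes = 1024 * sizeof(float);

    cudaMallocHost((void**)&h_pinned, nbytes);   // page-locked host allocation
    cudaMalloc((void**)&d_buf, nbytes);

    cudaStream_t stream;
    cudaStreamCreate(&stream);

    // Asynchronous copy: returns immediately, can overlap with CPU work.
    cudaMemcpyAsync(d_buf, h_pinned, nbytes, cudaMemcpyHostToDevice, stream);
    // ... independent CPU work here ...
    cudaStreamSynchronize(stream);               // wait for the copy to finish

    cudaStreamDestroy(stream);
    cudaFree(d_buf);
    cudaFreeHost(h_pinned);
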
Questions?
rob@gillenfamily.net
@argodev
http://rob.gillenfamily.net

Editor's Notes

  • #10 Sparse linear algebra is interesting both because many science and engineering codes rely on it, and also because it was traditionally assumed to be something that GPUs would not be good at (because of irregular data access patterns). We have shown that in fact GPUs are extremely good at sparse matrix-vector multiply (SpMV), which is the basic building block of sparse linear algebra. The code and an accompanying white paper are available on the CUDA forums and also posted on research.nvidia.com. This is compared to an extremely well-studied, well-optimized SpMV implementation from a widely respected paper at Supercomputing 2007. That paper only reported double-precision results for CPUs; our single-precision results are even more impressive in comparison.
  • #11 Compared to highly optimized Fortran code from an oceanography researcher at UCLA.
  • #16 The current implementation uses a short-stack approach. Top elements of the stack are cached in registers.
  • #17 RTAPI enables implementation of many different ray-tracing flavors. Left to right, top to bottom: procedural materials, ambient occlusion, Whitted ray tracer (thin-shell glass and metallic spheres), path tracer (Cornell box), refractions, Cook-style distribution ray tracing. Could also do non-rendering work, e.g. GIS (line of sight, say) or physics (collision/proximity detection).