GPU with CUDA Architecture
Presented by:
Dhaval Kaneria (13014061010)
Guided by:
Mr. Rajesh K. Navandar

Table Of Contents
• Introduction of GPU
• Performance Factors Of GPU
• GPU Pipeline
• Block Diagram Of Pipeline Process Flow
• Introduction Of CUDA
• Thread Batching
• Simple Processing Flow
• CUDA C/C++
• Applications
• The Future Scope Of CUDA Technology
• Conclusion
• References

Introduction of GPU
• A Graphics Processing Unit (GPU) is a microprocessor designed specifically for
processing 3D graphics.
• The processor is built with integrated transform, lighting, triangle setup/clipping,
and rendering engines, and can handle millions of math-intensive operations
per second.
• GPUs form the heart of modern graphics cards, relieving the central processing
unit (CPU) of much of the graphics processing load. They allow desktop PCs,
portable computers, and game consoles to process real-time 3D graphics that,
only a few years ago, were available only on high-end workstations.
• Used primarily for 3D applications, a GPU is a single-chip processor that creates
lighting effects and transforms objects every time a 3D scene is redrawn. These
mathematically intensive tasks would otherwise put considerable strain on the
CPU; offloading them frees CPU cycles for other jobs.

Performance Factors Of GPU
• Fill Rate:
The number of pixels or texels (textured pixels) that the GPU can render and write to
memory per second. It is a common measure of raw rendering throughput. Modern GPUs have
fill rates as high as 3.2 billion pixels per second. The fill rate can be increased by raising
the GPU clock frequency.
• Memory Bandwidth:
The data transfer rate between the graphics chip and its local frame buffer. More
bandwidth usually gives better performance when the image to be rendered is of high quality
and at very high resolution.
• Memory Management:
GPU performance also depends on how efficiently memory is managed, because memory
bandwidth can become the main bottleneck if it is not managed properly.
• Hidden Surface Removal:
Reducing overdraw when rendering a scene by not rendering surfaces that are not
visible. This can increase GPU performance considerably.
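
As a rough worked example of the first two factors, the arithmetic below assumes a hypothetical GPU with 8 pixel pipelines that each write one pixel per clock at 400 MHz, and a hypothetical 256-bit memory bus at an effective 1 GHz. These numbers are illustrative assumptions, not figures from the slides, and real hardware rarely sustains the theoretical peak.

```latex
\begin{align*}
\text{fill rate} &= N_{\text{pipelines}} \times f_{\text{clock}}
  = 8 \times 400 \times 10^{6}\ \text{pixels/s}
  = 3.2 \times 10^{9}\ \text{pixels/s} \\
\text{memory bandwidth} &= \frac{\text{bus width}}{8\ \text{bits/byte}} \times f_{\text{memory}}
  = \frac{256}{8}\ \text{bytes} \times 10^{9}\ \text{s}^{-1}
  = 32\ \text{GB/s}
\end{align*}
```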

GPU Pipeline
• The GPU receives geometry information from the CPU as input and produces a
picture as output.
• The host interface is the communication bridge between the CPU and the GPU.
• It receives commands from the CPU and also pulls geometry information from
system memory.
• It outputs a stream of vertices in object space with all their associated information
(normals, texture coordinates, per-vertex color, etc.).
• The vertex processing stage receives vertices from the host interface in object
space and outputs them in screen space.
• This may be a simple linear transformation or a complex operation involving
morphing effects.
[Figure: pipeline stages: host interface → vertex processing → triangle setup → pixel processing → memory interface]

GPU Pipeline (cont.)
• A fragment is generated if and only if its center is inside the triangle.
• Every generated fragment has its attributes computed as the
perspective-correct interpolation of the attributes of the three vertices
that make up the triangle.
• Each fragment provided by triangle setup is fed into fragment processing
as a set of attributes (position, normal, texture coordinates, etc.), which
are used to compute the final color for this pixel.
• Before the final write occurs, some fragments are rejected by the
z-buffer, stencil, and alpha tests.

Block Diagram Of Pipeline Process Flow

Block Diagram Of Pipeline Process Flow (cont.)
• Vertex shader: a shader program is applied to each vertex (transformation and
other per-vertex operations)
• Vertex texture fetch: the vertex shader can fetch texture data
• Cull/clip: per-primitive operations and data preparation for rasterization
• Rasterization: primitive-to-pixel mapping
• Z culling: quick pixel elimination based on depth
• Fragment: a candidate pixel; the number of pixel pipelines varies
• SIMD processing hides texture fetch latency

Introduction Of CUDA
•CUDA (Compute Unified Device Architecture) is a parallel computing platform and
programming model created by NVIDIA and implemented by its graphics processing units.

CUDA Programming Model:
A Highly Multithreaded Coprocessor
• The GPU is viewed as a compute device that:
–Is a coprocessor to the CPU or host
–Has its own DRAM (device memory)
–Runs many threads in parallel
• Data-parallel portions of an application are executed on the device as kernels, which
run in parallel on many threads
• Differences between GPU and CPU threads:
–GPU threads are extremely lightweight, with very little creation overhead
–A GPU needs 1000s of threads for full efficiency; a multi-core CPU needs only a few
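
The coprocessor model above can be sketched with a vector addition, where each GPU thread handles a single element. This is a minimal illustration, not code from the slides: the names vecAdd and addOnDevice and the block size of 256 are assumptions, and error checking is omitted for brevity.

```cuda
#include <cuda_runtime.h>
#include <vector>

// Kernel: runs on the device; each lightweight thread handles one element.
__global__ void vecAdd(const float* a, const float* b, float* c, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;  // global thread index
    if (i < n) c[i] = a[i] + b[i];                  // guard the last partial block
}

// Hypothetical host helper: adds two equally sized, non-empty vectors on the GPU.
std::vector<float> addOnDevice(const std::vector<float>& a,
                               const std::vector<float>& b) {
    const int n = static_cast<int>(a.size());
    const size_t bytes = n * sizeof(float);

    float *dA = nullptr, *dB = nullptr, *dC = nullptr;
    cudaMalloc(&dA, bytes);  // device memory is separate from host memory
    cudaMalloc(&dB, bytes);
    cudaMalloc(&dC, bytes);
    cudaMemcpy(dA, a.data(), bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(dB, b.data(), bytes, cudaMemcpyHostToDevice);

    const int threads = 256;                         // threads per block (assumed)
    const int blocks = (n + threads - 1) / threads;  // enough blocks to cover n
    vecAdd<<<blocks, threads>>>(dA, dB, dC, n);

    std::vector<float> c(n);
    // This blocking copy also waits for the kernel to finish.
    cudaMemcpy(c.data(), dC, bytes, cudaMemcpyDeviceToHost);
    cudaFree(dA);
    cudaFree(dB);
    cudaFree(dC);
    return c;
}
```

Calling addOnDevice on two 1000-element vectors launches 4 blocks of 256 threads, so 1024 threads run the kernel and the last 24 do nothing because of the bounds check.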

Thread Batching: Grids and Blocks
•A kernel is executed as a grid of thread blocks
–All threads share data memory space
•A thread block is a batch of threads that can cooperate with each other by:
–Synchronizing their execution, for hazard-free shared memory accesses
–Efficiently sharing data through a low-latency shared memory
•Two threads from two different blocks cannot cooperate
[Figure: the host launches Kernel 1 and Kernel 2 on the device; they run as Grid 1 and Grid 2. A grid is made of Block (0, 0) to Block (2, 1), and a Block (1, 1) is shown expanded into Thread (0, 0) to Thread (4, 2). Courtesy: NVIDIA]
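
Cooperation inside a block can be sketched as follows: every block stages its own chunk of an array in shared memory, synchronizes, and then writes the chunk back reversed. The kernel name reverseTiles, the helper reverseTilesOnDevice, and the tile size of 128 are assumptions made for this illustration; the array length is assumed to be a multiple of the tile size.

```cuda
#include <cuda_runtime.h>
#include <vector>

constexpr int TILE = 128;  // threads per block and elements per chunk (assumed)

// Each block reverses its own TILE-element chunk of the array.
__global__ void reverseTiles(int* data) {
    __shared__ int tile[TILE];       // visible to all threads of this block only
    const int t = threadIdx.x;
    const int base = blockIdx.x * TILE;

    tile[t] = data[base + t];        // every thread stores one element
    __syncthreads();                 // wait until the whole tile is written
    data[base + t] = tile[TILE - 1 - t];  // read an element written by another thread
}

// Hypothetical host helper; h.size() must be a non-zero multiple of TILE.
std::vector<int> reverseTilesOnDevice(std::vector<int> h) {
    const size_t bytes = h.size() * sizeof(int);
    int* d = nullptr;
    cudaMalloc(&d, bytes);
    cudaMemcpy(d, h.data(), bytes, cudaMemcpyHostToDevice);

    const unsigned blocks = static_cast<unsigned>(h.size() / TILE);
    reverseTiles<<<blocks, TILE>>>(d);

    cudaMemcpy(h.data(), d, bytes, cudaMemcpyDeviceToHost);
    cudaFree(d);
    return h;
}
```

Without the barrier, a thread could read a tile entry before the thread responsible for it has written it. Blocks never read each other's tiles, which matches the rule that threads from different blocks cannot cooperate.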

Block and Thread IDs
•Threads and blocks have IDs
–So each thread can decide what data to work on
–Block ID: 1D or 2D (also 3D on devices of compute capability 2.0 and later)
–Thread ID: 1D, 2D, or 3D
•Simplifies memory addressing when processing multidimensional data
–Image processing
–Solving PDEs on volumes
[Figure: Grid 1 on the device, made of Block (0, 0) to Block (2, 1); Block (1, 1) is expanded into Thread (0, 0) to Thread (4, 2). Courtesy: NVIDIA]
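
For the image-processing case, 2D block and thread IDs map directly onto pixel coordinates. The sketch below inverts an 8-bit grayscale image; the names invert and invertOnDevice, the 16 x 16 block shape, and the row-major layout are assumptions for illustration only.

```cuda
#include <cuda_runtime.h>
#include <vector>

// One thread per pixel: 2D block and thread IDs give the pixel coordinates.
__global__ void invert(unsigned char* img, int width, int height) {
    const int x = blockIdx.x * blockDim.x + threadIdx.x;  // column
    const int y = blockIdx.y * blockDim.y + threadIdx.y;  // row
    if (x < width && y < height) {
        const int i = y * width + x;  // row-major offset (assumed layout)
        img[i] = 255 - img[i];
    }
}

// Hypothetical host helper for a non-empty width x height grayscale image.
std::vector<unsigned char> invertOnDevice(std::vector<unsigned char> img,
                                          int width, int height) {
    const size_t bytes = img.size();
    unsigned char* d = nullptr;
    cudaMalloc(&d, bytes);
    cudaMemcpy(d, img.data(), bytes, cudaMemcpyHostToDevice);

    const dim3 block(16, 16);  // 256 threads per block (assumed shape)
    const dim3 grid((width + block.x - 1) / block.x,
                    (height + block.y - 1) / block.y);
    invert<<<grid, block>>>(d, width, height);

    cudaMemcpy(img.data(), d, bytes, cudaMemcpyDeviceToHost);
    cudaFree(d);
    return img;
}
```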

CUDA Device Memory Space Overview
•Each thread can:
–R/W per-thread registers
–R/W per-thread local memory
–R/W per-block shared memory
–R/W per-grid global memory
–Read only per-grid constant memory
–Read only per-grid texture memory
[Figure: the device grid holds global, constant, and texture memory; each block, such as Block (0, 0) and Block (1, 0), has its own shared memory; each thread has its own registers and local memory.]
•The host can R/W global, constant, and texture memories
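
A per-block sum shows three of these memory spaces in one kernel: a register for each thread's value, shared memory for the partial sums of a block, and global memory for input and output. The names blockSum and sumOnDevice and the block size of 256 are assumptions for this sketch. Local memory is not shown, since the compiler typically uses it on its own for register spills and large per-thread arrays.

```cuda
#include <cuda_runtime.h>
#include <vector>

constexpr int BLOCK = 256;  // threads per block; must be a power of two (assumed)

// Each block sums BLOCK consecutive inputs and writes one partial result.
__global__ void blockSum(const float* in, float* out, int n) {
    __shared__ float partial[BLOCK];                  // per-block shared memory
    const int i = blockIdx.x * blockDim.x + threadIdx.x;
    const float x = (i < n) ? in[i] : 0.0f;           // scalar held in a register
    partial[threadIdx.x] = x;
    __syncthreads();

    // Tree reduction in shared memory: halve the active threads each step.
    for (int stride = blockDim.x / 2; stride > 0; stride /= 2) {
        if (threadIdx.x < stride)
            partial[threadIdx.x] += partial[threadIdx.x + stride];
        __syncthreads();
    }
    if (threadIdx.x == 0) out[blockIdx.x] = partial[0];  // per-grid global memory
}

// Hypothetical host helper: returns the sum of a non-empty vector.
float sumOnDevice(const std::vector<float>& h) {
    const int n = static_cast<int>(h.size());
    const int blocks = (n + BLOCK - 1) / BLOCK;
    float *dIn = nullptr, *dOut = nullptr;
    cudaMalloc(&dIn, n * sizeof(float));
    cudaMalloc(&dOut, blocks * sizeof(float));
    cudaMemcpy(dIn, h.data(), n * sizeof(float), cudaMemcpyHostToDevice);

    blockSum<<<blocks, BLOCK>>>(dIn, dOut, n);

    std::vector<float> partials(blocks);
    cudaMemcpy(partials.data(), dOut, blocks * sizeof(float), cudaMemcpyDeviceToHost);
    cudaFree(dIn);
    cudaFree(dOut);

    float total = 0.0f;  // combine the per-block results on the host
    for (float p : partials) total += p;
    return total;
}
```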

Global, Constant, and Texture Memories
•Global memory
–Main means of communicating R/W data between host and device
–Contents visible to all threads
•Texture and Constant Memories
–Constants initialized by host
–Contents visible to all threads
[Figure: device memory spaces (global, constant, texture, shared, local, registers) and the host. Courtesy: NVIDIA]
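
Host-initialized constant memory can be sketched with a small polynomial evaluation: the host writes three coefficients with cudaMemcpyToSymbol, and every thread then reads them. The symbol kCoeffs and the functions evalPoly and evalPolyOnDevice are hypothetical names chosen for this example.

```cuda
#include <cuda_runtime.h>
#include <vector>

// Per-grid constant memory: written by the host, read-only inside kernels.
__constant__ float kCoeffs[3];

// Evaluates c0 + c1*x + c2*x^2 for each input, reading coefficients from constant memory.
__global__ void evalPoly(const float* x, float* y, int n) {
    const int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) y[i] = kCoeffs[0] + x[i] * (kCoeffs[1] + x[i] * kCoeffs[2]);
}

// Hypothetical host helper; `coeffs` points to 3 floats and `x` is non-empty.
std::vector<float> evalPolyOnDevice(const float* coeffs, const std::vector<float>& x) {
    const int n = static_cast<int>(x.size());
    const size_t bytes = n * sizeof(float);

    // The host initializes constant memory before the launch.
    cudaMemcpyToSymbol(kCoeffs, coeffs, 3 * sizeof(float));

    float *dX = nullptr, *dY = nullptr;
    cudaMalloc(&dX, bytes);  // global memory carries the bulk data both ways
    cudaMalloc(&dY, bytes);
    cudaMemcpy(dX, x.data(), bytes, cudaMemcpyHostToDevice);

    evalPoly<<<(n + 255) / 256, 256>>>(dX, dY, n);

    std::vector<float> y(n);
    cudaMemcpy(y.data(), dY, bytes, cudaMemcpyDeviceToHost);
    cudaFree(dX);
    cudaFree(dY);
    return y;
}
```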

Simple Processing Flow
1. Copy input data from CPU memory to GPU memory
2. The CPU instructs the GPU to start processing (kernel launch)
3. Load GPU program and execute, caching data on chip for performance
4. Copy results from GPU memory to CPU memory
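
The four steps above map onto CUDA runtime calls roughly as follows. This is a minimal sketch that squares an array in place; the names square and squareOnDevice are assumptions, and a real program would check the return value of every runtime call.

```cuda
#include <cuda_runtime.h>
#include <vector>

// GPU program: squares each element in place.
__global__ void square(float* data, int n) {
    const int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= data[i];
}

// Hypothetical host helper that follows the four-step flow for a non-empty vector.
std::vector<float> squareOnDevice(std::vector<float> h) {
    const int n = static_cast<int>(h.size());
    const size_t bytes = n * sizeof(float);
    float* d = nullptr;
    cudaMalloc(&d, bytes);

    // Step 1: copy input data from CPU memory to GPU memory.
    cudaMemcpy(d, h.data(), bytes, cudaMemcpyHostToDevice);

    // Steps 2 and 3: the CPU launches the kernel; the GPU loads and executes it.
    square<<<(n + 255) / 256, 256>>>(d, n);

    // Step 4: copy results from GPU memory back to CPU memory.
    cudaMemcpy(h.data(), d, bytes, cudaMemcpyDeviceToHost);

    cudaFree(d);
    return h;
}
```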

CUDA C/C++
• CUDA Language:
C with Minimal Extensions
• Philosophy: provide minimal set of extensions necessary to expose power
• Declaration specifiers to indicate where things live
__global__ void KernelFunc(...); // kernel function, runs on device
__device__ int GlobalVar; // variable in device memory
__shared__ int SharedVar; // variable in per-block shared memory
• Extend function invocation syntax for parallel kernel launch
KernelFunc<<<500, 128>>>(...); // launch 500 blocks w/ 128 threads each
• Special variables for thread identification in kernels
uint3 threadIdx; uint3 blockIdx; dim3 blockDim; dim3 gridDim;
• Intrinsics that expose specific operations in kernel code
__syncthreads(); // barrier synchronization within kernel
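
The built-in variables can be combined in a grid-stride loop, so that the fixed launch of 500 blocks with 128 threads each covers an array of any length. The kernel saxpy (y = a*x + y) and the helper saxpyOnDevice are hypothetical names used only for this sketch.

```cuda
#include <cuda_runtime.h>
#include <vector>

// y = a * x + y, written as a grid-stride loop over all four built-in variables.
__global__ void saxpy(int n, float a, const float* x, float* y) {
    const int first = blockIdx.x * blockDim.x + threadIdx.x;  // this thread's first element
    const int stride = gridDim.x * blockDim.x;                // total threads in the grid
    for (int i = first; i < n; i += stride)
        y[i] = a * x[i] + y[i];
}

// Hypothetical host helper; x and y must have the same non-zero size.
std::vector<float> saxpyOnDevice(float a, const std::vector<float>& x,
                                 std::vector<float> y) {
    const int n = static_cast<int>(x.size());
    const size_t bytes = n * sizeof(float);
    float *dX = nullptr, *dY = nullptr;
    cudaMalloc(&dX, bytes);
    cudaMalloc(&dY, bytes);
    cudaMemcpy(dX, x.data(), bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(dY, y.data(), bytes, cudaMemcpyHostToDevice);

    saxpy<<<500, 128>>>(n, a, dX, dY);  // 500 blocks with 128 threads each

    cudaMemcpy(y.data(), dY, bytes, cudaMemcpyDeviceToHost);
    cudaFree(dX);
    cudaFree(dY);
    return y;
}
```

With 100000 elements and 64000 threads, the first 36000 threads process two elements each and the rest process one.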

Applications
•3D image analysis
•Adaptive radiation therapy
•Astronomy
•Automobile vision
•Bioinformatics
•Biological simulation
•Broadcast
•Computational Fluid Dynamics
•Computer Vision
•Cryptography
•CT reconstruction
•Data Mining
•Electromagnetic simulation
•Equity training
•Financial - lots of areas
•Mathematics research
•Military (lots)
•Mine planning
•Molecular dynamics
•MRI reconstruction
•Network processing
•Neural networks
•Protein folding
•Quantum chemistry
•Ray tracing
•Radar
•Reservoir simulation
•Robotic vision/AI
•Robotic surgery
•Satellite data analysis
•Seismic imaging
•Surgery simulation
Simulation Result
•If the CUDA software is installed and configured correctly, the deviceQuery sample should list the detected CUDA device(s) and finish with Result = PASS
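
A much smaller query than the deviceQuery sample can be sketched with the CUDA runtime API. The function name printDevices is an assumption, and the fields printed here are only a subset of what the real sample reports.

```cuda
#include <cuda_runtime.h>
#include <cstdio>

// Prints a few properties of every CUDA device and returns the device count.
// Returns 0 if the runtime reports an error (for example, no driver installed).
int printDevices() {
    int count = 0;
    if (cudaGetDeviceCount(&count) != cudaSuccess) return 0;

    for (int dev = 0; dev < count; ++dev) {
        cudaDeviceProp prop;
        cudaGetDeviceProperties(&prop, dev);
        std::printf("Device %d: %s\n", dev, prop.name);
        std::printf("  Compute capability: %d.%d\n", prop.major, prop.minor);
        std::printf("  Global memory: %zu MiB\n", prop.totalGlobalMem >> 20);
        std::printf("  Multiprocessors: %d\n", prop.multiProcessorCount);
        std::printf("  Shared memory per block: %zu bytes\n", prop.sharedMemPerBlock);
        std::printf("  Max threads per block: %d\n", prop.maxThreadsPerBlock);
    }
    return count;
}
```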

•Valid results from the bandwidthTest CUDA sample

• Create an array of size BLOCKS, allocate space for the array on the device, and
call
generateArray<<<BLOCKS,1>>>( deviceArray );
• The kernel now runs as BLOCKS parallel blocks of one thread each, creating the
entire array in one call.
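
A complete version of this experiment might look like the sketch below. The slides do not show the body of generateArray, so the rule used here (each block stores its own block index) and the value BLOCKS = 16 are assumptions, as is the helper name generateOnDevice.

```cuda
#include <cuda_runtime.h>
#include <vector>

constexpr int BLOCKS = 16;  // array size and number of blocks (assumed value)

// Assumed kernel body: each single-thread block fills the slot matching its block index.
__global__ void generateArray(int* deviceArray) {
    deviceArray[blockIdx.x] = blockIdx.x;
}

// Hypothetical host helper: builds the array on the device and copies it back.
std::vector<int> generateOnDevice() {
    int* deviceArray = nullptr;
    cudaMalloc(&deviceArray, BLOCKS * sizeof(int));

    generateArray<<<BLOCKS, 1>>>(deviceArray);  // BLOCKS blocks, 1 thread each

    std::vector<int> host(BLOCKS);
    cudaMemcpy(host.data(), deviceArray, BLOCKS * sizeof(int), cudaMemcpyDeviceToHost);
    cudaFree(deviceArray);
    return host;
}
```

One thread per block keeps the example simple but leaves most of each multiprocessor idle; a launch with more threads per block would normally be preferred.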

The Future Scope Of CUDA Technology
• Most current research focuses on general-purpose GPU computing. Because GPUs offer
highly efficient and flexible parallel programmability, a growing number of
researchers and businesses have started to use GPUs for non-graphics
calculations, creating a new field of study: GPGPU (General-Purpose computation
on GPU). Its objective is to use GPUs for a broader range of scientific computing.
GPGPU has been used successfully in algebra, fluid simulation, database
applications, spectrum analysis, and other non-graphical applications.
• Region-based Software Virtual Memory (RSVM): a software virtual memory running
on both CPU and GPU in a distributed and cooperative way.
• Size reduction
• Cooling techniques

Conclusion
• CUDA is a powerful parallel programming model
–Heterogeneous: mixed serial-parallel programming
–Scalable: hierarchical thread execution model
–Accessible: minimal but expressive changes to C
• CUDA on GPUs can achieve great results on data-parallel computations with a
few simple performance optimization strategies:
–Structure your application and select execution configurations to maximize
exploitation of the GPU’s parallel capabilities.
–Minimize CPU ↔ GPU data transfers.
–Coalesce global memory accesses.
–Take advantage of shared memory.
–Minimize divergent warps.
–Minimize use of low-throughput instructions.
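
The coalescing advice can be illustrated with two kernels that scale the same row-major matrix. In the first, neighbouring threads touch neighbouring addresses; in the second, they are a full row apart. Both give the same result, and the expected speed difference is not measured here. The names scaleCoalesced, scaleStrided, and scaleOnDevice are assumptions for this sketch.

```cuda
#include <cuda_runtime.h>
#include <vector>

// Coalesced: consecutive threads (threadIdx.x) access consecutive addresses in a row.
__global__ void scaleCoalesced(float* m, int cols, float s) {
    const int col = blockIdx.x * blockDim.x + threadIdx.x;
    const int row = blockIdx.y;
    if (col < cols) m[row * cols + col] *= s;
}

// Strided: consecutive threads access addresses that are `cols` elements apart.
__global__ void scaleStrided(float* m, int rows, int cols, float s) {
    const int row = blockIdx.x * blockDim.x + threadIdx.x;
    const int col = blockIdx.y;
    if (row < rows) m[row * cols + col] *= s;
}

// Hypothetical host helper: scales a non-empty rows x cols row-major matrix by s.
std::vector<float> scaleOnDevice(std::vector<float> m, int rows, int cols,
                                 float s, bool coalesced) {
    const size_t bytes = m.size() * sizeof(float);
    float* d = nullptr;
    cudaMalloc(&d, bytes);
    cudaMemcpy(d, m.data(), bytes, cudaMemcpyHostToDevice);

    const int threads = 128;  // threads per block (assumed)
    if (coalesced) {
        const dim3 grid((cols + threads - 1) / threads, rows);
        scaleCoalesced<<<grid, threads>>>(d, cols, s);
    } else {
        const dim3 grid((rows + threads - 1) / threads, cols);
        scaleStrided<<<grid, threads>>>(d, rows, cols, s);
    }

    cudaMemcpy(m.data(), d, bytes, cudaMemcpyDeviceToHost);
    cudaFree(d);
    return m;
}
```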

References
1. Xiao Yang, Shamik K. Valia, Michael J. Schulte, Ruby B. Lee, “Exploration and Evaluation
of PLX Floating-point Instructions and Implementations for 3D Graphics”, IEEE, 2004.
2. Lei Wang, Yong-zhong Huang, Xin Chen, Chun-yan Zhang, “Task Scheduling of Parallel
Processing in CPU-GPU Collaborative Environment”, CSIT, 2008.
3. Feng Ji, Heshan Lin, Xiaosong Ma, “RSVM: a Region-based Software Virtual Memory
for GPU”, IEEE, 2013.
4. Nathan Whitehead, Alex Fit-Florea, “CUDA_Architecture_Overview”, NVIDIA
Corporation.
5. Cyril Zeller, “CUDA C/C++ Basics”, NVIDIA Corporation.
6. Mark Harris, “Optimizing Parallel Reduction in CUDA”, NVIDIA Developer
Technology.
