CODE GPU WITH CUDA
CUDA
INTRODUCTION
Created by Marina Kolpakova (Itseez) for cuda.geek
OUTLINE
Terminology
Definition
Programming model
Execution model
Memory model
CUDA kernel
OUT OF SCOPE
CUDA API overview
TERMINOLOGY
Device
CUDA-capable NVIDIA GPU
Device code
code executed on the device
Host
x86/x64/arm CPU
Host code
code executed on the host
Kernel
concrete device function
CUDA
CUDA stands for Compute Unified Device Architecture.
CUDA includes:
1. Capable GPU hardware and driver
2. Device ISA, GPU assembler, Compiler
3. C++-based high-level language, CUDA Runtime
CUDA defines:
programming model
execution model
memory model
PROGRAMMING MODEL
Kernel is executed by many threads
PROGRAMMING MODEL
Threads are grouped into blocks
Each thread has a thread ID
PROGRAMMING MODEL
Thread blocks form an execution grid
Each block has a block ID
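As an illustrative sketch (not from the original deck), the built-in threadIdx, blockIdx, blockDim and gridDim variables combine into a unique position in the grid; here for a hypothetical 2D launch over an image:

__global__ void fill2d(float* img, int width, int height)
{
    int x = blockIdx.x * blockDim.x + threadIdx.x;  // column within the image
    int y = blockIdx.y * blockDim.y + threadIdx.y;  // row within the image
    if (x < width && y < height)                    // grid may overhang the image
        img[y * width + x] = 1.0f;                  // each thread writes one pixel
}

// launched, for example, as:
//   dim3 block(16, 16);
//   dim3 grid((width + 15) / 16, (height + 15) / 16);
//   fill2d<<<grid, block>>>(d_img, width, height);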
EXECUTION (HW MAPPING) MODEL
Single thread is executed on a core
EXECUTION (HW MAPPING) MODEL
Each block is executed by one SM and does not migrate
Number of concurrent blocks that can reside on SM depends on available resources
EXECUTION (HW MAPPING) MODEL
Threads in a block can cooperate via shared memory and synchronization
There is no hardware support for cooperation between threads from different blocks
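A minimal sketch of such cooperation (the kernel name is hypothetical; the block size is assumed equal to CTA_SIZE): threads stage values in shared memory, synchronize with __syncthreads(), then read data written by other threads of the same block:

#define CTA_SIZE 256

// each thread stages one element in shared memory; after the barrier it can
// safely read an element written by another thread of the same block
__global__ void reverse_in_block(float* data)
{
    __shared__ float buffer[CTA_SIZE];
    int base = blockIdx.x * blockDim.x;
    buffer[threadIdx.x] = data[base + threadIdx.x];
    __syncthreads();                                 // all writes are visible now
    data[base + threadIdx.x] = buffer[CTA_SIZE - 1 - threadIdx.x];
}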
EXECUTION (HW MAPPING) MODEL
One or multiple (sm_20+) kernels are executed on the device
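A hedged sketch of concurrent kernel execution (kernel_a, kernel_b and launch_concurrently are hypothetical names): kernels launched into different non-default streams may overlap on sm_20+ devices, though overlap is only permitted, not guaranteed:

__global__ void kernel_a(float* p) { p[threadIdx.x] *= 2.0f; }
__global__ void kernel_b(float* p) { p[threadIdx.x] += 1.0f; }

void launch_concurrently(float* d_a, float* d_b)   // device buffers of >= 256 floats
{
    cudaStream_t s1, s2;
    cudaStreamCreate(&s1);
    cudaStreamCreate(&s2);
    // kernels in different non-default streams may run concurrently on sm_20+
    kernel_a<<<1, 256, 0, s1>>>(d_a);
    kernel_b<<<1, 256, 0, s2>>>(d_b);
    cudaDeviceSynchronize();                        // wait for both streams to finish
    cudaStreamDestroy(s1);
    cudaStreamDestroy(s2);
}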
MEMORY MODEL
Thread has its own registers
MEMORY MODEL
Thread has its own local memory
MEMORY MODEL
Block has shared memory
Pointer to shared memory is valid while block is resident
__shared__ float buffer[CTA_SIZE];
MEMORY MODEL
Grid is able to access global and constant memory
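As a sketch of constant memory usage (coeffs and scale are hypothetical names, not from the deck): a __constant__ variable is visible read-only to every thread in the grid and is uploaded from the host with cudaMemcpyToSymbol:

__constant__ float coeffs[4];           // read-only, visible to the whole grid

__global__ void scale(float* data)
{
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    data[tid] *= coeffs[0];             // served by the constant cache
}

// host side, e.g.:
//   float host_coeffs[4] = { 2.f, 0.f, 0.f, 0.f };
//   cudaMemcpyToSymbol(coeffs, host_coeffs, sizeof(host_coeffs));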
BASIC CUDA KERNEL
Work for GPU threads is represented as a kernel
A kernel represents a task for a single thread (scalar notation)
Every thread in a particular grid executes the same kernel
Threads use their threadIdx and blockIdx to dispatch work
Kernel function is marked with __global__ keyword
Common kernel structure:
1. Retrieving position in grid (widely named tid)
2. Loading data from GPU’s memory
3. Performing compute work
4. Writing back the result into GPU’s memory
__global__ void kernel(float* in, float* out)
{
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    out[tid] = in[tid];
}
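This kernel assumes the grid exactly covers the data. A common variant (an illustrative sketch; the extra n parameter is not in the original) guards against out-of-range threads so the grid size can be rounded up:

__global__ void kernel_guarded(const float* in, float* out, int n)
{
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    if (tid < n)              // threads past the end of the data do nothing
        out[tid] = in[tid];
}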
KERNEL EXECUTION
void execute_kernel(const float* host_in, float* host_out, int size)
{
    float *device_in, *device_out;
    cudaMalloc((void**)&device_in,  size * sizeof(float));
    cudaMalloc((void**)&device_out, size * sizeof(float));
    // 1. Upload data into device memory
    cudaMemcpy(device_in, host_in, size * sizeof(float), cudaMemcpyHostToDevice);
    // 2. Configure kernel launch (size is assumed to be a multiple of 256)
    dim3 block(256);
    dim3 grid(size / 256);
    // 3. Execute kernel
    kernel<<<grid, block>>>(device_in, device_out);
    // 4. Wait till completion (cudaDeviceSynchronize in current CUDA releases)
    cudaDeviceSynchronize();
    // 5. Download results into host memory
    cudaMemcpy(host_out, device_out, size * sizeof(float), cudaMemcpyDeviceToHost);
    cudaFree(device_in);
    cudaFree(device_out);
}
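In real code every runtime call and launch is worth checking; a minimal hypothetical helper (check_cuda is not part of the CUDA API):

#include <cstdio>

// report the status of the most recent launch or runtime call
void check_cuda(const char* where)
{
    cudaError_t err = cudaGetLastError();
    if (err != cudaSuccess)
        printf("%s: %s\n", where, cudaGetErrorString(err));
}

// usage after the launch in execute_kernel():
//   kernel<<<grid, block>>>(device_in, device_out);
//   check_cuda("kernel launch");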
FINAL WORDS
CUDA is a set of capable GPU hardware, a driver, a GPU ISA, a GPU assembler, a compiler, a
C++-based high-level language and a runtime, which together enable programming of NVIDIA GPUs
A CUDA function (kernel) is called on a grid of blocks
A kernel runs on the unified programmable cores
A kernel can access registers and local memory, share memory inside a block of
threads, and access RAM through global, texture and constant memories
THE END
2013–2015 cuda.geek
