GPGPU programming with CUDA

Prepared by
SAVITH S
14CA65
MCA, II Sem
NIT-K, Surathkal
19/17/2015 Final Review Content 1
INTRODUCTION
 GPU stands for Graphics Processing Unit (also called a visual processing unit, VPU); in simple terms, it is our graphics card.
 An electronic circuit that accelerates the creation of images in a frame buffer and enhances output quality.
 It generally interacts with the motherboard through a PCI Express (PCIe) or AGP port.
 A very efficient tool for manipulating computer graphics.
 Today, parallel GPUs have begun making computational inroads against the CPU.
 The GPU's parallel behavior gives it a wide range of uses, and this led to the development of GPGPU.
ABSTRACT
GPGPU stands for general-purpose computing on a graphics processing unit: the GPU is used for algorithms that are traditionally run on the CPU. This makes such algorithms much faster to execute and saves processing time, so a vast range of applications becomes possible with this concept. Any GPU providing a functionally complete set of operations performed on arbitrary bits can compute any computable value. Additionally, the use of multiple graphics cards in one computer, or large numbers of graphics chips, further parallelizes the already parallel nature of graphics processing. So by customizing a GPU we can implement a GPGPU that can be up to a hundred times faster than a traditional CPU.
GPGPU
 Utilization of the GPU for various computations that are traditionally handled by the CPU.
 Can perform any set of operations very accurately and can compute any computable value.
 The use of multiple graphics cards in one computer, or large numbers of graphics chips, further parallelizes the already parallel nature of graphics processing.
 This concept turns the massive computational power of a modern graphics accelerator's shader pipeline into general-purpose computing power.
STREAM PROCESSING
 A STREAM is a set of records that need the same type of computation.
 In a traditional GPU we can read multiple independent records simultaneously, perform operations, and write multiple outputs, but we never have a piece of memory that is both readable and writable.
 Utilization of the massive computational power of a GPU for general-purpose, CPU-style operations.
 A generalization of the GPU.
CUDA
 CUDA stands for Compute Unified Device Architecture, developed by NVIDIA Corporation.
 It is used to develop software for graphics processors, and to develop a variety of general-purpose applications for GPUs that are highly parallel in nature and run on hundreds of GPU processor cores.
 CUDA is supported only on NVIDIA GPUs based on the Tesla architecture. The graphics cards that support CUDA are the GeForce 8 series, Quadro, and Tesla.
 CUDA programs contain specific functions, called kernels. A kernel is executed N times in parallel on the GPU, using N threads.
EXISTING SYSTEM
 The existing system may be a traditional GPU or a CPU.
 A CPU is used in the normal case; it has lower performance and throughput compared to the GPU, and is less parallel in nature.
 A GPU is the basic form of a GPGPU, but it is used only for graphics-accelerating applications such as game consoles, high-definition images, computer-aided design, etc.
PROPOSED SYSTEM
 The GPGPU solves the problems of a traditional CPU through its highly parallel nature.
 In principle, any Boolean function can be built up from a functionally complete set of logic operators.
 GPGPU applications need high arithmetic intensity; otherwise, memory access latency will limit the computational speedup.
 Ideal GPGPU applications have large data sets, high parallelism, and minimal dependency between data elements.
EXPECTED FUNCTIONALITIES
 High performance: up to about a hundred times faster than a traditional CPU.
 Highly parallel behavior.
 Contains multiple cores, each able to execute independently.
 Installing multiple GPUs in a single system further multiplies its capabilities.
 Can be customized for any purpose using different platforms such as CUDA, OpenCL, etc.
 Major upcoming applications in high-performance computing areas.
WORKING
(Diagram slide; image not reproduced in this export.)
SYSTEM IMPLEMENTATION
 Implementing the system consists of assembling the GPU hardware in a normal or special-purpose computer system and installing the software.
 No separate software is needed for customizing a GPU. It can be done in CUDA C or CUDA C++ with NVIDIA's compiler.
 Programming proceeds in the usual way, but we have to include some additional header files, such as #include <mpi.h> (explained in a later section), in the program.
HARDWARE REQUIREMENTS
 GPU shader cores, which run GPU kernels, are both parallel and deeply multithreaded to provide significant computational power, currently on the order of a teraflop per GPU.
 Graphics memory, which is directly accessible by GPU kernels, has a high clock rate and wide bus width to provide substantial bandwidth, currently about a hundred gigabytes per second.
 GPU interconnect, providing mainboard access to the GPU. This is typically PCI Express, and so delivers a few gigabytes per second of bandwidth.
 Mainboard RAM, which is directly accessible by CPU programs and the network.
 CPU cores, which are deeply pipelined and superscalar to provide good performance on sequential programs.
 Network hardware, which moves bytes between nodes. We model its performance with the simple latency-plus-bandwidth model.
THE GRID AND BLOCK STRUCTURE
 The grid consists of one-dimensional, two-dimensional or three-dimensional thread blocks.
 Each thread block is further divided into one-dimensional or two-dimensional threads.
 A thread block is a set of threads running on one processor.
 All of this thread creation, execution, and termination is automatic, handled by the GPU, and invisible to the programmer.
 The user only needs to specify the number of threads in a thread block and the number of thread blocks in a grid.
SINGLE PROGRAM MULTIPLE DATA (SPMD) & MPICH
 The GPU is suited to single-program multiple-data (SPMD) parallel calculations and works well with the message-passing interface (MPI) approach to programming.
 In the SPMD concept, there is only a single program controlling the various activities performed on the GPU.
 There is no direct connection between the network device and GPU memory.
 Thus, to send GPU data across the network, we must first copy the send-side GPU data to CPU memory.
 We then use a standard CPU interface such as MPI, and finally copy the received data from CPU memory into GPU memory.
 MPICH is a freely available, portable implementation of MPI, a message-passing standard for distributed-memory applications used in parallel computing.
 The "CH" part of the name was derived from "Chameleon", a portable parallel programming library developed by William Gropp, one of the founders of MPICH.
 After installing MPICH, we have to create a user with useradd or the GUI. Also set a password in the .mpd.conf file:
MPD_SECRETWORD=password
 Here password refers to the password given for that user ID.
 Next, change the read/write/execute permissions of .mpd.conf using chmod 600 .mpd.conf. Then create a file named mpd.hosts containing the following:
Master
Node1
Node2
..
Node m-1
where m in "Node m-1" refers to the total number of nodes.
 Next, to boot MPICH, type:
mpdboot -n m -r ssh -f mpd.hosts
Given below is a sample program using MPI.

#include <stdio.h>
#include <mpi.h>

int main(int argc, char **argv) {
    int i, root = 0, rank, size, mysum, total;
    MPI_Init(&argc, &argv);
    /* gets the rank (identity) of this processor */
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    /* gets the total number of available processors */
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    mysum = 0;
    total = 0;
    for (i = rank + 1; i <= 100; i = i + size)
        mysum = mysum + i;
    /* Adds all the partial sums (mysum) and stores the result in total
       at the root, using the MPI_SUM reduction */
    MPI_Reduce(&mysum, &total, 1, MPI_INT, MPI_SUM, root, MPI_COMM_WORLD);
    if (rank == 0)
        printf("The total is %d\n", total);
    MPI_Finalize();
    return 0;
}

In this program, the available processors together add the integers from 1 up to 100. The first processor adds 1, size+1, 2*size+1, ...; the second processor adds 2, size+2, 2*size+2, ...; and so on.
APPLICATIONS
Research: higher education and supercomputing.
Computational chemistry and biology.
Bioinformatics.
Molecular dynamics.
High-performance computing (HPC) clusters.
Grid computing.
Audio signal processing.
Scientific computing.
CONCLUSION AND FUTURE WORK
 It is clear that by using GPGPU we can process many records of data in parallel, and thus obtain high performance.
 NVIDIA's CUDA is well suited for building the GPGPU platform.
 We have presented and benchmarked cudaMPI and glMPI, message-passing libraries for distributed-memory GPU clusters.
 Many functions and variables of MPI are still under development.
 A common platform for all GPUs is needed and is under development.
Interactive Section