Performance Optimization of Clustering on GPU
Vijal D. Patel, Prof. Sumitra Menaria
C.S.E. Dept., Parul Institute of Engineering & Technology
IJSRD - International Journal for Scientific Research & Development | Vol. 1, Issue 4, 2013 | ISSN (online): 2321-0613 (IJSRD/Vol. 1/Issue 4/2013/0014)
All rights reserved by www.ijsrd.com
Abstract--In today’s digital world, data sets are growing exponentially. Statistical analysis using clustering in various scientific and engineering applications becomes very challenging for such large data sets. Clustering over huge data sets and its performance are two major factors that demand optimization. Parallelization is a well-known approach to optimizing performance, and recent research shows that GPU-based parallelization helps achieve a high degree of performance. Hence, this paper focuses on optimizing hierarchical clustering algorithms using parallelization. It covers the implementation of optimized algorithms in various parallel environments, using OpenMP on multi-core architectures and CUDA on many-core architectures.
I. INTRODUCTION
Clustering is the task of grouping a set of objects in such a way that objects in the same group (called a cluster) are more similar (in some sense) to each other than to those in other groups (clusters). It is a main task of exploratory data mining and a common technique for statistical data analysis used in many fields, including machine learning, pattern recognition, image analysis, information retrieval, and bioinformatics.
A "clustering" is essentially a set of such clusters, usually containing all objects in the data set. Additionally, it may specify the relationship of the clusters to each other, for example a hierarchy of clusters embedded in each other. Clustering can be roughly divided into:
1) Hard clustering: each object either belongs to a cluster or does not
2) Soft clustering (fuzzy clustering): each object belongs to each cluster to a certain degree (e.g. a likelihood of belonging to the cluster)
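The hard/soft distinction can be made concrete with a tiny sketch (not from the paper; 1-D points, two hypothetical cluster centres, illustrative helper names):

```python
# Hard vs. soft assignment of a point to two cluster centres on a line.

def hard_assign(x, centres):
    """Hard clustering: the point belongs to exactly one (nearest) cluster."""
    return min(range(len(centres)), key=lambda k: abs(x - centres[k]))

def soft_assign(x, centres):
    """Soft (fuzzy) clustering: degrees of membership summing to 1,
    here based on inverse distance to each centre."""
    eps = 1e-9  # avoid division by zero when x sits exactly on a centre
    inv = [1.0 / (abs(x - c) + eps) for c in centres]
    total = sum(inv)
    return [w / total for w in inv]

centres = [0.0, 10.0]
print(hard_assign(2.0, centres))   # nearest centre wins outright
print(soft_assign(2.0, centres))   # graded membership in both clusters
```

The same point gets a single label under hard clustering but a membership vector under soft clustering.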
A. Classification of Clustering
Traditionally, clustering techniques are broadly divided into [1]:
1) Hierarchical methods
2) Partitioning Methods
3) Density Based Algorithms
4) Grid Based Clustering
B. Hierarchical Methods
Hierarchical clustering is a method of cluster analysis that seeks to build a hierarchy of clusters. It generally falls into two types:
1) Agglomerative algorithms
2) Divisive algorithms
In hierarchical clustering the data are not partitioned into particular clusters in a single step. Instead, a series of partitions takes place, which may run from a single cluster containing all objects to n clusters each containing a single object. Agglomerative methods proceed by a series of fusions of the n objects into groups, while divisive methods separate the n objects successively into finer groupings. Hierarchical clustering may be represented by a two-dimensional diagram known as a dendrogram, which illustrates the fusions or divisions made at each successive stage of the analysis.
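The agglomerative "series of fusions" can be sketched in a few lines of plain Python (illustrative only, not the paper's implementation; 1-D points and single-linkage distance are assumptions):

```python
# Minimal agglomerative (bottom-up) clustering: start with one cluster per
# point and repeatedly fuse the closest pair until the target count remains.

def single_linkage_distance(a, b):
    """Distance between two clusters = closest pair of their members."""
    return min(abs(x - y) for x in a for y in b)

def agglomerate(points, target_clusters):
    clusters = [[p] for p in points]
    while len(clusters) > target_clusters:
        # Find the closest pair of clusters and fuse them.
        i, j = min(
            ((i, j) for i in range(len(clusters))
                    for j in range(i + 1, len(clusters))),
            key=lambda ij: single_linkage_distance(clusters[ij[0]],
                                                   clusters[ij[1]]),
        )
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return clusters

print(agglomerate([1.0, 1.2, 5.0, 5.1, 9.0], 3))
```

Recording each fusion (instead of stopping at a target count) would yield exactly the dendrogram described above.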
C. Partitioning Methods
The partitioning methods generally result in a set of M
clusters, each object belonging to one cluster. Each cluster
may be represented by a centroid or a cluster
representative; this is some sort of summary description of
all the objects contained in a cluster. The precise form of
this description will depend on the type of the object
which is being clustered. Where real-valued data are available, the arithmetic mean of the attribute vectors of all objects within a cluster provides an appropriate representative; alternative types of centroid may be required in other cases, e.g. a cluster of documents can be represented by a list of those keywords that occur in some minimum number of documents within the cluster. If the number of clusters is large, the centroids can be further clustered to produce a hierarchy within the data set.
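For real-valued data, the centroid described above is just the component-wise arithmetic mean of the member vectors; a minimal sketch (hypothetical helper name):

```python
# Cluster representative for real-valued data: the arithmetic mean of the
# members' attribute vectors, computed per dimension.

def centroid(vectors):
    n = len(vectors)
    dims = len(vectors[0])
    return [sum(v[d] for v in vectors) / n for d in range(dims)]

cluster = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
print(centroid(cluster))  # [3.0, 4.0]
```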
D. Density-Based Algorithms
Density-based algorithms are capable of discovering
clusters of arbitrary shapes. Also this provides a natural
protection against outliers. These algorithms group
objects according to specific density objective functions.
Density is usually defined as the number of objects in a particular neighborhood of a data object. In these approaches a given cluster continues growing as long as the number of objects in the neighborhood exceeds some parameter.
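The neighborhood-count definition of density can be written directly (an illustrative sketch with 1-D points; `eps` and `min_pts` are hypothetical names for the neighborhood radius and the growth threshold, in the spirit of density-based methods such as DBSCAN):

```python
# Density of a point = number of other objects within an eps-neighbourhood.
# A cluster would keep growing while this count meets the min_pts threshold.

def neighbourhood(points, p, eps):
    return [q for q in points if q != p and abs(q - p) <= eps]

def is_dense(points, p, eps, min_pts):
    return len(neighbourhood(points, p, eps)) >= min_pts

data = [1.0, 1.1, 1.2, 5.0]
print(is_dense(data, 1.1, 0.5, 2))  # dense: 1.0 and 1.2 are nearby
print(is_dense(data, 5.0, 0.5, 2))  # not dense: an isolated outlier
```

The outlier at 5.0 never satisfies the density test, which is the "natural protection against outliers" noted above.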
E. Grid Based Clustering
These methods focus on spatial data, i.e. data that model the geometric structure of objects in space, along with their relationships, properties, and operations. The technique quantizes the data set into a number of cells and then works with the objects belonging to these cells. It does not relocate points but rather builds several hierarchical levels of groups of objects. The merging of grids, and consequently of clusters, does not depend on a distance measure; it is determined by a predefined parameter.
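The quantization step can be sketched as follows (illustrative; the cell side length stands in for the "predefined parameter" mentioned above):

```python
# Grid-based quantization: map each 2-D point to the integer cell it falls
# into, then group points by cell instead of by pairwise distances.
import math
from collections import defaultdict

def quantize(points, cell_size):
    cells = defaultdict(list)
    for x, y in points:
        key = (math.floor(x / cell_size), math.floor(y / cell_size))
        cells[key].append((x, y))
    return dict(cells)

pts = [(0.2, 0.3), (0.4, 0.1), (2.5, 2.5)]
print(quantize(pts, 1.0))
```

Clusters then operate on cells rather than individual points, which is why no distance measure is needed at merge time.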
1) Mode of Performance Optimization
Performance optimization is a foremost objective when we perform many computations on a processor.
The following techniques are used for performance optimization:
1) Efficient resources (devices)
2) An optimized computing algorithm
3) Best utilization of resources at lower cost (parallel processing)
This paper proposes a mechanism for high-performance computing.
II. GPU ARCHITECTURE
GPUs (Graphics Processing Units) have attracted great interest over the last decade as a way to achieve high-performance computing through a massively data-parallel model. A GPU has hundreds of processor cores. GPUs were designed for graphics, where the same operation is applied to many independent vertices and pixels in parallel; they are therefore well suited to stream computations in the SIMD (Single Instruction, Multiple Data) style. For higher efficiency, a GPU processes many data elements in parallel with the same program module.
Over the last decade, a tremendous rise in computing power has been observed in response to the demand for high-performance computing. Although CPUs have constantly increased in performance, their growth has been overshadowed by the improvement in GPU performance. This trend has drawn researchers with high computational requirements to GPUs as a possible solution, and has led to the emergence of the term GPGPU (General-Purpose computing on Graphics Processing Units).
GPGPU is defined as general-purpose computation that uses the GPU and graphics APIs for applications other than 3D graphics. To achieve higher efficiency, GPGPU combines hardware components and software that allow a traditional GPU (graphics processor) to perform non-graphical computing tasks with high processing power [2].
Fig. 1 describes the GPU architecture. Each GPU has a number of Streaming Multiprocessors (SMs), e.g. 30 SMs, and each streaming multiprocessor has eight streaming processor cores. This architecture is designed to perform the same operation on a collection of records (data parallelism); the GPU therefore uses streaming processors as its cores to achieve massive data parallelism. Each NVIDIA GPU has hundreds of parallel cores, and each core contains a floating-point unit, a logic unit (add, sub, mul), a move-and-compare unit, and a branch unit.
Figure (1): GPU Architecture [2]
Each core is managed by a thread manager, which can spawn and manage hundreds of threads per core. A very important feature is the use of lightweight threads, which have little creation overhead and zero-overhead scheduling. The GPU is a highly parallel compute coprocessor with its own high-bandwidth device memory.
III. CUDA
Compute Unified Device Architecture (CUDA) is the programming model and set of APIs for performing computational tasks on NVIDIA GPUs. CUDA is a scalable, highly parallel, multithreaded programming model: a program can execute on any number of processors without being recompiled. CUDA provides an efficient programming environment for regular applications. It is designed to support many programming languages, as shown in Fig. 2, by extending them for computation on NVIDIA GPUs. CUDA C extends the C language with the CUDA API to operate on NVIDIA GPUs: the programmer writes a program for one thread, and it is instantiated across many parallel threads.
Figure (2): CUDA Interface [2]
A. CUDA Programming and Execution Model
Figure (3): CUDA execution model [2]
The programming model provided for designing and executing general-purpose computation kernels on NVIDIA GPUs is named Compute Unified Device Architecture (CUDA). The CUDA programming model is available for the most common operating systems, including Windows and Linux, and allows a programmer to use the GPU for parallel programming without any knowledge of graphics programming.
Fig. 3 describes the CUDA computational model, which executes on both the CPU and the GPU. A CUDA program consists of a sequential host program and one or more kernels, as shown in Fig. 3. Serial code is executed by a CPU thread, while kernel code is executed by GPU threads grouped into blocks.
A kernel here is the portion of computational code executed on the NVIDIA GPU device. A CUDA kernel executes over a grid of thread blocks, where a thread block is a group of threads that works in the SIMT (Single Instruction, Multiple Thread) model. Threads within a block cooperate with each other through shared memory and by synchronizing their execution; threads in different blocks operate independently. A grid can have 1 or 2 dimensions, and a thread block can have 1, 2, or 3 dimensions, as shown in Fig. 3. A grid can have up to 65,535 blocks in each dimension, and a thread block can support 768 or 1,024 threads, depending on the GPU architecture. Each thread within a block, and each block within a grid, has a unique identifier.
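The per-block and per-thread identifiers combine into a unique global index; in CUDA C this is conventionally `blockIdx.x * blockDim.x + threadIdx.x`. The mapping can be modelled in plain Python for illustration:

```python
# Model of how a 1-D CUDA launch derives a unique global index per thread:
# global_id = blockIdx.x * blockDim.x + threadIdx.x

def global_ids(num_blocks, threads_per_block):
    ids = []
    for block_idx in range(num_blocks):          # blockIdx.x
        for thread_idx in range(threads_per_block):  # threadIdx.x
            ids.append(block_idx * threads_per_block + thread_idx)
    return ids

# 3 blocks of 4 threads cover indices 0..11, each exactly once.
print(global_ids(3, 4))
```

Each data element therefore maps to exactly one thread, which is what makes the SIMT model described above work without inter-block coordination.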
IV. MCL ALGORITHM
MCL uses two simple algebraic operations, expansion and inflation, on the stochastic (Markov) matrix associated with a graph. The Markov matrix M associated with a graph G is defined by normalizing all columns of the adjacency matrix of G. The clustering process simulates random walks (or flow) within the graph using expansion operations, and then strengthens the flow where it is already strong and weakens it where it is weak using inflation operations. By continually alternating these two operations, the underlying structure of the graph gradually becomes apparent, and the process converges to a result in which regions of strong internal flow (clusters) are separated by boundaries across which flow is absent.
// G is a graph with associated Markov matrix M; r > 1 is a real number
// Raising each entry of M to the power r and rescaling each column to
// sum to 1 is written Γ_r(M); Γ_r is called the inflation operator
// with power coefficient r: Γ_r(M)_ij = (M_ij)^r / Σ_k (M_kj)^r
add self-loops to G
set r to some value > 1
set M_1 to be the matrix of random walks on G
while (change)
{
    M_2 = M_1 * M_1            # expansion
    M_1 = Γ_r(M_2)             # inflation
    change = difference(M_1, M_2)
}
set CLUSTERING as the components of M_1
Listing (1): MCL Algorithm [12]
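Listing 1 can be exercised as a small plain-Python sketch (illustrative only, not the paper's GPU implementation; the parameter r = 2.0, the convergence tolerance, and the cluster-extraction threshold are our assumptions):

```python
# MCL loop in plain Python: add self-loops, build the column-stochastic
# Markov matrix, then alternate expansion (M*M) and inflation
# (entrywise power r, column renormalization) until M stops changing.

def column_normalize(m):
    n = len(m)
    cols = [sum(m[i][j] for i in range(n)) for j in range(n)]
    return [[m[i][j] / cols[j] if cols[j] else 0.0 for j in range(n)]
            for i in range(n)]

def expand(m):
    n = len(m)
    return [[sum(m[i][k] * m[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]

def inflate(m, r):
    return column_normalize([[v ** r for v in row] for row in m])

def mcl(adjacency, r=2.0, max_iter=100, tol=1e-6):
    n = len(adjacency)
    # Add self-loops, then normalize columns to get the Markov matrix.
    m = column_normalize([[adjacency[i][j] + (1 if i == j else 0)
                           for j in range(n)] for i in range(n)])
    for _ in range(max_iter):
        m2 = inflate(expand(m), r)
        change = max(abs(m2[i][j] - m[i][j])
                     for i in range(n) for j in range(n))
        m = m2
        if change < tol:
            break
    # Rows with surviving mass act as attractors; their support gives clusters.
    return [sorted(j for j in range(n) if m[i][j] > 1e-3)
            for i in range(n) if any(v > 1e-3 for v in m[i])]

# Two triangles (nodes 0-2 and 3-5) joined by the single edge 2-3.
adj = [[0, 1, 1, 0, 0, 0],
       [1, 0, 1, 0, 0, 0],
       [1, 1, 0, 1, 0, 0],
       [0, 0, 1, 0, 1, 1],
       [0, 0, 0, 1, 0, 1],
       [0, 0, 0, 1, 1, 0]]
print(mcl(adj))
```

On this graph the flow through the single bridging edge is progressively weakened by inflation, so the two triangles separate into distinct clusters.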
V. CLUSTERING RESULTS AND ANALYSIS
Fig. 4 shows some of the results of our cluster analysis, comparing the number of clusters against cluster sizes for the clusters output from the PPI1 (BioGRID) data set.
Figure (4): Comparison of MCL execution time for SpMV using the ELLPACK-R format
Figure (5): Speedup of MCL execution time for sparse-matrix operations using the ELLPACK-R format
Figure (6): Comparison of MCL execution time for SpMV using the ELLPACK-R format on the GPU, without data transfer
Figure (7): Comparison of MCL execution time on CPU and GPU using the ELLPACK-R format
VI. CONCLUSION AND FUTURE WORK
1) Since the clustering algorithm gives its best performance on small data sets, we will implement it on large data sets and on different GPU platforms and optimize its performance.
2) Our observations show that parallel computing is the most important mode of performance optimization.
3) For high performance, we will implement many clustering algorithms on different GPU platforms and determine which platform is best for a given application.
REFERENCES
[1] Pradeep Rai, Shubha Singh, “A Survey of Clustering
Techniques” in proceedings of the International
journal of Computer Applications (0975-8887)
Volume 7-No.12, October 2010.
[2] NVIDIA Corporation, NVIDIA CUDA Programming Guide, Version 3.1.1, July 2010.
[3] Osama Abu Abbas, “Comparison Between Data
Clustering Algorithm” in proceedings of the
International Arab Journal of Information Technology,
Vol. 5, No. 3, July 2008.
[4] A.K. Jain, M.N. Murty, P.J. Flynn, “Data Clustering:
A Review” ACM computing Surveys, Vol. 31, No. 3,
September 1999.
[5] Wenbin Fang, Mian Lu, Xiangye Xiao, Bingsheng He,
Qiong Luo, “Frequent Itemset Mining on Graphics
Processors” in proceedings of the Fifth International
Workshop on Data Management on New Hardware,
June 28, 2009.
[6] Balaji Dhanasekaran, Norman Rubin, “A new method
for GPU based Irregular Reductions and Its
Application to K-Means Clustering” in proceedings of
the International Conference on GPGPU-4, March
2011.
[7] Feng Cao, A K.H. Tung, A Zhou, “Scalable Clustering
Using Graphics Processors” School of Computing,
National University of Singapore, 2004.
[8] Ren Wu, Bin Zhang, Meichun Hsu, “Clustering
Billions of Data Points Using GPUs” UCHPC-
MAW’09, May 2009, Ischia, Italy.
[9] Wenjing Ma, Gagan Agrawal, “A Translation System
for Enabling Data Mining Applications On GPUs”,
ICS’09, June 2009, New York, USA, pp. 400-409.
[10] Hanaa M. Hussain, K Benkrid, A Erdogan, H Seker,
“Highly Parameterized K-means Clustering on FPGAs:
Comparative Results with GPPs and GPUs”, in
proceeding of the International Conference on
Reconfigurable Computing and FPGAs, IEEE 2011.
[11] Dar-Jen Chang, M Kantardzic, M Ouyang,
“Hierarchical clustering with CUDA/GPU”, UCHPC-
MAW, 2010.
[12] Alhadi Bustamam, Kevin Burrage, Nicholas A.
Hamilton, “Fast Parallel Markov Clustering in
Bioinformatics Using Massively Parallel Computing
on GPU with CUDA and ELLPACK-R Sparse
Format”, IEEE/ACM Transactions on Computational
Biology and Bioinformatics, Vol. 9, No. 3, May/June 2012.
