K-means++ Seeding Algorithm: Implementation in MLDemos

Renaud Richardet
Brain Mind Institute
Ecole Polytechnique Fédérale de Lausanne (EPFL), Switzerland
renaud.richardet@epfl.ch
K-means
•  K-means: a widely used clustering technique
•  Initialization: seeds are picked blindly at random from the input data
•  Drawback: very sensitive to the choice of initial cluster centers (seeds)
•  The local optimum found can be arbitrarily bad with respect to the objective function, compared to the globally optimal clustering
K-means++
•  A seeding technique for k-means, from Arthur and Vassilvitskii [2007]
•  Idea: spread the k initial cluster centers away from each other
•  O(log k)-competitive with the optimal clustering
•  Substantial convergence-time speedups (empirical)
Algorithm

Notation:
•  c ∈ C: cluster center
•  x ∈ X: data point
•  D(x): distance between x and the nearest cluster center that has already been chosen
Implementation
•  Based on Apache Commons Math’s KMeansPlusPlusClusterer and on Arthur’s [2007] implementation
•  Implemented directly in MLDemos’ core
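
The seeding rule these implementations follow, from Arthur and Vassilvitskii [2007], is: pick the first center uniformly at random from the data, then pick each subsequent center with probability proportional to D(x)². Below is a minimal C++ sketch of that seeding step; the names (kmeansppSeed, Point, sqDist) are illustrative, and this is not the actual MLDemos or Commons Math code.

    // Minimal sketch of k-means++ seeding: each new center is drawn with
    // probability proportional to D(x)^2, the squared distance to the
    // nearest center chosen so far. Illustrative only, not MLDemos code.
    #include <algorithm>
    #include <cstddef>
    #include <random>
    #include <vector>

    using Point = std::vector<double>;

    static double sqDist(const Point& a, const Point& b) {
        double s = 0.0;
        for (std::size_t i = 0; i < a.size(); ++i) {
            double d = a[i] - b[i];
            s += d * d;
        }
        return s;
    }

    // Returns k initial centers chosen from 'data' by the k-means++ rule.
    std::vector<Point> kmeansppSeed(const std::vector<Point>& data, int k, std::mt19937& rng) {
        std::vector<Point> centers;

        // Step 1: first center uniformly at random.
        std::uniform_int_distribution<std::size_t> uniform(0, data.size() - 1);
        centers.push_back(data[uniform(rng)]);

        // minDist[i] = squared distance from data[i] to its nearest chosen center.
        std::vector<double> minDist(data.size());
        for (std::size_t i = 0; i < data.size(); ++i)
            minDist[i] = sqDist(data[i], centers[0]);

        while (static_cast<int>(centers.size()) < k) {
            // Steps 2-3: draw a threshold r uniformly in [0, sum of squared distances).
            double distSqSum = 0.0;
            for (double d : minDist) distSqSum += d;
            std::uniform_real_distribution<double> pick(0.0, distSqSum);
            double r = pick(rng);

            // Walk the cumulative sum until it reaches the threshold.
            std::size_t chosen = data.size() - 1;
            double cum = 0.0;
            for (std::size_t i = 0; i < data.size(); ++i) {
                cum += minDist[i];
                if (cum >= r) { chosen = i; break; }
            }
            centers.push_back(data[chosen]);

            // Step 4: shrink minDist where the new center is closer.
            for (std::size_t i = 0; i < data.size(); ++i)
                minDist[i] = std::min(minDist[i], sqDist(data[i], centers.back()));
        }
        return centers;
    }

Points that are already centers have minDist = 0, so they can never be drawn again.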
Implementation Test Dataset: 4 squares (n=16)
Expected result: 4 clean clusters
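
The deck does not list the 16 points, but the coordinates visible in the sample output below all fit a grid of four 2×2 point squares, one per quadrant. A plausible reconstruction follows; the four points not shown in the output excerpt are an assumption.

    // A plausible reconstruction of the 4-squares test set (n=16): one small
    // square of four points per quadrant. The twelve coordinates visible in
    // the sample output fit this grid; the remaining four are an assumption.
    #include <vector>

    using Point = std::vector<double>;

    std::vector<Point> fourSquares() {
        std::vector<Point> data;
        const double offsets[2] = {1.0, 2.0};
        const double signs[2]   = {-1.0, 1.0};
        for (double sx : signs)
            for (double sy : signs)
                for (double ox : offsets)
                    for (double oy : offsets)
                        data.push_back({sx * ox, sy * oy});
        return data;  // 16 points, 4 per quadrant
    }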
Sample Output

1: first cluster center 0 at rand: x=4 [-2.0; 2.0]
1: initial minDist for 0  [-1.0;-1.0] = 10.0
1: initial minDist for 1  [ 2.0; 1.0] = 17.0
1: initial minDist for 2  [ 1.0;-1.0] = 18.0
1: initial minDist for 3  [-1.0;-2.0] = 17.0
1: initial minDist for 5  [ 2.0; 2.0] = 16.0
1: initial minDist for 6  [ 2.0;-2.0] = 32.0
1: initial minDist for 7  [-1.0; 2.0] =  1.0
1: initial minDist for 8  [-2.0;-2.0] = 16.0
1: initial minDist for 9  [ 1.0; 1.0] = 10.0
1: initial minDist for 10 [ 2.0;-1.0] = 25.0
1: initial minDist for 11 [-2.0;-1.0] =  9.0
[…]
2: picking cluster center 1 --------------
3:   distSqSum=3345.0
3:   random index 1532.706909
4: new cluster point: x=6 [2.0;-2.0]
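
The step numbers in the log presumably correspond to the seeding phases: 1 = initial squared distances to the first (random) center, 2 = start of a new center selection, 3 = the total squared distance and a uniform draw from [0, distSqSum), 4 = the chosen point and the subsequent distance updates. Under that reading, the "random index" 1532.7 selects x=6 because it is the first point whose cumulative minDist reaches the draw. A small C++ sketch of that selection step (again illustrative, not the MLDemos code):

    // Sketch of the weighted pick behind the "3: random index" line: r is a
    // uniform draw in [0, distSqSum); the next center is the first point
    // whose cumulative minDist reaches r. Points already chosen contribute 0.
    #include <cstddef>
    #include <vector>

    std::size_t pickIndex(const std::vector<double>& minDist, double r) {
        double cum = 0.0;
        for (std::size_t i = 0; i < minDist.size(); ++i) {
            cum += minDist[i];
            if (cum >= r) return i;
        }
        return minDist.size() - 1;  // guard against floating-point rounding
    }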
  	
  
Sample Output (2)

4:   updating minDist for 0  [-1.0;-1.0] = 10.0
4:   updating minDist for 1  [ 2.0; 1.0] =  9.0
4:   updating minDist for 2  [ 1.0;-1.0] =  2.0
4:   updating minDist for 3  [-1.0;-2.0] =  9.0
4:   updating minDist for 5  [ 2.0; 2.0] = 16.0
4:   updating minDist for 7  [-1.0; 2.0] = 25.0
4:   updating minDist for 8  [-2.0;-2.0] = 16.0
4:   updating minDist for 9  [ 1.0; 1.0] = 10.0
4:   updating minDist for 10 [ 2.0;-1.0] =  1.0
4:   updating minDist for 11 [-2.0;-1.0] = 17.0
[…]
2: picking cluster center 2 -----------------
3:   distSqSum=961.0
3:   random index 103.404701
4:   new cluster point: x=1 [2.0;1.0]
4:   updating minDist for 0  [-1.0;-1.0] = 13.0
[…]
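
A note on reading these lines: the value printed by "updating minDist" appears to be the squared Euclidean distance to the newly chosen center rather than the running minimum. For example, after the third center x=1 at [2.0; 1.0], point 0 at [-1.0; -1.0] gets

    (-1 - 2)^2 + (-1 - 1)^2 = 9 + 4 = 13.0

which matches the last line above.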
  
Evaluation on Test Dataset
•  200 clustering runs, each performed with and without k-means++ initialization
•  Measured RSS (intra-cluster variance), as sketched below
•  K-means reached the optimal clustering 115 times (57.5%)
•  K-means++ reached the optimal clustering 182 times (91%)
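
RSS here is the residual sum of squares: the sum over all points of the squared distance to their assigned cluster center, i.e. the k-means objective. A minimal sketch of computing it after a run (illustrative names, not the MLDemos code):

    // Residual sum of squares (RSS), the quantity recorded after each run:
    // sum of squared distances from every point to its assigned center.
    #include <cstddef>
    #include <vector>

    using Point = std::vector<double>;

    // assignment[i] is the index in 'centers' of the cluster data[i] belongs to.
    double rss(const std::vector<Point>& data,
               const std::vector<Point>& centers,
               const std::vector<int>& assignment) {
        double total = 0.0;
        for (std::size_t i = 0; i < data.size(); ++i) {
            const Point& c = centers[assignment[i]];
            for (std::size_t d = 0; d < data[i].size(); ++d) {
                double diff = data[i][d] - c[d];
                total += diff * diff;
            }
        }
        return total;
    }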
[Figure] Comparison of the frequency distribution of RSS values between k-means and k-means++ on the evaluation dataset (n=200).
Evaluation on Real Dataset
•  UCI’s Water Treatment Plant data set: daily sensor measurements from an urban waste water treatment plant (n=396, d=38)
•  Sampled 500 clustering runs each for k-means and for k-means++ with k=13, and recorded the RSS
•  The difference between the two RSS distributions is highly significant (P < 0.0001); one way such a comparison can be made is sketched below
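
The deck does not name the significance test used. One plausible choice, assumed here purely for illustration, is a two-sample comparison of the recorded RSS values, e.g. Welch's t statistic:

    // One plausible way to compare the two RSS samples (the test actually
    // used is not stated in the deck): Welch's two-sample t statistic.
    #include <cmath>
    #include <vector>

    static double mean(const std::vector<double>& v) {
        double s = 0.0;
        for (double x : v) s += x;
        return s / v.size();
    }

    static double sampleVariance(const std::vector<double>& v, double m) {
        double s = 0.0;
        for (double x : v) s += (x - m) * (x - m);
        return s / (v.size() - 1);  // unbiased estimator
    }

    // t statistic for two independent samples with unequal variances.
    double welchT(const std::vector<double>& a, const std::vector<double>& b) {
        double ma = mean(a), mb = mean(b);
        double va = sampleVariance(a, ma), vb = sampleVariance(b, mb);
        return (ma - mb) / std::sqrt(va / a.size() + vb / b.size());
    }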
[Figure] Comparison of the frequency distribution of RSS values between k-means and k-means++ on the UCI real-world dataset (n=500).
Alternative Seeding Algorithms
•  There is extensive research into seeding techniques for k-means.
•  Steinley [2007] evaluated 12 different techniques (omitting k-means++) and recommends multiple random starting points for general use.
•  Maitra [2011] evaluated 11 techniques (including k-means++) but was unable to provide recommendations when evaluating nine standard real-world datasets.
•  On simulated datasets, Maitra recommends Milligan’s [1980] or Mirkin’s [2005] seeding technique, and Bradley’s [1998] when the dataset is very large.
Conclusions and Future Work
•  Using a synthetic test dataset and a real-world dataset, we showed that our implementation of the k-means++ seeding procedure in the MLDemos software package yields a significant reduction in RSS.
•  A short literature survey revealed that many seeding procedures exist for k-means, and that some alternatives to k-means++ might yield even larger improvements.
References
•  Arthur, D. & Vassilvitskii, S.: “k-means++: The advantages of careful seeding”. Proceedings of the Eighteenth Annual ACM-SIAM Symposium on Discrete Algorithms, 1027–1035 (2007).
•  Bahmani, B., Moseley, B., Vattani, A., Kumar, R. & Vassilvitskii, S.: “Scalable k-means++”. Unpublished working paper, available at http://theory.stanford.edu/~sergei/papers/vldb12-kmpar.pdf (2012).
•  Bradley, P. S. & Fayyad, U. M.: “Refining initial points for k-means clustering”. Proc. 15th International Conf. on Machine Learning, 91–99 (1998).
•  Maitra, R., Peterson, A. D. & Ghosh, A. P.: “A systematic evaluation of different methods for initializing the K-means clustering algorithm”. Unpublished working paper, available at http://apghosh.public.iastate.edu/files/IEEEclust2.pdf (2011).
•  Milligan, G. W.: “The validation of four ultrametric clustering algorithms”. Pattern Recognition, vol. 12, 41–50 (1980).
•  Mirkin, B.: “Clustering for data mining: A data recovery approach”. Chapman and Hall (2005).
•  Steinley, D. & Brusco, M. J.: “Initializing k-means batch clustering: A critical evaluation of several techniques”. Journal of Classification 24, 99–121 (2007).
