
How do I get cluster labels when I use Spark's mllib in pyspark? In sklearn, this can be done easily by

kmeans = MiniBatchKMeans(n_clusters=k, random_state=1)
temp = kmeans.fit(data)
cluster_labels = temp.labels_

In mllib, I run kmeans as :

temp = KMeans.train(data, k, maxIterations=10, runs=10, initializationMode="random")

This returns a KMeansModel object. This class doesn't have any equivalent of sklearn's labels_.

I am unable to figure out how to get the labels in mllib's KMeans.

2 Answers


This is an old question. However, that was then and this is now: as of pyspark 2.2, KMeans has no train method and the model has no predict method. The correct way to get the labels is:

kmeans = KMeans().setK(k).setSeed(1)
model = kmeans.fit(data)
prediction = model.transform(data).select('prediction').collect()
labels = [p.prediction for p in prediction]

2 Comments

You are referring to the newest Spark ML (dataframe-based API), while the question was about the older MLlib, still available as the RDD-based API.
Yes, you are correct. It's just that I was looking myself for a way to get the labels, only found this post, which wasn't helpful because I am using the dataframe-based API, figured it out and decided to share with the world. And I didn't want to create another question that would be branded as "duplicate", so I put it here.

Just use predict on the training data:

temp.predict(data)

or

parsedData.map(temp.predict)

