
How do I get cluster labels when I use Spark's mllib in pyspark? In sklearn, this can be done easily by

kmeans = MiniBatchKMeans(n_clusters=k, random_state=1)
temp = kmeans.fit(data)
cluster_labels = temp.labels_

In mllib, I run kmeans as :

temp = KMeans.train(data, k, maxIterations=10, runs=10, initializationMode="random")

This returns a KMeansModel object. This class doesn't have any equivalent of sklearn's labels_.

I am unable to figure out how to get the labels in mllib's KMeans.

2 Answers


This is an old question. However, that was then and this is now: as of pyspark 2.2, KMeans has no train method and the model has no predict method. The correct way to get the labels is:

kmeans = KMeans().setK(k).setSeed(1)
model = kmeans.fit(data)
prediction = model.transform(data).select('prediction').collect()
labels = [p.prediction for p in prediction]

2 Comments

You are referring to the newest Spark ML (dataframe-based API), while the question was about the older MLlib, still available as the RDD-based API.
Yes, you are correct. It's just that I was looking myself for a way to get the labels, only found this post, which wasn't helpful because I am using the dataframe-based API, figured it out and decided to share with the world. And I didn't want to create another question that would be branded as "duplicate", so I put it here.

Just use predict on the training data:

temp.predict(data)

or

parsedData.map(temp.predict)

