Spark Clustering K-mean

Solved

Contributor

Hello,

Can you please explain what kind of data I get when I use Spark clustering from MLlib, like the following:

KMeansModel clusters = KMeans.train(parsedData.rdd(), numClusters, numIterations);
1 ACCEPTED SOLUTION

Re: Spark Clustering K-mean

The result is a Java object (a KMeansModel) that contains the cluster information:

Clusters

Centers

Statistics

...

If you want to work with it, you either need to use the Spark MLlib library to do extraction, scoring, etc., or you can export many of these models as PMML, an XML-based standard for clustering models that is understood by many data mining tools:

kmeansModel.toPMML("/path/to/kmeans.xml")

https://databricks.com/blog/2015/07/02/pmml-support-in-apache-spark-mllib.html

Not all MLlib models support PMML, though.

6 REPLIES

Re: Spark Clustering K-mean

Contributor

@Benjamin Leonhardi, thank you for your answer.

Can you please tell me how to extract the cluster information as a List<Integer>, where the list contains the coordinates of the clustered data?

Re: Spark Clustering K-mean

The class provides the method clusterCenters:

public Vector[] clusterCenters()

Each Vector is a cluster center (a point). Or, as said, export it to PMML.
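If the goal is to get the center coordinates into a flat list, a minimal plain-Java sketch (no Spark; a double[][] stands in here for the Vector[] you would get from clusterCenters(), e.g. via Vector.toArray() on each element):

```java
import java.util.ArrayList;
import java.util.List;

public class CenterCoordinates {
    // Flatten cluster centers into one list of coordinates.
    // double[][] stands in for Spark's Vector[] in this sketch.
    static List<Double> flattenCenters(double[][] centers) {
        List<Double> coords = new ArrayList<>();
        for (double[] center : centers) {
            for (double c : center) {
                coords.add(c);
            }
        }
        return coords;
    }

    public static void main(String[] args) {
        double[][] centers = { {1.0, 2.0}, {5.0, 6.0} };
        System.out.println(flattenCenters(centers)); // [1.0, 2.0, 5.0, 6.0]
    }
}
```

Note the centers are doubles, not integers, so a List<Integer> would lose precision unless you round deliberately.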

Re: Spark Clustering K-mean

If you want to know, for each of your input points, which cluster it belongs to, you need to use the predict method.
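Conceptually, predict just returns the index of the closest cluster center. A minimal plain-Java sketch of that idea (no Spark; the class and method names here are illustrative, not MLlib's):

```java
public class NearestCenter {
    // Sketch of what KMeansModel.predict does for a single point:
    // return the index of the nearest center by squared Euclidean distance.
    static int predict(double[] point, double[][] centers) {
        int best = 0;
        double bestDist = Double.MAX_VALUE;
        for (int i = 0; i < centers.length; i++) {
            double d = 0.0;
            for (int j = 0; j < point.length; j++) {
                double diff = point[j] - centers[i][j];
                d += diff * diff;
            }
            if (d < bestDist) {
                bestDist = d;
                best = i;
            }
        }
        return best;
    }

    public static void main(String[] args) {
        double[][] centers = { {0.0, 0.0}, {10.0, 10.0} };
        System.out.println(predict(new double[]{1.0, 1.0}, centers)); // 0
        System.out.println(predict(new double[]{9.0, 8.0}, centers)); // 1
    }
}
```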

Re: Spark Clustering K-mean

Contributor

@Benjamin Leonhardi

This is what I think I can do:

KMeansModel clusters = KMeans.train(parsedData.rdd(), numClusters, numIterations);
JavaRDD<Integer> clusterPoints = clusters.predict(parsedData);
List<Integer> list = clusterPoints.collect();

Re: Spark Clustering K-mean

I think you should look at the Spark RDD programming introduction. What you get is an RDD of integers. You can then use Spark functions like map/foreach etc. to work with it. So the question is what you actually want to do. Why do you want a List? You can do rdd.collect() to get it all in one big array on your driver, but that is most likely not what you actually want.

http://spark.apache.org/docs/latest/programming-guide.html

I.e., clusterPoints.collect() will give you an array of points in your local driver. However, it downloads all results to the driver and no longer runs in parallel. If that works with your data volumes, great, but normally you should use Spark functions like map etc. to do the computation in parallel.
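The lazy-transformation-versus-materialization distinction can be illustrated with plain Java streams (an analogy only, not Spark): map() just describes a per-element transformation, while the terminal collect() is what materializes everything into one in-memory list, much like RDD.collect() pulling all results to the driver.

```java
import java.util.List;
import java.util.stream.Collectors;

public class LazyVsCollect {
    // map() is lazy and works element by element; collect() materializes
    // the whole result into one in-memory list (the analogue of
    // RDD.collect() bringing everything to the driver).
    static List<String> label(List<Integer> clusterIds) {
        return clusterIds.stream()
                .map(id -> "cluster-" + id)
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        System.out.println(label(List.of(0, 1, 1, 0)));
        // [cluster-0, cluster-1, cluster-1, cluster-0]
    }
}
```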

Below is a scoring example that scores point by point, so you could do other things in this function as well, whatever you want to do with the information, essentially.

http://blog.sequenceiq.com/blog/2014/07/31/spark-mllib/

val clusters: KMeansModel = KMeans.train(data, K, maxIteration, runs)
val vectorsAndClusterIdx = data.map { point =>
  val prediction = clusters.predict(point)
  (point.toString, prediction)
}