Spark Clustering K-mean

Rising Star

Hello,

Can you please explain what kind of data I get back when I use Spark clustering from MLlib, as in the following?

KMeansModel clusters = KMeans.train(parsedData.rdd(), numClusters, numIterations);
1 ACCEPTED SOLUTION

Master Guru

What you get back is a KMeansModel, an object that contains the cluster information:

Clusters

Centers

Statistic

...

To work with it, you can either use the Spark MLlib library itself for extraction, scoring, and so on, or export the model as PMML, an XML-based standard for data mining models (clustering models included) that many data mining tools understand. Many MLlib models can be exported this way:

kmeansModel.toPMML("/path/to/kmeans.xml")

https://databricks.com/blog/2015/07/02/pmml-support-in-apache-spark-mllib.html

Not all MLlib models support PMML, though.

Rising Star

@Benjamin Leonhardi thank you for your answer.

Can you please tell me how to extract the cluster information as a List<Integer>, where the list contains the coordinates of the clustered data?

Master Guru

The class provides the method clusterCenters:

public Vector[] clusterCenters()

Each Vector holds the coordinates of one cluster center. Or, as mentioned, export the model to PMML.

Master Guru

If you want to know which cluster each of your input points belongs to, you need to use the predict method.
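
To make concrete what a k-means prediction returns, here is a minimal plain-Java sketch of the idea: a point is given the index of the closest cluster center. It is an illustration only and does not use Spark's classes. The class name NearestCenterSketch, the array-based representation, and the use of squared Euclidean distance are assumptions made for this example.

```java
// Plain-Java illustration only; it does not use or reproduce Spark's classes.
final class NearestCenterSketch {

    // Squared Euclidean distance between two points of equal dimension.
    static double squaredDistance(double[] a, double[] b) {
        double sum = 0.0;
        for (int i = 0; i < a.length; i++) {
            double d = a[i] - b[i];
            sum += d * d;
        }
        return sum;
    }

    // Index of the center closest to the point (ties go to the lowest index).
    static int predict(double[][] centers, double[] point) {
        int best = 0;
        double bestDist = Double.POSITIVE_INFINITY;
        for (int i = 0; i < centers.length; i++) {
            double dist = squaredDistance(centers[i], point);
            if (dist < bestDist) {
                bestDist = dist;
                best = i;
            }
        }
        return best;
    }

    public static void main(String[] args) {
        // Hypothetical centers, standing in for the values of clusterCenters().
        double[][] centers = { {0.0, 0.0}, {10.0, 10.0} };
        System.out.println(predict(centers, new double[] {1.0, 2.0}));  // prints 0
        System.out.println(predict(centers, new double[] {9.0, 8.0}));  // prints 1
    }
}
```

So the integer you get per point is a cluster index into the array of centers, not a coordinate.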

Rising Star

@Benjamin Leonhardi

This is what I think I can do:

KMeansModel clusters = KMeans.train(parsedData.rdd(), numClusters, numIterations);

JavaRDD<Integer> clusterPoints = clusters.predict(parsedData);
List<Integer> list = clusterPoints.toArray();

Master Guru

I think you should look at the Spark RDD programming guide. What you get is an RDD of integers, one cluster index per input point. You can then use Spark functions like map, foreach, etc. to process it. So the real question is what you actually want to do: why do you want a List? You can call rdd.collect() to pull everything into one big collection on your driver, but that is most likely not what you actually want.

http://spark.apache.org/docs/latest/programming-guide.html

For example, clusterPoints.collect() gives you the cluster indices for all points on your local driver (as a List<Integer> in the Java API). However, it downloads all results to the driver, and anything you do with them afterwards no longer runs in parallel. If that works with your data volumes, great. But normally you should use Spark functions like map to keep the computation parallel.
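
If the result really is small enough to collect, the list of cluster indices can be processed with ordinary Java on the driver. The sketch below counts how many points landed in each cluster. It is plain Java with no Spark dependency; the class name ClusterSizesSketch and the sample list (standing in for the output of collect()) are hypothetical.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Plain-Java illustration only; the input stands in for a collected result.
final class ClusterSizesSketch {

    // Counts how many points were assigned to each cluster index.
    static Map<Integer, Integer> clusterSizes(List<Integer> assignments) {
        Map<Integer, Integer> sizes = new TreeMap<>();
        for (Integer cluster : assignments) {
            sizes.merge(cluster, 1, Integer::sum);
        }
        return sizes;
    }

    public static void main(String[] args) {
        // Hypothetical collected result: one cluster index per input point.
        List<Integer> collected = Arrays.asList(0, 1, 1, 0, 2, 1);
        System.out.println(clusterSizes(collected));  // prints {0=2, 1=3, 2=1}
    }
}
```

This runs on a single machine, so it only makes sense once the data has been reduced to something that fits comfortably in driver memory.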

Below is an example that scores the data point by point. You could do other things inside this function as well, essentially whatever you want to do with the information.

http://blog.sequenceiq.com/blog/2014/07/31/spark-mllib/

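import org.apache.spark.mllib.clustering.{KMeans, KMeansModel}

// data is assumed to be an RDD[Vector]; K, maxIteration and runs are assumed to be defined earlier
val clusters: KMeansModel = KMeans.train(data, K, maxIteration, runs)
// Pair every input point with the index of the cluster it is assigned to
val vectorsAndClusterIdx = data.map { point =>
  val prediction = clusters.predict(point)
  (point.toString, prediction)
}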