Created 07-20-2016 12:44 PM
Hello,
Can you please explain what kind of data I get when I use Spark clustering from MLlib, like the following:
KMeansModel clusters = KMeans.train(parsedData.rdd(), numClusters, numIterations);
Created 07-20-2016 01:00 PM
The result is a Java class that contains the cluster information:
Clusters
Centers
Statistics
...
If you want to work with it, you either need to use the Spark MLlib library to do extraction/scoring etc., OR you can export many of these models as PMML, an XML-based standard for clustering models that is understood by many data mining tools:
kmeansModel.toPMML("/path/to/kmeans.xml")
https://databricks.com/blog/2015/07/02/pmml-support-in-apache-spark-mllib.html
Not all MLlib models support PMML though.
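Once exported, the PMML file is plain XML, so any XML tooling can read it. Below is a rough, self-contained sketch of pulling cluster centers out of such a file with the JDK's built-in DOM parser. The inline sample XML is a made-up stand-in for illustration; the actual file Spark writes contains more metadata around these elements, so check the real export before relying on this structure.

```java
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;

public class PmmlSketch {
    // Minimal stand-in for what a k-means PMML export may look like;
    // the real file Spark writes wraps these elements in more metadata.
    static final String SAMPLE =
        "<PMML><ClusteringModel>" +
        "<Cluster><Array type=\"real\">1.0 2.0</Array></Cluster>" +
        "<Cluster><Array type=\"real\">9.0 8.0</Array></Cluster>" +
        "</ClusteringModel></PMML>";

    // Pull each cluster center out of the XML as a double[].
    static double[][] extractCenters(String xml) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
            .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
        NodeList arrays = doc.getElementsByTagName("Array");
        double[][] centers = new double[arrays.getLength()][];
        for (int i = 0; i < arrays.getLength(); i++) {
            String[] parts = arrays.item(i).getTextContent().trim().split("\\s+");
            centers[i] = new double[parts.length];
            for (int j = 0; j < parts.length; j++) {
                centers[i][j] = Double.parseDouble(parts[j]);
            }
        }
        return centers;
    }

    public static void main(String[] args) throws Exception {
        for (double[] center : extractCenters(SAMPLE)) {
            System.out.println(java.util.Arrays.toString(center));
        }
    }
}
```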
Created 07-20-2016 02:31 PM
@Benjamin Leonhardi thank you for your answer.
Can you please tell me how to extract the cluster information as a List<Integer>, where the list contains the coordinates of the clustered data?
Created 07-20-2016 02:34 PM
The class provides the method clusterCenters:
public Vector[] clusterCenters()
Each Vector is a cluster center, i.e. a point in feature space. Or, as said, export it to PMML.
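Since the question asked for the coordinates as Java lists: each center Vector can be turned into a double[] via its toArray() method, and from there into whatever collection you need. A minimal sketch, using plain double arrays as stand-ins for MLlib Vectors and made-up sample values:

```java
import java.util.ArrayList;
import java.util.List;

public class CentersSketch {
    // Turn an array of centers into one coordinate list per cluster.
    // In real code, each row would come from center.toArray() on an
    // org.apache.spark.mllib.linalg.Vector returned by clusterCenters().
    static List<List<Double>> toCoordinateLists(double[][] centers) {
        List<List<Double>> coordinates = new ArrayList<>();
        for (double[] center : centers) {
            List<Double> point = new ArrayList<>();
            for (double d : center) {
                point.add(d);
            }
            coordinates.add(point);
        }
        return coordinates;
    }

    public static void main(String[] args) {
        double[][] centers = { {0.5, 1.5}, {4.0, 3.0} };  // made-up values
        System.out.println(toCoordinateLists(centers));
    }
}
```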
Created 07-20-2016 02:35 PM
If you want to know which of your input points belongs to which cluster, you need to use the predict method.
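Conceptually, predict just assigns a point to the nearest cluster center by squared Euclidean distance. A self-contained sketch of that logic, with plain arrays instead of MLlib Vectors and sample centers made up for illustration:

```java
public class NearestCenterSketch {
    // Index of the closest center by squared Euclidean distance,
    // roughly what KMeansModel.predict does for each input point.
    static int nearestCenter(double[] point, double[][] centers) {
        int best = 0;
        double bestDist = Double.MAX_VALUE;
        for (int i = 0; i < centers.length; i++) {
            double dist = 0.0;
            for (int j = 0; j < point.length; j++) {
                double diff = point[j] - centers[i][j];
                dist += diff * diff;
            }
            if (dist < bestDist) {
                bestDist = dist;
                best = i;
            }
        }
        return best;
    }

    public static void main(String[] args) {
        double[][] centers = { {0.0, 0.0}, {10.0, 10.0} };  // made-up centers
        System.out.println(nearestCenter(new double[] {1.0, 2.0}, centers));  // 0
        System.out.println(nearestCenter(new double[] {9.0, 8.0}, centers));  // 1
    }
}
```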
Created 07-20-2016 02:42 PM
This is what I think I can do:
KMeansModel clusters = KMeans.train(parsedData.rdd(), numClusters, numIterations);
JavaRDD<Integer> clusterPoints = clusters.predict(parsedData);
List<Integer> list = clusterPoints.collect();
Created 07-20-2016 02:49 PM
I think you should look at the Spark RDD programming introduction. What you get is an RDD of integers. You can then use Spark functions like map/foreach etc. to work with it. So the question is what you actually want to do. Why do you want a List? You can call rdd.collect() to get it all in one big array on your driver, but that is most likely not what you actually want.
http://spark.apache.org/docs/latest/programming-guide.html
I.e. clusterPoints.collect() will give you an array of points in your local driver. However, it downloads all results to the driver and no longer runs in parallel. If that works with your data volumes, great. But normally you should use Spark functions like map to do computations in parallel.
Below is a scoring example that scores point by point, so you could do other things in this function as well, whatever you want to do with the information.
http://blog.sequenceiq.com/blog/2014/07/31/spark-mllib/
val clusters: KMeansModel = KMeans.train(data, K, maxIteration, runs)
val vectorsAndClusterIdx = data.map { point =>
  val prediction = clusters.predict(point)
  (point.toString, prediction)
}