Created 05-13-2016 11:58 AM
I am working on my final project with the Hortonworks Data Platform, and I have to run the K-Means algorithm from Apache Mahout in these different ways to analyze the results. Can you help me?
Thank you very much.
You can use K-Means in Spark, Mahout or Pig, all of which are part of the Hortonworks Data Platform. However, I strongly recommend using Spark in this case.
For Spark examples please visit http://spark.apache.org/docs/latest/mllib-clustering.html
K-Means for MapReduce in Mahout looks deprecated, but examples are still available at https://mahout.apache.org/users/clustering/k-means-clustering.html
There is an old blog post explaining the principles of running K-Means as a UDF in Pig at http://hortonworks.com/blog/new-apache-pig-features-part-2-embedding/. However, you need to supply the kmeans.py function yourself, as Pig does not include K-Means or any other machine learning algorithms. I also found a presentation on how to use Pig for this at http://events.linuxfoundation.org/sites/events/files/slides/Pig_for_DataScience_0.pdf
I think you should definitely try Spark first. It has a thriving machine learning library (choose the algorithms in spark.ml rather than spark.mllib; the latter is not going to be actively developed in new Spark versions).
Because these clustering algorithms are iterative, they run much faster in Spark, which can keep the intermediate results in memory (in RDDs) between iterations.
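To give an idea of what the spark.ml route could look like, here is a minimal PySpark sketch. It assumes Spark 2.0 or later (for `SparkSession`), and the sample points, the column names `x` and `y`, and the function name `cluster_with_spark_ml` are made up for illustration. Treat it as a starting point rather than a tuned job.

```python
# Hypothetical sample data: two well-separated groups of 2-D points.
SAMPLE_POINTS = [
    (1.0, 1.0), (1.5, 2.0), (1.0, 0.5),
    (8.0, 8.0), (9.0, 9.5), (8.5, 9.0),
]


def cluster_with_spark_ml(points, k=2, max_iter=20, seed=1):
    """Fit K-Means with the DataFrame-based spark.ml API and return the centers."""
    # Imported here so this file can be loaded on a machine without PySpark.
    from pyspark.sql import SparkSession
    from pyspark.ml.clustering import KMeans
    from pyspark.ml.feature import VectorAssembler

    spark = SparkSession.builder.appName("kmeans-sketch").getOrCreate()
    try:
        df = spark.createDataFrame(points, ["x", "y"])
        # spark.ml estimators expect the features in a single vector column.
        assembler = VectorAssembler(inputCols=["x", "y"], outputCol="features")
        # cache() asks Spark to keep the data in memory across iterations.
        features = assembler.transform(df).cache()
        kmeans = KMeans(k=k, maxIter=max_iter, seed=seed, featuresCol="features")
        model = kmeans.fit(features)
        return [center.tolist() for center in model.clusterCenters()]
    finally:
        spark.stop()
```

Calling `cluster_with_spark_ml(SAMPLE_POINTS)` on a machine with PySpark installed should return one center per cluster. On a real cluster you would read the points from HDFS or Hive instead of a Python list.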
MapReduce (and Pig to a certain extent, since Pig translates to MR, or to Tez, which has the same drawback in this case) has to write the intermediate data to disk, resulting in slower execution.
A single MapReduce or Pig job only runs one iteration of the K-Means algorithm. You then have to develop the rest of the logic yourself in your driver: looping over iterations, passing the updated centroids to the next job, and checking for convergence. That is quite some work.
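To make the split between "one iteration" and "driver logic" concrete, here is a small plain-Python sketch. It is not Hadoop or Pig code; the function names (`map_phase`, `reduce_phase`, `kmeans_driver`) are made up, and everything runs in one process. In a real job, each pass through the driver loop would launch a new MapReduce job and read and write the centroids on disk.

```python
from collections import defaultdict
import math


def nearest(point, centroids):
    """Index of the centroid closest to point (Euclidean distance)."""
    return min(range(len(centroids)), key=lambda i: math.dist(point, centroids[i]))


def map_phase(points, centroids):
    """Map: emit (centroid index, point) for every input point."""
    for p in points:
        yield nearest(p, centroids), p


def reduce_phase(pairs, centroids):
    """Reduce: average the points assigned to each centroid."""
    groups = defaultdict(list)
    for idx, p in pairs:
        groups[idx].append(p)
    new_centroids = list(centroids)  # a centroid with no points stays where it is
    for idx, members in groups.items():
        new_centroids[idx] = tuple(sum(c) / len(members) for c in zip(*members))
    return new_centroids


def kmeans_driver(points, centroids, max_iter=20, tol=1e-6):
    """Driver: repeat map + reduce until the centroids stop moving."""
    for _ in range(max_iter):
        updated = reduce_phase(map_phase(points, centroids), centroids)
        shift = max(math.dist(a, b) for a, b in zip(centroids, updated))
        centroids = updated
        if shift < tol:
            break
    return centroids
```

The `map_phase` and `reduce_phase` pair is what a single job gives you. The `kmeans_driver` loop is the part you would have to write and orchestrate yourself.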
You could also have a look at Mahout. I looked at it in the past (4 years ago), but I did not really like its KMeans implementation and I ended up developing everything from scratch in MapReduce. What is more, it seems that the KMeans algorithm is now deprecated: