Support Questions
Find answers, ask questions, and share your expertise

Apache Mahout K-Means Algorithm with Map-Reduce, Apache Spark and Apache Pig in Hortonworks Data Platform


New Contributor

Hello,

I am working on my final project with the Hortonworks Data Platform, and I have to run the K-Means algorithm from Apache Mahout in these different ways in order to compare the results. Can you help me?

Thank you very much.

2 REPLIES

Re: Apache Mahout K-Means Algorithm with Map-Reduce, Apache Spark and Apache Pig in Hortonworks Data Platform

Expert Contributor

You can use K-Means with Spark, Mahout, or Pig, all of which are part of the Hortonworks Data Platform. However, I strongly recommend using Spark in this case.

For Spark examples please visit http://spark.apache.org/docs/latest/mllib-clustering.html

K-Means for MapReduce in Mahout appears to be deprecated, but examples are still available at https://mahout.apache.org/users/clustering/k-means-clustering.html

There is an old blog post explaining the principles of running K-Means as a UDF in Pig at http://hortonworks.com/blog/new-apache-pig-features-part-2-embedding/ - However, you need to supply the kmeans.py function yourself, as Pig does not include K-Means or any other machine learning algorithms. I also found a presentation on using Pig for this at http://events.linuxfoundation.org/sites/events/files/slides/Pig_for_DataScience_0.pdf
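To give an idea of what such a supplied function involves (this is my own sketch of one K-Means step in plain Python, not the blog post's actual kmeans.py):

```python
import math

def kmeans_step(points, centroids):
    """One K-Means iteration: assign each point to its nearest
    centroid, then recompute each centroid as the mean of its cluster."""
    clusters = [[] for _ in centroids]
    for p in points:
        distances = [math.dist(p, c) for c in centroids]
        clusters[distances.index(min(distances))].append(p)
    # Recompute centroids; keep the old centroid if a cluster is empty.
    return [
        tuple(sum(dim) / len(cluster) for dim in zip(*cluster)) if cluster else c
        for cluster, c in zip(clusters, centroids)
    ]
```

The blog's embedding approach then calls a step like this repeatedly from the embedding script until the centroids stabilize.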


Re: Apache Mahout K-Means Algorithm with Map-Reduce, Apache Spark and Apache Pig in Hortonworks Data Platform

Expert Contributor

I think you should definitely try Spark first, which has a thriving machine learning library (choose the algorithms in spark.ml rather than spark.mllib; the latter is no longer actively developed in newer Spark versions):

http://spark.apache.org/docs/latest/ml-clustering.html#k-means

Because those clustering algorithms are iterative, they run much faster in Spark, which can keep the intermediate results in memory (in RDDs).

MapReduce (and, to a certain extent, Pig, since Pig translates to MapReduce, or to Tez, which has the same drawback in this case) has to write the intermediate data to disk, resulting in slower executions.

With a MapReduce or Pig job you will only be able to launch one iteration of your K-Means algorithm at a time; you will then have to develop the whole iteration logic yourself in your driver. Quite some work.
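To see how much driver logic that implies, here is a plain-Python sketch of the full loop (names and defaults are my own; in a real MapReduce setup, each pass over the points would be a separate job reading from and writing centroids to HDFS):

```python
import math

def lloyd_kmeans(points, initial_centroids, max_iters=20, tol=1e-6):
    """K-Means (Lloyd's algorithm). Each pass over the data corresponds
    to one MapReduce/Pig job; the surrounding loop and the convergence
    test are the driver logic you would have to write yourself."""
    centroids = list(initial_centroids)
    k = len(centroids)
    for _ in range(max_iters):
        # "Map" side: assign each point to its nearest centroid.
        clusters = [[] for _ in range(k)]
        for p in points:
            idx = min(range(k), key=lambda i: math.dist(p, centroids[i]))
            clusters[idx].append(p)
        # "Reduce" side: recompute each centroid as its cluster's mean,
        # keeping the old centroid if a cluster ends up empty.
        new_centroids = [
            tuple(sum(dim) / len(c) for dim in zip(*c)) if c else centroids[i]
            for i, c in enumerate(clusters)
        ]
        # Driver: stop once the centroids barely move between iterations.
        if all(math.dist(a, b) < tol for a, b in zip(centroids, new_centroids)):
            break
        centroids = new_centroids
    return centroids
```

Spark hides exactly this loop behind a single fit call and keeps the data cached between passes, which is where the speedup comes from.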

You could also have a look at Mahout. I looked at it in the past (4 years ago), but I did not really like its K-Means implementation, and I ended up developing everything from scratch in MapReduce. What is more, it seems that the K-Means algorithm is now deprecated:

http://mahout.apache.org/users/basics/algorithms.html
