
Do Spark ML and Mahout support Random Forests in clusters? How about SVMs?

Explorer

Do Spark ML and Mahout support Random Forests in clusters? Are there examples? If not, why not, and what would be the closest alternatives that are currently supported? How about SVMs? Examples for those would also be welcome.

 

Many thanks, mates!



Master Collaborator

MLlib supports SVMs as of Spark 1.1. It supports decision trees in 1.1, and decision forests in 1.2, which has not quite been released yet.

 

Mahout has implementations of SVMs and decision forests. Both are fairly old and MapReduce-based.

Explorer

Thanks, mate! Any word on Spark 1.2 MLlib Random Forests? Are decision forests just classification trees with bagging over the data set, and not over features?

 

Chris

Master Collaborator

Random decision forests in MLlib 1.2 can do classification or regression. Yes, they do bagging. I don't believe it's by feature, no.

Explorer

The more I look at ML libraries for clusters, the more I see limitations stemming from the need to meet commutative and associative constraints. I have used scikit-learn Random Forests on multi-core machines (not clusters) and found them superior to SVMs for non-linear classification with high-dimensional data.

 

Are there any good references on which ML algorithms (supervised, unsupervised, and hybrid) can and cannot work on Apache-style clusters (Hadoop, HBase, Spark, Accumulo), and why?

 

Furthermore, are there any good references on what does work well currently and under what conditions?

 

Thanks, mate!

 

Chris

Master Collaborator

Hm, what do you mean by commutative and associative?

And do you mean Hadoop clusters?

 

I'm not sure there's a particular limit to what a Hadoop cluster can do well, other than that it's fundamentally a data-parallel paradigm. Most things can be done efficiently in this paradigm, especially random forests. The only things that don't work well are those that require extremely fast asynchronous communication, such as MPI-style computations.
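The data-parallel pattern mentioned above can be sketched in a few lines of plain Python (a toy illustration, not Hadoop or Spark code): each partition computes a partial result independently, and a cheap merge combines them, with no communication between workers during the map step.

```python
from functools import reduce

# Data-parallel pattern: each partition computes a partial result
# independently (the "map" step); an associative merge combines
# them (the "reduce" step). No fast async communication is needed.

def partial_stats(partition):
    # Per-worker summary: (sum, count) for this partition only.
    return (sum(partition), len(partition))

def merge(a, b):
    # Associative combine of two partial summaries.
    return (a[0] + b[0], a[1] + b[1])

partitions = [[1, 2, 3], [4, 5], [6, 7, 8, 9]]
total, count = reduce(merge, map(partial_stats, partitions))
print(total / count)  # -> 5.0, same mean as over all the data at once
```

Because `merge` is associative, the framework is free to combine the partial results in any grouping, which is exactly what makes the computation cluster-friendly.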

 

Decision forests are strong and you can certainly do them well on Hadoop.

Explorer

This is what I meant by commutative and associative properties; here is a reference: http://www.mathsisfun.com/associative-commutative-distributive.html . I am not being condescending; I like it because it has pictures and it is not boring to read.

 

Yes, I mean Hadoop clusters.

 

Decision forests, yes. For Random Forests, see: http://en.wikipedia.org/wiki/Random_forest and http://www.whrc.org/education/indonesia/pdf/DecisionTrees_RandomForest_v2.pdf

 

I don't know of, and would like to know about, any current approaches and/or existing libraries for Hadoop clusters (or Spark).

 

All I have seen thus far is bagging of data sets, but not of features as described in the references.

 

Thanks for your assistance!

 

Chris

Master Collaborator

Yes, I know what commutativity and associativity are; I was wondering how they related to Hadoop and decision forests. In theory a reduce function should be commutative and associative, but in practice it does not need to be in MapReduce, a MapReduce job as a whole is not constrained this way, and Spark certainly is not. There is no practical limitation of this form in these computation paradigms.
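To make the "in theory" point concrete, here is a minimal plain-Python sketch (not Spark code) of why associativity matters for a parallel reduce: an associative operation gives the same answer under any grouping of partial reductions, while a non-associative one does not.

```python
from functools import reduce

data = [8, 4, 2, 1]

# Addition is associative and commutative: any grouping or order of
# partial reductions yields the same result, so workers can combine
# partials however scheduling happens to group them.
assert reduce(lambda a, b: a + b, data) == 15
assert (8 + 4) + (2 + 1) == 8 + (4 + (2 + 1)) == 15

# Subtraction is neither, so the result depends on how a parallel
# framework groups the partial reductions.
sequential = reduce(lambda a, b: a - b, data)  # ((8 - 4) - 2) - 1 = 1
regrouped = (8 - 4) - (2 - 1)                  # two partitions merged = 3
print(sequential, regrouped)  # -> 1 3
```

A framework can still run a non-associative function in a reduce step; the results are simply not deterministic across different partitionings, which is the practical point being made above.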

 

I looked into the MLlib random decision forest code, and it does look like it selects features at random too, depending on the configuration. So you could say it bags by both examples and features.
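The two levels of randomness being discussed can be sketched in a few lines of plain Python. This is a toy illustration of the general random-forest recipe, not MLlib's actual code; the function name and the `feature_frac` parameter are made up for the example.

```python
import random

def bootstrap_and_feature_sample(rows, n_features, feature_frac=0.5, seed=0):
    """Toy sketch of the two kinds of bagging in a random decision forest:
    sample training examples with replacement (bootstrap), and pick a
    random subset of features for the tree to consider."""
    rng = random.Random(seed)
    # Bag by examples: bootstrap sample, same size as the original data.
    bagged = [rng.choice(rows) for _ in rows]
    # Bag by features: this tree only sees a random feature subset.
    k = max(1, int(n_features * feature_frac))
    features = rng.sample(range(n_features), k)
    return bagged, features

rows = [(0.1, 1.2, 3.0), (0.4, 0.2, 2.1), (0.9, 1.1, 0.5)]
bagged, features = bootstrap_and_feature_sample(rows, n_features=3)
print(len(bagged), features)  # 3 bagged rows, 1 randomly chosen feature index
```

Each tree in the forest would be grown from its own `(bagged, features)` draw, and the forest predicts by voting (classification) or averaging (regression) across trees.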

 

The Oryx implementation also certainly does all of what you describe.

https://github.com/cloudera/oryx/tree/master/rdf-computation

Explorer

Thanks very much. I am new to Spark and drank the Kool-Aid on the commutative and associative mandate for cluster-based algorithms.

 

I very much appreciate you providing me an accurate view on implementations.

 

 

I am very interested in the parallelization of SNA and ML algorithms on clusters and would appreciate any reading/references you can provide.

 

 

Thanks again for your time and insight; I appreciate any further pointers you can provide.

 

In short: thanks, mate!

 

Chris