
Averaging RandomForest votes in Spark 1.3.1

New Contributor

I'm trying to calculate an average of RandomForest predictions in Spark 1.3.1, since the predicted probability of all trees is only available in 1.5.0.

The best I could do so far is the function below:

import numpy as np
from pyspark.mllib.tree import DecisionTreeModel

def calculaProbs(dados, modelRF):
    trees = modelRF._java_model.trees()   # JVM handles to the individual trees
    nTrees = modelRF.numTrees()
    nPontos = dados.count()
    predictions = np.zeros(nPontos)
    for i in range(nTrees):
        dtm = DecisionTreeModel(trees[i])
        # predict() is called directly on the RDD, then collected to the driver
        predictions += dtm.predict(dados.map(lambda x: x.features)).collect()
    predictions = predictions / nTrees
    return predictions

This code runs very slowly, as expected, since I'm collecting (collect()) the predictions from each tree and adding them up on the driver. I cannot put dtm.predict() inside a map operation in this version of Spark. Here is the note from the documentation: "Note: In Python, predict cannot currently be used within an RDD transformation or action. Call predict directly on the RDD instead."

Any idea to improve performance? How can I add the values of two RDDs without collecting them to a vector?

1 ACCEPTED SOLUTION


Re: Averaging RandomForest votes in Spark 1.3.1

New Contributor

To add the values of two RDDs element-wise, the general approach is:

1. Convert each RDD to a pair RDD (key-value). You can use zipWithIndex() if your RDDs don't have implicit keys.

2. Take the union of the two RDDs.

3. Do reduceByKey(_+_) on the combined RDD.

Don't use collect(); it is slow and you'll be limited by the driver's memory anyway. A PySpark sketch of this approach is included below.

edit: see here for an example in Scala which you can adapt to Python: http://stackoverflow.com/questions/27395420/concatenating-datasets-of-different-rdds-in-apache-spark...
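For the original problem, a rough PySpark adaptation of these steps could look like the sketch below. It is untested, written against the Spark 1.3.1 MLlib API, reuses the dados/modelRF arguments from the question, and the function name is made up:

from pyspark.mllib.tree import DecisionTreeModel

def calculaProbsDistribuido(dados, modelRF):
    trees = modelRF._java_model.trees()
    nTrees = modelRF.numTrees()
    features = dados.map(lambda x: x.features).cache()
    # (index, runningSum) pairs, starting at zero for every point
    sums = features.zipWithIndex().map(lambda kv: (kv[1], 0.0))
    for i in range(nTrees):
        dtm = DecisionTreeModel(trees[i])
        # predict() is called directly on the RDD, as the docs require
        preds = dtm.predict(features).zipWithIndex().map(lambda kv: (kv[1], kv[0]))
        # union + reduceByKey adds the two RDDs element-wise on the cluster
        sums = sums.union(preds).reduceByKey(lambda a, b: a + b)
        # with many trees the lineage gets long; an occasional cache() or
        # checkpoint() on sums may be needed to keep it manageable
    return sums.mapValues(lambda s: s / float(nTrees))  # average vote per point

This assumes dtm.predict(features) preserves the element order of features, so that zipWithIndex() assigns matching indices to both RDDs.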

3 REPLIES

Re: Averaging RandomForest votes in Spark 1.3.1

@Vitor Batista

How difficult is it for you to upgrade Spark?

When you run Spark on YARN (with Hortonworks), the upgrade process is really simple; see the steps described here:

http://hortonworks.com/hadoop-tutorial/apache-spark-1-5-1-technical-preview-with-hdp-2-3/

This is one of the advantages of running Spark on YARN instead of Spark standalone mode. Have you considered this option as well?
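If you do upgrade, note that from Spark 1.5 on the DataFrame-based ML API exposes the averaged tree votes directly as a probability column, so no manual averaging is needed. A minimal sketch, assuming Spark 1.5+ and DataFrames with "label" and "features" columns (untested):

from pyspark.ml.classification import RandomForestClassifier

rf = RandomForestClassifier(labelCol="label", featuresCol="features", numTrees=100)
model = rf.fit(trainDF)           # trainDF: DataFrame of (label, features) rows
scored = model.transform(testDF)  # adds prediction, rawPrediction, probability
scored.select("probability").show()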

Re: Averaging RandomForest votes in Spark 1.3.1

New Contributor

From my own experience, Spark runs much faster in standalone mode. I tried a variety of configurations on YARN, but I couldn't get the same performance.

I'll try to upgrade. Is there a guide for standalone mode?