
Averaging RandomForest votes in Spark 1.3.1

Contributor

I'm trying to compute an average of the RandomForest tree predictions in Spark 1.3.1, since the predicted probability across all trees is only available from 1.5.0 onward.

The best I have managed so far is the function below:

import numpy as np
from pyspark.mllib.tree import DecisionTreeModel

def calculaProbs(dados, modelRF):
    # access the individual trees inside the trained RandomForest model
    trees = modelRF._java_model.trees()
    nTrees = modelRF.numTrees()
    nPontos = dados.count()
    predictions = np.zeros(nPontos)
    for i in range(nTrees):
        dtm = DecisionTreeModel(trees[i])
        # predict with a single tree and pull its results back to the driver
        predictions += np.array(dtm.predict(dados.map(lambda x: x.features)).collect())
    # average the per-tree votes
    predictions = predictions / nTrees
    return predictions

This code runs very slowly, as expected, since I'm collecting (collect()) the predictions from each tree and adding them up on the driver. I cannot put dtm.predict() inside a map operation in this version of Spark. Here is the note from the documentation: "Note: In Python, predict cannot currently be used within an RDD transformation or action. Call predict directly on the RDD instead."

Any idea how to improve performance? How can I add the values of two RDDs without collecting them into a vector?


3 REPLIES

Contributor (accepted solution)

To add the values of two RDDs element-wise, the general approach is:

1. Convert each RDD to a pair RDD (key-value). You can use zipWithIndex() to do this if your RDD doesn't have implicit keys.

2. Do a union of the two RDDs.

3. Do reduceByKey(_+_) (in Python, reduceByKey(lambda a, b: a + b)) on the new RDD.

Don't use collect(); it is slow and you'll be limited by the driver's memory anyway. A small PySpark sketch of this recipe follows below.

edit: see here for an example in Scala which you can adapt to Python: http://stackoverflow.com/questions/27395420/concatenating-datasets-of-different-rdds-in-apache-spark...
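Here is a minimal PySpark sketch of that recipe. It is illustrative only: the RDD names (preds_a, preds_b) and their contents are made up, and the lambda passed to reduceByKey is the Python equivalent of the Scala _+_ shorthand.

# key each prediction RDD by element position, union them, then sum per key
from pyspark import SparkContext

sc = SparkContext(appName="sum-two-rdds")

preds_a = sc.parallelize([0.0, 1.0, 1.0, 0.0])  # votes from one tree (dummy data)
preds_b = sc.parallelize([1.0, 1.0, 0.0, 0.0])  # votes from another tree (dummy data)

# zipWithIndex() yields (value, index); swap to (index, value) so the position is the key
pair_a = preds_a.zipWithIndex().map(lambda vi: (vi[1], vi[0]))
pair_b = preds_b.zipWithIndex().map(lambda vi: (vi[1], vi[0]))

# union the keyed RDDs and sum the values sharing a key -- no collect() needed
summed = pair_a.union(pair_b).reduceByKey(lambda a, b: a + b)

# divide by the number of RDDs to turn the sums into averages, still distributed
averaged = summed.mapValues(lambda s: s / 2.0)

For the RandomForest case you would union the keyed prediction RDDs of all the trees and divide by nTrees instead of 2.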


@Vitor Batista

How difficult would it be for you to upgrade Spark?

When you run Spark on YARN (with Hortonworks), the upgrade process is quite simple, following the steps described here:

http://hortonworks.com/hadoop-tutorial/apache-spark-1-5-1-technical-preview-with-hdp-2-3/

This is one of the advantages of running Spark on YARN instead of in Spark standalone mode. Have you considered this option as well?
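For reference, pointing an existing PySpark job at YARN is mostly a matter of the master setting. A minimal sketch, assuming a Spark 1.x client-mode submission; the app name and resource values below are placeholders:

from pyspark import SparkConf, SparkContext

# "yarn-client" runs the driver locally and the executors on the YARN cluster;
# a standalone cluster would use a spark://host:7077 master URL instead
conf = (SparkConf()
        .setAppName("rf-vote-averaging")
        .setMaster("yarn-client")
        .set("spark.executor.memory", "4g")
        .set("spark.executor.instances", "10"))
sc = SparkContext(conf=conf)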

Contributor

In my own experience, Spark runs much faster in standalone mode. I have tried a variety of configurations on YARN, but I can't get the same performance.

I'll try to upgrade. Is there a similar guide for standalone mode?