question Re: Averaging RandomForest votes in Spark 1.3.1 in Archives of Support Questions (Read Only)

Averaging RandomForest votes in Spark 1.3.1

Vitor — Mon, 14 Dec 2015 23:06:26 GMT

I'm trying to calculate an averange of randomForest predictions in Spark 1.3.1, since the predicted probability of all trees is available only in 1.5.0.

The best I could do until now is using the function below:

def calculaProbs(dados, modelRF):
    trees = modelRF._java_model.trees()
    nTrees = modelRF.numTrees()
    nPontos = dados.count()
    predictions = np.zeros(nPontos)
    for i in range(nTrees):
        dtm = DecisionTreeModel(trees[i])
        predictions+= dtm.predict(dados.map(lambda x: x.features)).collect()
    predictions = predictions/nTrees
    return predictions

This code is running very slow, as expected, since I'm collecting (collect()) predictions from each Tree and adding them up in Driver. I cannot put the dtm.predit() inside a Map operation in this version of Spark. Here is the Note from documentation: "Note: In Python, predict cannot currently be used within an RDD transformation or action. Call predict directly on the RDD instead."

Any Idea to improve performance? How can I add values from 2 RDDs without collecting their values to a vector?

Re: Averaging RandomForest votes in Spark 1.3.1

dkumar1 — Tue, 15 Dec 2015 01:59:13 GMT

To add two RDD values, the general approach is:

0. Convert the RDDs to pair RDD (key-value). You can use zipWithIndex() to do it if your RDD doesn't have implicit keys.

1. Do a union of the two RDDs

2. Do reduceByKey(_+_) on the new RDD

Don't use collect, it is slow and you'll be limited by the Driver memory anyway.

edit: see here for an example in Scala which you can adapt to Python: http://stackoverflow.com/questions/27395420/concatenating-datasets-of-different-rdds-in-apache-spark-using-scala

Re: Averaging RandomForest votes in Spark 1.3.1

gbraccialli3 — Tue, 15 Dec 2015 10:06:04 GMT

@Vitor Batista

How difficult is it for you to upgrade spark?

When you run spark on yarn (with hortonworks), the upgrade process is really simple, like the steps describe here:

http://hortonworks.com/hadoop-tutorial/apache-spark-1-5-1-technical-preview-with-hdp-2-3/

This is one of the advantages to run spark on yarn instead of spark standalone mode. Have you considered this option as well?

Re: Averaging RandomForest votes in Spark 1.3.1

Vitor — Tue, 15 Dec 2015 17:27:10 GMT

From my own experience, Spark runs much faster in standalone mode. I tried a variety of configurations on Yarn, but I can't get same performance.

I'll try to upgrade. Is there a guide to standalone mode?