
Averaging RandomForest votes in Spark 1.3.1

Contributor

I'm trying to calculate an average of RandomForest tree predictions in Spark 1.3.1, since the predicted probability over all trees is only available starting in 1.5.0.

The best I have managed so far is the function below:

import numpy as np
from pyspark.mllib.tree import DecisionTreeModel

def calculaProbs(dados, modelRF):
    # Wrap each underlying Java tree, predict on the driver, and average the votes.
    trees = modelRF._java_model.trees()
    nTrees = modelRF.numTrees()
    nPontos = dados.count()
    predictions = np.zeros(nPontos)
    for i in range(nTrees):
        dtm = DecisionTreeModel(trees[i])
        # predict() must be called directly on the RDD; results are collected to the driver
        predictions += np.array(dtm.predict(dados.map(lambda x: x.features)).collect())
    predictions = predictions / nTrees
    return predictions

This code runs very slowly, as expected, since I'm collecting (collect()) the predictions of each tree and adding them up on the driver. I cannot put dtm.predict() inside a map operation in this version of Spark. Here is the note from the documentation: "Note: In Python, predict cannot currently be used within an RDD transformation or action. Call predict directly on the RDD instead."

Any idea how to improve performance? How can I add the values of two RDDs element-wise without collecting them into a vector?


3 REPLIES

Contributor

To add two RDD values, the general approach is:

0. Convert the RDDs to pair RDDs (key-value). You can use zipWithIndex() to do this if your RDDs don't have implicit keys.

1. Do a union of the two RDDs

2. Do reduceByKey(_+_) on the new RDD (see the PySpark sketch below).

Don't use collect; it is slow, and you'll be limited by the driver's memory anyway.

edit: see here for an example in Scala which you can adapt to Python: http://stackoverflow.com/questions/27395420/concatenating-datasets-of-different-rdds-in-apache-spark...
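
In PySpark, the three steps above might look roughly like the sketch below. It is untested on 1.3.1 and reuses the names (dados, modelRF) and the per-tree prediction trick from the question; calculaProbsDistribuido is just an illustrative name.

import operator

from pyspark.mllib.tree import DecisionTreeModel

def calculaProbsDistribuido(dados, modelRF):
    # Per-tree predictions are obtained the same way as in the original function.
    features = dados.map(lambda x: x.features)
    trees = modelRF._java_model.trees()
    nTrees = modelRF.numTrees()

    # Step 0: key each tree's predictions by row index (zipWithIndex yields (value, index)).
    keyed = []
    for i in range(nTrees):
        dtm = DecisionTreeModel(trees[i])
        preds = dtm.predict(features)   # called directly on the RDD, as the docs require
        keyed.append(preds.zipWithIndex().map(lambda vi: (vi[1], vi[0])))

    # Step 1: union all the keyed RDDs into one.
    votes = keyed[0]
    for rdd in keyed[1:]:
        votes = votes.union(rdd)

    # Step 2: sum the votes per row index, then average; everything stays distributed.
    return votes.reduceByKey(operator.add).mapValues(lambda s: s / float(nTrees))

The result stays distributed as an RDD of (row index, averaged vote) pairs; a final sortByKey() would restore the original row order if you need it locally.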


@Vitor Batista

How difficult would it be for you to upgrade Spark?

When you run Spark on YARN (with Hortonworks), the upgrade process is really simple, like the steps described here:

http://hortonworks.com/hadoop-tutorial/apache-spark-1-5-1-technical-preview-with-hdp-2-3/

This is one of the advantages of running Spark on YARN instead of Spark standalone mode. Have you considered this option as well?
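
For illustration only (this is not from the tutorial above, and the app name and host are placeholders), switching a PySpark job from a standalone master to YARN is mostly a matter of the master setting:

from pyspark import SparkConf, SparkContext

# Minimal sketch: point the job at YARN client mode instead of a standalone master.
# "yarn-client" is the master string used in the Spark 1.3-1.5 line.
conf = SparkConf().setAppName("rf-vote-averaging")   # placeholder app name
# conf.setMaster("spark://your-master-host:7077")    # standalone mode (placeholder host)
conf.setMaster("yarn-client")                        # run on YARN instead
sc = SparkContext(conf=conf)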

Contributor

In my own experience, Spark runs much faster in standalone mode. I tried a variety of configurations on YARN, but I couldn't get the same performance.

I'll try upgrading. Is there a guide for standalone mode?