Created 12-14-2015 03:06 PM
I'm trying to calculate an average of RandomForest predictions in Spark 1.3.1, since the predicted probabilities of all trees are only available in 1.5.0.
The best I could do until now is using the function below:
from pyspark.mllib.tree import DecisionTreeModel
import numpy as np

def calculaProbs(dados, modelRF):
    # Wrap each tree of the RandomForest model and average the per-tree predictions
    trees = modelRF._java_model.trees()
    nTrees = modelRF.numTrees()
    nPontos = dados.count()
    predictions = np.zeros(nPontos)
    for i in range(nTrees):
        dtm = DecisionTreeModel(trees[i])
        # one pass per tree; the predictions are collected back to the driver
        predictions += dtm.predict(dados.map(lambda x: x.features)).collect()
    predictions = predictions / nTrees
    return predictions
This code runs very slowly, as expected, since I'm collecting (collect()) the predictions of each tree and adding them up on the driver. I cannot put dtm.predict() inside a map operation in this version of Spark. Here is the note from the documentation: "Note: In Python, predict cannot currently be used within an RDD transformation or action. Call predict directly on the RDD instead."
Any ideas to improve performance? How can I add the values of two RDDs without collecting them into a vector?
Created 12-14-2015 05:59 PM
To add the values of two RDDs element-wise, the general approach is:
0. Convert the RDDs to pair RDDs (key-value). You can use zipWithIndex() if your RDDs don't already have keys.
1. Do a union of the two RDDs
2. Do reduceByKey(_+_) on the new RDD
Don't use collect(); it is slow and you'll be limited by the driver's memory anyway.
edit: see here for an example in Scala which you can adapt to Python: http://stackoverflow.com/questions/27395420/concatenating-datasets-of-different-rdds-in-apache-spark...
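Here is a rough PySpark sketch of the same idea (a minimal sketch; preds1 and preds2 are just illustrative names for two RDDs of per-tree predictions produced in the same order):

keyed1 = preds1.zipWithIndex().map(lambda x: (x[1], x[0]))  # (index, prediction)
keyed2 = preds2.zipWithIndex().map(lambda x: (x[1], x[0]))

# Union the keyed RDDs and sum values that share the same index,
# so the addition happens on the executors instead of the driver.
summed = keyed1.union(keyed2).reduceByKey(lambda a, b: a + b)

# Restore the original ordering if you need it.
total = summed.sortByKey().values()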
Created 12-15-2015 02:06 AM
How difficult would it be for you to upgrade Spark?
When you run Spark on YARN (with Hortonworks), the upgrade process is really simple, like the steps described here:
http://hortonworks.com/hadoop-tutorial/apache-spark-1-5-1-technical-preview-with-hdp-2-3/
This is one of the advantages of running Spark on YARN instead of Spark standalone mode. Have you considered this option as well?
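For example (just a sketch; the install path, resource settings and script name are assumptions), the same PySpark job could be submitted against a newer Spark client on YARN without touching the rest of the cluster:

# submit the existing script against the Spark 1.5.x client on YARN
/path/to/spark-1.5.1/bin/spark-submit \
    --master yarn-client \
    --num-executors 4 \
    --executor-memory 4g \
    predict_rf.py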
Created 12-15-2015 09:27 AM
In my experience, Spark runs much faster in standalone mode. I tried a variety of configurations on YARN, but I couldn't get the same performance.
I'll try to upgrade. Is there an upgrade guide for standalone mode?