Averaging RandomForest votes in Spark 1.3.1
Labels: Apache Spark
Created 12-14-2015 03:06 PM
I'm trying to calculate an average of the RandomForest predictions in Spark 1.3.1, since the predicted probability across all trees is only available starting in 1.5.0.
The best I have managed so far is the function below:
import numpy as np
from pyspark.mllib.tree import DecisionTreeModel

def calculaProbs(dados, modelRF):
    # Average the votes of the individual trees of a trained RandomForestModel.
    trees = modelRF._java_model.trees()
    nTrees = modelRF.numTrees()
    nPontos = dados.count()
    predictions = np.zeros(nPontos)
    for i in range(nTrees):
        # Wrap each Java tree in a Python DecisionTreeModel and predict on the features.
        dtm = DecisionTreeModel(trees[i])
        predictions += dtm.predict(dados.map(lambda x: x.features)).collect()
    predictions = predictions / nTrees
    return predictions
This code runs very slowly, as expected, since I'm collecting (collect()) the predictions from each tree and adding them up on the driver. I cannot put the dtm.predict() call inside a map operation in this version of Spark. Here is the note from the documentation: "Note: In Python, predict cannot currently be used within an RDD transformation or action. Call predict directly on the RDD instead."
Any ideas to improve performance? How can I add the values of two RDDs without collecting them into a vector?
Created 12-14-2015 05:59 PM
To add the values of two RDDs element-wise, the general approach is:
0. Convert the RDDs to pair RDDs (key-value). You can use zipWithIndex() if your RDDs don't have implicit keys.
1. Do a union of the two RDDs
2. Do reduceByKey(_ + _) (in Python: reduceByKey(lambda a, b: a + b)) on the combined RDD
Don't use collect(); it is slow, and you'll be limited by the driver's memory anyway. A minimal sketch of this approach is below.
edit: see here for an example in Scala which you can adapt to Python: http://stackoverflow.com/questions/27395420/concatenating-datasets-of-different-rdds-in-apache-spark...
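For illustration, a minimal PySpark sketch of that recipe (the RDD names and sample values here are made up):

from pyspark import SparkContext

sc = SparkContext(appName="sum-two-rdds")

# Two hypothetical RDDs of per-tree predictions to be added element-wise.
preds_a = sc.parallelize([0.0, 1.0, 1.0, 0.0])
preds_b = sc.parallelize([1.0, 1.0, 0.0, 0.0])

# 0. Key each RDD by its row index so matching rows can be paired up.
#    zipWithIndex() yields (value, index), so swap to (index, value).
keyed_a = preds_a.zipWithIndex().map(lambda kv: (kv[1], kv[0]))
keyed_b = preds_b.zipWithIndex().map(lambda kv: (kv[1], kv[0]))

# 1. Union the two keyed RDDs, then 2. sum the values that share an index.
summed = keyed_a.union(keyed_b).reduceByKey(lambda a, b: a + b)

# Restore the original order and drop the keys -> [1.0, 2.0, 1.0, 0.0]
result = summed.sortByKey().values()

Repeating this inside the loop over the trees keeps the running sum distributed, so nothing has to be collected to the driver until the very end (if at all).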
Created 12-15-2015 02:06 AM
How difficult would it be for you to upgrade Spark?
When you run Spark on YARN (with Hortonworks), the upgrade process is really simple, as in the steps described here:
http://hortonworks.com/hadoop-tutorial/apache-spark-1-5-1-technical-preview-with-hdp-2-3/
This is one of the advantages of running Spark on YARN instead of Spark standalone mode. Have you considered this option as well?
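For what it's worth, switching between the two modes is usually just a matter of the --master argument passed to spark-submit; the script name and master URL below are placeholders:

# Run the job on YARN (client mode)
spark-submit --master yarn-client calcula_probs.py

# Run the same job against a standalone cluster
spark-submit --master spark://master-host:7077 calcula_probs.py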
Created 12-15-2015 09:27 AM
In my own experience, Spark runs much faster in standalone mode. I tried a variety of configurations on YARN, but I can't get the same performance.
I'll try to upgrade. Is there a guide to standalone mode?
