<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>question Re: Averaging RandomForest votes in Spark 1.3.1 in Archives of Support Questions (Read Only)</title>
    <link>https://community.cloudera.com/t5/Archives-of-Support-Questions/Averaging-RandomForest-votes-in-Spark-1-3-1/m-p/99163#M12433</link>
    <description>&lt;P&gt;&lt;A rel="user" href="https://community.cloudera.com/users/1218/vabatista.html" nodeid="1218"&gt;@Vitor Batista&lt;/A&gt; &lt;/P&gt;&lt;P&gt;How difficult would it be for you to upgrade Spark?&lt;/P&gt;&lt;P&gt;When you run Spark on YARN (with Hortonworks), the upgrade process is really simple, as the steps described here show:&lt;/P&gt;&lt;P&gt;&lt;A target="_blank" href="http://hortonworks.com/hadoop-tutorial/apache-spark-1-5-1-technical-preview-with-hdp-2-3/"&gt;http://hortonworks.com/hadoop-tutorial/apache-spark-1-5-1-technical-preview-with-hdp-2-3/&lt;/A&gt;&lt;/P&gt;&lt;P&gt;This is one of the advantages of running Spark on YARN instead of in Spark standalone mode. Have you considered this option as well?&lt;/P&gt;</description>
    <pubDate>Tue, 15 Dec 2015 10:06:04 GMT</pubDate>
    <dc:creator>gbraccialli3</dc:creator>
    <dc:date>2015-12-15T10:06:04Z</dc:date>
    <item>
      <title>Averaging RandomForest votes in Spark 1.3.1</title>
      <link>https://community.cloudera.com/t5/Archives-of-Support-Questions/Averaging-RandomForest-votes-in-Spark-1-3-1/m-p/99161#M12431</link>
      <description>&lt;P&gt;I'm trying to calculate an average of RandomForest predictions in Spark 1.3.1, since the predicted probability of all trees is available only in 1.5.0.&lt;/P&gt;&lt;P&gt;The best I have managed so far is the function below:&lt;/P&gt;&lt;PRE&gt;import numpy as np
from pyspark.mllib.tree import DecisionTreeModel

def calculaProbs(dados, modelRF):
    # Average the per-tree predictions of a RandomForestModel on the driver
    trees = modelRF._java_model.trees()
    nTrees = modelRF.numTrees()
    nPontos = dados.count()
    predictions = np.zeros(nPontos)
    for i in range(nTrees):
        dtm = DecisionTreeModel(trees[i])
        # collect() pulls this tree's predictions to the driver (the slow part)
        predictions += dtm.predict(dados.map(lambda x: x.features)).collect()
    predictions = predictions / nTrees
    return predictions&lt;/PRE&gt;&lt;P&gt;This code runs very slowly, as expected, since I call collect() on the predictions from each tree and add them up in the driver. I cannot call dtm.predict() inside a map operation in this version of Spark. Here is the note from the documentation: "Note: In Python, predict cannot currently be used within an RDD transformation or action. Call predict directly on the RDD instead."&lt;/P&gt;&lt;P&gt;Any ideas for improving performance? How can I add the values from two RDDs without collecting them into a vector?&lt;/P&gt;</description>
      <pubDate>Mon, 14 Dec 2015 23:06:26 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Archives-of-Support-Questions/Averaging-RandomForest-votes-in-Spark-1-3-1/m-p/99161#M12431</guid>
      <dc:creator>Vitor</dc:creator>
      <dc:date>2015-12-14T23:06:26Z</dc:date>
    </item>
    <item>
      <title>Re: Averaging RandomForest votes in Spark 1.3.1</title>
      <link>https://community.cloudera.com/t5/Archives-of-Support-Questions/Averaging-RandomForest-votes-in-Spark-1-3-1/m-p/99162#M12432</link>
      <description>&lt;P&gt;To add the values of two RDDs element-wise, the general approach is:&lt;/P&gt;&lt;P&gt;0. Convert each RDD to a pair RDD (key-value). If your RDD doesn't already have keys, you can use zipWithIndex() and then swap each pair so that the index becomes the key.&lt;/P&gt;&lt;P&gt;1. Take the union of the two RDDs.&lt;/P&gt;&lt;P&gt;2. Call reduceByKey(_+_) on the resulting RDD.&lt;/P&gt;&lt;P&gt;Don't use collect(): it is slow, and you'll be limited by the driver's memory anyway.&lt;/P&gt;&lt;P&gt;Edit: see here for an example in Scala that you can adapt to Python: &lt;A href="http://stackoverflow.com/questions/27395420/concatenating-datasets-of-different-rdds-in-apache-spark-using-scala" target="_blank"&gt;http://stackoverflow.com/questions/27395420/concatenating-datasets-of-different-rdds-in-apache-spark-using-scala&lt;/A&gt;&lt;/P&gt;</description>
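      <!--
      The three steps in this reply could look roughly like the Python sketch below.
      It is an illustration, not code from the thread: the function name
      average_predictions is hypothetical, and it assumes every input is a PySpark
      RDD of numbers with the same length and the same element order (for example,
      one RDD of predictions per tree). It relies only on the RDD methods
      zipWithIndex, map, union, reduceByKey, mapValues, sortByKey and values.

      ```python
      from operator import add


      def average_predictions(prediction_rdds):
          """Average several equally long, identically ordered RDDs element-wise."""
          n = len(prediction_rdds)
          # Step 0: key every value by its position, giving (index, value) pairs
          keyed = [rdd.zipWithIndex().map(lambda vi: (vi[1], vi[0]))
                   for rdd in prediction_rdds]
          # Step 1: union all keyed RDDs into a single RDD
          combined = keyed[0]
          for other in keyed[1:]:
              combined = combined.union(other)
          # Step 2: sum the values that share an index, then divide by the RDD count
          return (combined.reduceByKey(add)
                          .mapValues(lambda total: total / float(n))
                          .sortByKey()
                          .values())
      ```

      Under those assumptions, the loop in the question would build a list of
      dtm.predict(...) RDDs (without calling collect() on each one) and pass that
      list to this function, so the summation happens on the executors.
      -->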
      <pubDate>Tue, 15 Dec 2015 01:59:13 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Archives-of-Support-Questions/Averaging-RandomForest-votes-in-Spark-1-3-1/m-p/99162#M12432</guid>
      <dc:creator>dkumar1</dc:creator>
      <dc:date>2015-12-15T01:59:13Z</dc:date>
    </item>
    <item>
      <title>Re: Averaging RandomForest votes in Spark 1.3.1</title>
      <link>https://community.cloudera.com/t5/Archives-of-Support-Questions/Averaging-RandomForest-votes-in-Spark-1-3-1/m-p/99163#M12433</link>
      <description>&lt;P&gt;&lt;A rel="user" href="https://community.cloudera.com/users/1218/vabatista.html" nodeid="1218"&gt;@Vitor Batista&lt;/A&gt; &lt;/P&gt;&lt;P&gt;How difficult would it be for you to upgrade Spark?&lt;/P&gt;&lt;P&gt;When you run Spark on YARN (with Hortonworks), the upgrade process is really simple, as the steps described here show:&lt;/P&gt;&lt;P&gt;&lt;A target="_blank" href="http://hortonworks.com/hadoop-tutorial/apache-spark-1-5-1-technical-preview-with-hdp-2-3/"&gt;http://hortonworks.com/hadoop-tutorial/apache-spark-1-5-1-technical-preview-with-hdp-2-3/&lt;/A&gt;&lt;/P&gt;&lt;P&gt;This is one of the advantages of running Spark on YARN instead of in Spark standalone mode. Have you considered this option as well?&lt;/P&gt;</description>
      <pubDate>Tue, 15 Dec 2015 10:06:04 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Archives-of-Support-Questions/Averaging-RandomForest-votes-in-Spark-1-3-1/m-p/99163#M12433</guid>
      <dc:creator>gbraccialli3</dc:creator>
      <dc:date>2015-12-15T10:06:04Z</dc:date>
    </item>
    <item>
      <title>Re: Averaging RandomForest votes in Spark 1.3.1</title>
      <link>https://community.cloudera.com/t5/Archives-of-Support-Questions/Averaging-RandomForest-votes-in-Spark-1-3-1/m-p/99164#M12434</link>
      <description>&lt;P&gt;In my own experience, Spark runs much faster in standalone mode. I tried a variety of configurations on YARN, but I couldn't get the same performance.&lt;/P&gt;&lt;P&gt;I'll try to upgrade. Is there a guide for upgrading in standalone mode?&lt;/P&gt;</description>
      <pubDate>Tue, 15 Dec 2015 17:27:10 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Archives-of-Support-Questions/Averaging-RandomForest-votes-in-Spark-1-3-1/m-p/99164#M12434</guid>
      <dc:creator>Vitor</dc:creator>
      <dc:date>2015-12-15T17:27:10Z</dc:date>
    </item>
  </channel>
</rss>

