<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>Re: Rdd/DataFrame/DataSet Performance Tuning in Archives of Support Questions (Read Only)</title>
    <link>https://community.cloudera.com/t5/Archives-of-Support-Questions/Rdd-DataFrame-DataSet-Performance-Tuning/m-p/144671#M35677</link>
    <description>&lt;P&gt;Unfortunately, I'm doing a full outer join, so I can't filter.&lt;/P&gt;</description>
    <pubDate>Tue, 26 Jul 2016 05:17:54 GMT</pubDate>
    <dc:creator>jestinm</dc:creator>
    <dc:date>2016-07-26T05:17:54Z</dc:date>
    <item>
      <title>Rdd/DataFrame/DataSet Performance Tuning</title>
      <link>https://community.cloudera.com/t5/Archives-of-Support-Questions/Rdd-DataFrame-DataSet-Performance-Tuning/m-p/144668#M35674</link>
      <description>Hello, right now I'm using DataFrames to perform df1.groupBy(key).count() on one DataFrame and join the result with another DataFrame, df2.
&lt;P&gt;The first, df1, is very large (many gigabytes) compared to df2 (250 Mb).&lt;/P&gt;
Right now I'm running this on a cluster of 5 nodes, 16 cores each, 90 GB RAM each.
It is taking me about 1 hour and 40 minutes to perform the groupBy, count, and join, which seems very slow to me.
Currently I have set the following in my &lt;STRONG&gt;spark-defaults.conf&lt;/STRONG&gt;:

&lt;TABLE&gt;&lt;TBODY&gt;&lt;TR&gt;&lt;TD&gt;spark.executor.instances&lt;/TD&gt;&lt;TD&gt;24&lt;/TD&gt;&lt;/TR&gt;&lt;TR&gt;&lt;TD&gt;spark.executor.memory&lt;/TD&gt;&lt;TD&gt;10g&lt;/TD&gt;&lt;/TR&gt;&lt;TR&gt;&lt;TD&gt;spark.executor.cores&lt;/TD&gt;&lt;TD&gt;3&lt;/TD&gt;&lt;/TR&gt;&lt;TR&gt;&lt;TD&gt;spark.driver.memory&lt;/TD&gt;&lt;TD&gt;5g&lt;/TD&gt;&lt;/TR&gt;&lt;TR&gt;&lt;TD&gt;spark.sql.autoBroadcastJoinThreshold&lt;/TD&gt;&lt;TD&gt;200Mb&lt;/TD&gt;&lt;/TR&gt;&lt;/TBODY&gt;&lt;/TABLE&gt;&lt;P&gt;I have a couple of questions regarding tuning for performance as a beginner.&lt;/P&gt;&lt;OL&gt;
&lt;LI&gt;Right now I'm running Spark 1.6.0. Would moving to Spark 2.0 DataSet (or even DataFrames) be much better?&lt;/LI&gt;&lt;LI&gt;What if I used RDDs instead? I know that reduceByKey is better than groupByKey, and DataFrames don't have that method.&lt;/LI&gt;&lt;LI&gt;I think I can do a broadcast join and have set a threshold. Do I need to set it above my second DataFrame size? Do I need to explicitly call broadcast(df2)?&lt;/LI&gt;&lt;LI&gt;What's the point of driver memory?&lt;/LI&gt;&lt;LI&gt;Can anyone point out something wrong with my tuning numbers, or any additional parameters worth checking out?&lt;/LI&gt;&lt;/OL&gt;&lt;P&gt;Thank you very much!&lt;/P&gt;&lt;P&gt;Sincerely,
Jestin&lt;/P&gt;</description>
      <pubDate>Sat, 23 Jul 2016 22:33:22 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Archives-of-Support-Questions/Rdd-DataFrame-DataSet-Performance-Tuning/m-p/144668#M35674</guid>
      <dc:creator>jestinm</dc:creator>
      <dc:date>2016-07-23T22:33:22Z</dc:date>
    </item>
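    <!--
    A note on the spark.sql.autoBroadcastJoinThreshold row in the question above: in Spark 1.6 this setting is typically given as a plain byte count rather than a "200Mb"-style string, so it is worth computing the value explicitly. A minimal sketch in plain Python (the sizes are illustrative assumptions, not taken from the thread):

    ```python
    # Hypothetical size for illustration: df2 is about 250 MB.
    df2_size_mb = 250

    # Give the threshold headroom above df2's size so the broadcast
    # join is actually chosen (e.g. 300 MB, expressed in bytes).
    threshold_bytes = 300 * 1024 * 1024

    # The line you would put in spark-defaults.conf:
    conf_line = "spark.sql.autoBroadcastJoinThreshold %d" % threshold_bytes
    print(conf_line)  # spark.sql.autoBroadcastJoinThreshold 314572800
    ```
    -->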
    <item>
      <title>Re: Rdd/DataFrame/DataSet Performance Tuning</title>
      <link>https://community.cloudera.com/t5/Archives-of-Support-Questions/Rdd-DataFrame-DataSet-Performance-Tuning/m-p/144669#M35675</link>
      <description>&lt;P&gt;&lt;A rel="user" href="https://community.cloudera.com/users/11524/jestinm.html" nodeid="11524"&gt;@jestin ma&lt;/A&gt; &lt;/P&gt;&lt;P&gt;I wonder if doing a filter would help rather than a join and achive the same results. So instead of join, is it possible to do something like this?&lt;/P&gt;&lt;P&gt;df1.filter(df2).groupBy(key).count().&lt;/P&gt;</description>
      <pubDate>Sun, 24 Jul 2016 01:01:13 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Archives-of-Support-Questions/Rdd-DataFrame-DataSet-Performance-Tuning/m-p/144669#M35675</guid>
      <dc:creator>mqureshi</dc:creator>
      <dc:date>2016-07-24T01:01:13Z</dc:date>
    </item>
    <item>
      <title>Re: Rdd/DataFrame/DataSet Performance Tuning</title>
      <link>https://community.cloudera.com/t5/Archives-of-Support-Questions/Rdd-DataFrame-DataSet-Performance-Tuning/m-p/144670#M35676</link>
      <description>&lt;P&gt;Use a broadcast variable for the smaller table to join it to the larger table. This implements a broadcast join (the same idea as a map-side join) and saves you quite a bit of network I/O and time.&lt;/P&gt;</description>
      <pubDate>Sun, 24 Jul 2016 06:05:34 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Archives-of-Support-Questions/Rdd-DataFrame-DataSet-Performance-Tuning/m-p/144670#M35676</guid>
      <dc:creator>jwiden</dc:creator>
      <dc:date>2016-07-24T06:05:34Z</dc:date>
    </item>
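    <!--
    The broadcast join suggested in the reply above can be pictured without Spark: the small table is shipped whole to every executor, and each partition of the large table joins against it locally, so the large side is never shuffled across the network. A toy sketch in plain Python (names and data are illustrative, not Spark API):

    ```python
    # Small table (like df2): key to attribute, small enough to copy to every node.
    small = {"a": 1, "b": 2}

    # Large table (like df1) split into partitions that live on different nodes.
    partitions = [[("a", "x"), ("c", "y")], [("b", "z"), ("a", "w")]]

    # Each partition joins locally against its own copy of `small`;
    # no shuffle of the large side is needed. None models a non-matching key,
    # which a full outer join would keep.
    joined = [
        (key, val, small.get(key))
        for part in partitions
        for key, val in part
    ]
    print(joined)
    ```
    -->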
    <item>
      <title>Re: Rdd/DataFrame/DataSet Performance Tuning</title>
      <link>https://community.cloudera.com/t5/Archives-of-Support-Questions/Rdd-DataFrame-DataSet-Performance-Tuning/m-p/144671#M35677</link>
      <description>&lt;P&gt;Unfortunately, I'm doing a full outer join, so I can't filter.&lt;/P&gt;</description>
      <pubDate>Tue, 26 Jul 2016 05:17:54 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Archives-of-Support-Questions/Rdd-DataFrame-DataSet-Performance-Tuning/m-p/144671#M35677</guid>
      <dc:creator>jestinm</dc:creator>
      <dc:date>2016-07-26T05:17:54Z</dc:date>
    </item>
    <item>
      <title>Re: Rdd/DataFrame/DataSet Performance Tuning</title>
      <link>https://community.cloudera.com/t5/Archives-of-Support-Questions/Rdd-DataFrame-DataSet-Performance-Tuning/m-p/144672#M35678</link>
      <description>&lt;P&gt;Some thoughts/questions:&lt;/P&gt;&lt;UL&gt;&lt;LI&gt;What does the key distribution look like? If lumpy, perhaps a repartition would help? Looking at the Spark UI might give some insight into the bottleneck.&lt;/LI&gt;&lt;LI&gt;Bumping up spark.sql.autoBroadcastJoinThreshold to 300M might help ensure that the map-side join (broadcast join) happens. Check &lt;A target="_blank" href="http://spark.apache.org/docs/latest/sql-programming-guide.html#performance-tuning"&gt;here&lt;/A&gt; though because it notes "...that currently statistics are only supported for Hive Metastore tables where the command &lt;CODE&gt;ANALYZE TABLE &amp;lt;tableName&amp;gt; COMPUTE STATISTICS noscan&lt;/CODE&gt; has been run."&lt;/LI&gt;&lt;/UL&gt;&lt;P&gt;Other answers:&lt;/P&gt;&lt;OL&gt;
&lt;LI&gt;Right now I'm running Spark 1.6.0. Would moving to Spark 2.0 DataSet (or even DataFrames) be much better?&lt;OL&gt;&lt;LI&gt;Doubt it. There are probably opportunities to tune with what you have, and that tuning would be needed in 2.0 anyway.&lt;/LI&gt;&lt;/OL&gt;&lt;/LI&gt;&lt;LI&gt;What if I used RDDs instead? I know that reduceByKey is better than groupByKey, and DataFrames don't have that method.&lt;OL&gt;&lt;LI&gt;If you want to post more of your code, we can comment on that. It's hard to tell whether the RDD API's more granular control would help you without the bigger picture.&lt;/LI&gt;&lt;/OL&gt;&lt;/LI&gt;&lt;LI&gt;I think I can do a broadcast join and have set a threshold. Do I need to set it above my second DataFrame size? Do I need to explicitly call broadcast(df2)?&lt;OL&gt;&lt;LI&gt;Yes, the threshold matters and should be above the data size. Think of this as a map-side join. No, you should not need to call broadcast explicitly. However, if you did, you could not broadcast the DataFrame itself; it would have to be a collection loaded in the driver. Check here for info on broadcast variables: &lt;A href="https://spark.apache.org/docs/0.8.1/scala-programming-guide.html#broadcast-variables" target="_blank"&gt;https://spark.apache.org/docs/0.8.1/scala-programming-guide.html#broadcast-variables&lt;/A&gt;&lt;/LI&gt;&lt;/OL&gt;&lt;/LI&gt;&lt;LI&gt;What's the point of driver memory?&lt;OL&gt;&lt;LI&gt;Something like "collect" brings results back to the driver. If you're collecting a lot of results, you'll need to worry about the driver-memory setting.&lt;/LI&gt;&lt;/OL&gt;&lt;/LI&gt;&lt;LI&gt;Can anyone point out something wrong with my tuning numbers, or any additional parameters worth checking out?&lt;OL&gt;&lt;LI&gt;Looks good, but we could give more assistance with the full code. Also, look at the Spark UI and walk the DAG to see where the bottleneck is.&lt;/LI&gt;&lt;/OL&gt;&lt;/LI&gt;&lt;/OL&gt;</description>
      <pubDate>Tue, 26 Jul 2016 22:21:35 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Archives-of-Support-Questions/Rdd-DataFrame-DataSet-Performance-Tuning/m-p/144672#M35678</guid>
      <dc:creator>clukasik</dc:creator>
      <dc:date>2016-07-26T22:21:35Z</dc:date>
    </item>
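    <!--
    On the reduceByKey point in the answer above: the win over groupByKey is the map-side combine, where each partition pre-aggregates its own counts so that only small per-partition maps cross the network during the shuffle. A plain-Python sketch of that idea (not Spark API; the data is illustrative):

    ```python
    from collections import Counter

    # Two partitions of keys, as they might sit on two executors.
    partitions = [["a", "b", "a"], ["b", "b", "c"]]

    # Map-side combine: each partition counts its own records locally first.
    local_counts = [Counter(p) for p in partitions]

    # The shuffle then merges only the small per-partition maps,
    # instead of moving every individual record across the network.
    total = Counter()
    for c in local_counts:
        total.update(c)
    print(dict(total))  # {'a': 2, 'b': 3, 'c': 1}
    ```
    -->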
  </channel>
</rss>

