<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>Spark Broadcast Hash Join failing on 800+ million to 1.5 million - Archives of Support Questions (Read Only)</title>
    <link>https://community.cloudera.com/t5/Archives-of-Support-Questions/Spark-Broadcast-Hash-Join-failing-on-800-million-to-1-5/m-p/151636#M44609</link>
    <description>Archived thread: joining an 800+ million row dataset with a 1.5 million row dataset via a Spark broadcast hash join fails with java.lang.OutOfMemoryError: GC overhead limit exceeded.</description>
    <pubDate>Fri, 28 Oct 2016 01:44:31 GMT</pubDate>
    <dc:creator>TimothySpann</dc:creator>
    <dc:date>2016-10-28T01:44:31Z</dc:date>
    <item>
      <title>Spark Broadcast Hash Join failing on 800+ million to 1.5 million</title>
      <link>https://community.cloudera.com/t5/Archives-of-Support-Questions/Spark-Broadcast-Hash-Join-failing-on-800-million-to-1-5/m-p/151632#M44605</link>
      <description>&lt;P&gt;I'm a beginner in Spark, trying to join a &lt;STRONG&gt;1.5-million-row dataset (100.3 MB)&lt;/STRONG&gt; with an &lt;STRONG&gt;800+ million row dataset (15.6 GB)&lt;/STRONG&gt; using a broadcast hash join with the Spark DataFrame API. The application completes in about 5 seconds with 80 tasks. When I run a "joinDF.show()" or "collect" command at the very last step, the tasks fully complete, but my console hangs right after and I get all these errors after some time.&lt;/P&gt;&lt;P&gt;&lt;EM&gt;&lt;STRONG&gt;The first line in the error:&lt;/STRONG&gt;&lt;/EM&gt;&lt;/P&gt;&lt;PRE&gt;Exception in thread "broadcast-hash-join-0" java.lang.OutOfMemoryError: GC overhead limit exceeded&lt;/PRE&gt;&lt;P&gt;&lt;EM&gt;&lt;STRONG&gt;Full error log:&lt;/STRONG&gt;&lt;/EM&gt;&lt;/P&gt;&lt;P&gt;&lt;A href="https://www.dropbox.com/s/te59cxnm4j5rb3p/log.txt?dl=0"&gt;https://www.dropbox.com/s/te59cxnm4j5rb3p/log.txt?dl=0&lt;/A&gt;&lt;/P&gt;&lt;P&gt;&lt;EM&gt;&lt;STRONG&gt;Spark-Shell (fire-up command):&lt;/STRONG&gt;&lt;/EM&gt;&lt;/P&gt;&lt;PRE&gt;spark-shell \
  --executor-memory 16G \
  --num-executors 800 \&lt;/PRE&gt;&lt;P&gt;&lt;EM&gt;&lt;STRONG&gt;Spark-Scala Code:&lt;/STRONG&gt;&lt;/EM&gt;&lt;/P&gt;&lt;PRE&gt;// broadcast() lives in org.apache.spark.sql.functions
import org.apache.spark.sql.functions.broadcast

case class Small(col_1: String, col_2: String, col_3: String, col_4: Int, col_5: Int, col_6: String)

val sm_data = sc.textFile("/small_hadoop_data")

val smallDataframe = sm_data.map(_.split("\\|")).map(attr =&amp;gt; Small(attr(0).toString, attr(1).toString, attr(2).toString, attr(3).toInt, attr(4).toInt, attr(5).toString)).toDF()

smallDataframe.registerTempTable("Small")  // Row Count 1,518,933


val lg_data = sc.textFile("/very_large_hadoop_data")

case class Large(col_1: Int, col_2: String, col_3: Int)

val LargeDataFrame = lg_data.map(_.split("\\|")).map(attr =&amp;gt; Large(attr(0).toInt, attr(2).toString, attr(3).toInt)).toDF()

LargeDataFrame.registerTempTable("Very_Large") // Row Count: 849,064,470

val joinDF = LargeDataFrame.join(broadcast(smallDataframe), "key")&lt;/PRE&gt;
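&lt;P&gt;&lt;EM&gt;A minimal follow-on sketch, assuming the code above (the Parquet output path is a placeholder, not from the original post): actions that keep the joined result distributed on the executors instead of collecting 800+ million joined rows back to the driver:&lt;/EM&gt;&lt;/P&gt;&lt;PRE&gt;// Materialize the join without shipping every row to the driver.
println(joinDF.count())

// Write the result out distributed; "/joined_output" is a placeholder path.
joinDF.write.parquet("/joined_output")&lt;/PRE&gt;</description>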
      <pubDate>Thu, 27 Oct 2016 02:13:23 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Archives-of-Support-Questions/Spark-Broadcast-Hash-Join-failing-on-800-million-to-1-5/m-p/151632#M44605</guid>
      <dc:creator>adnanalvee</dc:creator>
      <dc:date>2016-10-27T02:13:23Z</dc:date>
    </item>
    <item>
      <title>Re: Spark Broadcast Hash Join failing on 800+ million to 1.5 million</title>
      <link>https://community.cloudera.com/t5/Archives-of-Support-Questions/Spark-Broadcast-Hash-Join-failing-on-800-million-to-1-5/m-p/151633#M44606</link>
      <description>&lt;P&gt;Quick question: have you tried it without the broadcast? 1.5 million records is not so small that you should ship it to 800 executors.&lt;/P&gt;&lt;P&gt;Also, shouldn't you be doing something with joinDF, e.g. at least a joinDF.count()?&lt;/P&gt;&lt;P&gt;The join by "key" looks interesting as well. Have you considered trying your logic with a smaller dataset first, say 8000 records? See the sketch below.&lt;/P&gt;
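&lt;P&gt;&lt;EM&gt;A minimal sketch of these suggestions, reusing the names from the question's code (broadcasting the ~100 MB table to 800 executors means roughly 800 copies, about 80 GB shipped in aggregate):&lt;/EM&gt;&lt;/P&gt;&lt;PRE&gt;// Plain shuffle join: no broadcast() hint; count() forces the work
// without pulling rows back to the driver.
val plainJoinDF = LargeDataFrame.join(smallDataframe, "key")
println(plainJoinDF.count())

// Try the same logic on a small slice first, e.g. ~8000 records.
val trialDF = LargeDataFrame.join(smallDataframe.limit(8000), "key")
println(trialDF.count())&lt;/PRE&gt;</description>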
      <pubDate>Thu, 27 Oct 2016 03:42:28 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Archives-of-Support-Questions/Spark-Broadcast-Hash-Join-failing-on-800-million-to-1-5/m-p/151633#M44606</guid>
      <dc:creator>mariano_kamp</dc:creator>
      <dc:date>2016-10-27T03:42:28Z</dc:date>
    </item>
    <item>
      <title>Re: Spark Broadcast Hash Join failing on 800+ million to 1.5 million</title>
      <link>https://community.cloudera.com/t5/Archives-of-Support-Questions/Spark-Broadcast-Hash-Join-failing-on-800-million-to-1-5/m-p/151634#M44607</link>
      <description>&lt;P&gt;I tried without a broadcast. It gets divided into 895 tasks, which should take around 10 minutes to finish.&lt;/P&gt;&lt;P&gt;&lt;STRONG&gt;Spark UI screenshot:&lt;/STRONG&gt; &lt;A href="https://community.hortonworks.com/questions/63668/spark-broadcast-hash-join-failing-on-800-million-t.html#answer-63861"&gt;http://2.1m.yt/e_tnx0K.png&lt;/A&gt;&lt;/P&gt;&lt;P&gt;But it actually doesn't finish; I get an &lt;STRONG&gt;"out of space"&lt;/STRONG&gt; memory error.&lt;/P&gt;</description>
      <pubDate>Fri, 28 Oct 2016 00:53:59 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Archives-of-Support-Questions/Spark-Broadcast-Hash-Join-failing-on-800-million-to-1-5/m-p/151634#M44607</guid>
      <dc:creator>adnanalvee</dc:creator>
      <dc:date>2016-10-28T00:53:59Z</dc:date>
    </item>
    <item>
      <title>Re: Spark Broadcast Hash Join failing on 800+ million to 1.5 million</title>
      <link>https://community.cloudera.com/t5/Archives-of-Support-Questions/Spark-Broadcast-Hash-Join-failing-on-800-million-to-1-5/m-p/151635#M44608</link>
      <description>&lt;P&gt;How big is your cluster? Sounds like you may need more RAM; that's a big join when the two sides come together.&lt;/P&gt;&lt;P&gt;What does the history UI show?&lt;/P&gt;&lt;P&gt;Try 895 executors and 32 GB of RAM.&lt;/P&gt;&lt;P&gt;How many nodes do you have in the cluster? How big are these files in gigabytes? How much RAM is available on the cluster?&lt;/P&gt;&lt;P&gt;Do NOT run this from the shell. The shell is meant for developing and testing parts of your code, not for large jobs. Run this as compiled code and submit it to yarn-cluster.&lt;/P&gt;&lt;P&gt;Can you upgrade to 1.6.2? Newer Spark is faster and more efficient.&lt;/P&gt;&lt;P&gt;Here are some settings to try:&lt;/P&gt;&lt;P&gt;&lt;A href="https://community.hortonworks.com/articles/34209/spark-16-tips-in-code-and-submission.html" target="_blank"&gt;https://community.hortonworks.com/articles/34209/spark-16-tips-in-code-and-submission.html&lt;/A&gt;&lt;/P&gt;
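&lt;P&gt;&lt;EM&gt;A minimal sketch of that submit command, using the executor numbers suggested above (the class and jar names are placeholders, not from this thread):&lt;/EM&gt;&lt;/P&gt;&lt;PRE&gt;spark-submit \
  --master yarn-cluster \
  --class com.example.BigJoin \
  --num-executors 895 \
  --executor-memory 32G \
  big-join.jar&lt;/PRE&gt;</description>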
      <pubDate>Fri, 28 Oct 2016 01:43:05 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Archives-of-Support-Questions/Spark-Broadcast-Hash-Join-failing-on-800-million-to-1-5/m-p/151635#M44608</guid>
      <dc:creator>TimothySpann</dc:creator>
      <dc:date>2016-10-28T01:43:05Z</dc:date>
    </item>
    <item>
      <title>Re: Spark Broadcast Hash Join failing on 800+ million to 1.5 million</title>
      <link>https://community.cloudera.com/t5/Archives-of-Support-Questions/Spark-Broadcast-Hash-Join-failing-on-800-million-to-1-5/m-p/151636#M44609</link>
      <description>&lt;P&gt;Run it on a cluster so you have more RAM. Running on one machine won't support that data size.&lt;/P&gt;&lt;PRE&gt;16/10/25 18:30:41 INFO BlockManager: Reporting 4 blocks to the master.
Exception in thread "qtp1394524874-84" java.lang.OutOfMemoryError: GC overhead limit exceeded
       	at java.util.HashMap$KeySet.iterator(HashMap.java:912)
       	at java.util.HashSet.iterator(HashSet.java:172)
       	at sun.nio.ch.Util$2.iterator(Util.java:243)
       	at org.spark-project.jetty.io.nio.SelectorManager$SelectSet.doSelect(SelectorManager.java:600)
       	at org.spark-project.jetty.io.nio.SelectorManager$1.run(SelectorManager.java:290)
       	at org.spark-project.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:608)
       	at org.spark-project.jetty.util.thread.QueuedThreadPool$3.run(QueuedThreadPool.java:543)
       	at java.lang.Thread.run(Thread.java:745)&lt;/PRE&gt;</description>
      <pubDate>Fri, 28 Oct 2016 01:44:31 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Archives-of-Support-Questions/Spark-Broadcast-Hash-Join-failing-on-800-million-to-1-5/m-p/151636#M44609</guid>
      <dc:creator>TimothySpann</dc:creator>
      <dc:date>2016-10-28T01:44:31Z</dc:date>
    </item>
    <item>
      <title>Re: Spark Broadcast Hash Join failing on 800+ million to 1.5 million</title>
      <link>https://community.cloudera.com/t5/Archives-of-Support-Questions/Spark-Broadcast-Hash-Join-failing-on-800-million-to-1-5/m-p/151637#M44610</link>
      <description>&lt;P&gt;Thanks for all your help. I'll try 32 GB and 895 executors, run it as compiled code, and let you know.&lt;/P&gt;&lt;P&gt;We are running Spark 1.5.2 here at our company.&lt;/P&gt;&lt;P&gt;Here is the cluster config:&lt;/P&gt;&lt;P&gt;&lt;span class="lia-inline-image-display-wrapper lia-image-align-inline" image-alt="8937-cluster-matrics.png" style="width: 1844px;"&gt;&lt;img src="https://community.cloudera.com/t5/image/serverpage/image-id/21156iA8590240BF99B9CE/image-size/medium?v=v2&amp;amp;px=400" role="button" title="8937-cluster-matrics.png" alt="8937-cluster-matrics.png" /&gt;&lt;/span&gt;&lt;/P&gt;</description>
      <pubDate>Sun, 18 Aug 2019 12:52:29 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Archives-of-Support-Questions/Spark-Broadcast-Hash-Join-failing-on-800-million-to-1-5/m-p/151637#M44610</guid>
      <dc:creator>adnanalvee</dc:creator>
      <dc:date>2019-08-18T12:52:29Z</dc:date>
    </item>
    <item>
      <title>Re: Spark Broadcast Hash Join failing on 800+ million to 1.5 million</title>
      <link>https://community.cloudera.com/t5/Archives-of-Support-Questions/Spark-Broadcast-Hash-Join-failing-on-800-million-to-1-5/m-p/151638#M44611</link>
      <description>&lt;P&gt;Yes, if you check my config above, I am running on a cluster. &lt;span class="lia-unicode-emoji" title=":slightly_smiling_face:"&gt;🙂&lt;/span&gt;&lt;/P&gt;</description>
      <pubDate>Fri, 28 Oct 2016 09:10:10 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Archives-of-Support-Questions/Spark-Broadcast-Hash-Join-failing-on-800-million-to-1-5/m-p/151638#M44611</guid>
      <dc:creator>adnanalvee</dc:creator>
      <dc:date>2016-10-28T09:10:10Z</dc:date>
    </item>
    <item>
      <title>Re: Spark Broadcast Hash Join failing on 800+ million to 1.5 million</title>
      <link>https://community.cloudera.com/t5/Archives-of-Support-Questions/Spark-Broadcast-Hash-Join-failing-on-800-million-to-1-5/m-p/151639#M44612</link>
      <description>&lt;P&gt;Regarding the size of the data: the small dataset is only &lt;STRONG&gt;100.3 MB&lt;/STRONG&gt;, while the larger dataset is &lt;STRONG&gt;15.6 GB&lt;/STRONG&gt;.&lt;/P&gt;</description>
      <pubDate>Fri, 28 Oct 2016 09:46:21 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Archives-of-Support-Questions/Spark-Broadcast-Hash-Join-failing-on-800-million-to-1-5/m-p/151639#M44612</guid>
      <dc:creator>adnanalvee</dc:creator>
      <dc:date>2016-10-28T09:46:21Z</dc:date>
    </item>
    <item>
      <title>Re: Spark Broadcast Hash Join failing on 800+ million to 1.5 million</title>
      <link>https://community.cloudera.com/t5/Archives-of-Support-Questions/Spark-Broadcast-Hash-Join-failing-on-800-million-to-1-5/m-p/151640#M44613</link>
      <description>&lt;P&gt;Late reply, but running it on a cluster and increasing memory worked like a charm!&lt;/P&gt;</description>
      <pubDate>Wed, 22 Feb 2017 03:52:42 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Archives-of-Support-Questions/Spark-Broadcast-Hash-Join-failing-on-800-million-to-1-5/m-p/151640#M44613</guid>
      <dc:creator>adnanalvee</dc:creator>
      <dc:date>2017-02-22T03:52:42Z</dc:date>
    </item>
  </channel>
</rss>

