<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>question Using Spark in Hive error GC overhead limit exceeded in Support Questions</title>
    <link>https://community.cloudera.com/t5/Support-Questions/Using-Spark-in-Hive-error-GC-overhead-limit-exceeded/m-p/241973#M203776</link>
    <description>&lt;P&gt;Hello all,&lt;BR /&gt;I'm trying to configure HiveServer2 to use Spark as its execution engine. It works perfectly with small files, but with a large file (~1.5 GB) it crashes with "GC overhead limit exceeded".&lt;BR /&gt;&lt;BR /&gt;My flow is simple:&lt;BR /&gt;&lt;/P&gt;&lt;P&gt;1. Load data from a text file (~1.5 GB) into table_text.&lt;BR /&gt;SQL: load data local inpath 'home/abc.txt' into table table_text;&lt;BR /&gt;2. Select data from table_text and insert it into table_orc (the crash happens in this step).&lt;/P&gt;&lt;P&gt;SQL: insert into table table_orc select id,time,data,path,size from table_text;&lt;/P&gt;&lt;P&gt;I guess Spark has to load all the data from table_text and hold it in memory before inserting into table_orc. From my research, I know Spark can be configured so that partitions which do not fit in memory are stored on disk and read back when they're needed (RDD persistence).&lt;BR /&gt;&lt;BR /&gt;My environment:&lt;BR /&gt;Ubuntu 16.04&lt;BR /&gt;Hive version: 2.3.0&lt;/P&gt;&lt;P&gt;Free memory when launching the SQL: 4 GB&lt;/P&gt;&lt;P&gt;My config in hive-site.xml:&lt;/P&gt;&lt;PRE&gt;&amp;lt;property&amp;gt;
  &amp;lt;name&amp;gt;hive.execution.engine&amp;lt;/name&amp;gt;
  &amp;lt;value&amp;gt;spark&amp;lt;/value&amp;gt;
&amp;lt;/property&amp;gt;
&amp;lt;property&amp;gt;
  &amp;lt;name&amp;gt;spark.master&amp;lt;/name&amp;gt;
  &amp;lt;value&amp;gt;local[*]&amp;lt;/value&amp;gt;
&amp;lt;/property&amp;gt;
&amp;lt;property&amp;gt;
  &amp;lt;name&amp;gt;spark.eventLog.enabled&amp;lt;/name&amp;gt;
  &amp;lt;value&amp;gt;true&amp;lt;/value&amp;gt;
&amp;lt;/property&amp;gt;
&amp;lt;property&amp;gt;
  &amp;lt;name&amp;gt;spark.driver.memory&amp;lt;/name&amp;gt;
  &amp;lt;value&amp;gt;12G&amp;lt;/value&amp;gt;
&amp;lt;/property&amp;gt;
&amp;lt;property&amp;gt;
  &amp;lt;name&amp;gt;spark.executor.memory&amp;lt;/name&amp;gt;
  &amp;lt;value&amp;gt;12G&amp;lt;/value&amp;gt;
&amp;lt;/property&amp;gt;
&amp;lt;property&amp;gt;
  &amp;lt;name&amp;gt;spark.serializer&amp;lt;/name&amp;gt;
  &amp;lt;value&amp;gt;org.apache.spark.serializer.KryoSerializer&amp;lt;/value&amp;gt;
&amp;lt;/property&amp;gt;
&amp;lt;property&amp;gt;
  &amp;lt;name&amp;gt;spark.yarn.jars&amp;lt;/name&amp;gt;
  &amp;lt;value&amp;gt;/home/cpu60020-local/Documents/Setup/Java/server/spark/jars/*&amp;lt;/value&amp;gt;
&amp;lt;/property&amp;gt;
&amp;lt;property&amp;gt;
  &amp;lt;name&amp;gt;spark.eventLog.enabled&amp;lt;/name&amp;gt;
  &amp;lt;value&amp;gt;false&amp;lt;/value&amp;gt;
&amp;lt;/property&amp;gt;
&amp;lt;property&amp;gt;
  &amp;lt;name&amp;gt;spark.eventLog.dir&amp;lt;/name&amp;gt;
  &amp;lt;value&amp;gt;/home/cpu60020-local/Documents/Setup/Hive/apache-hive-2.3.0-bin/log/&amp;lt;/value&amp;gt;
&amp;lt;/property&amp;gt;&lt;/PRE&gt;&lt;P&gt;Please tell me if you have any suggestions. Thanks, all!&lt;/P&gt;</description>
    <pubDate>Fri, 11 Jan 2019 14:18:27 GMT</pubDate>
    <dc:creator>thanhlv93</dc:creator>
    <dc:date>2019-01-11T14:18:27Z</dc:date>
    <item>
      <title>Using Spark in Hive error GC overhead limit exceeded</title>
      <link>https://community.cloudera.com/t5/Support-Questions/Using-Spark-in-Hive-error-GC-overhead-limit-exceeded/m-p/241973#M203776</link>
      <description>&lt;P&gt;Hello all,&lt;BR /&gt;I'm trying to configure HiveServer2 to use Spark as its execution engine. It works perfectly with small files, but with a large file (~1.5 GB) it crashes with "GC overhead limit exceeded".&lt;BR /&gt;&lt;BR /&gt;My flow is simple:&lt;BR /&gt;&lt;/P&gt;&lt;P&gt;1. Load data from a text file (~1.5 GB) into table_text.&lt;BR /&gt;SQL: load data local inpath 'home/abc.txt' into table table_text;&lt;BR /&gt;2. Select data from table_text and insert it into table_orc (the crash happens in this step).&lt;/P&gt;&lt;P&gt;SQL: insert into table table_orc select id,time,data,path,size from table_text;&lt;/P&gt;&lt;P&gt;I guess Spark has to load all the data from table_text and hold it in memory before inserting into table_orc. From my research, I know Spark can be configured so that partitions which do not fit in memory are stored on disk and read back when they're needed (RDD persistence).&lt;BR /&gt;&lt;BR /&gt;My environment:&lt;BR /&gt;Ubuntu 16.04&lt;BR /&gt;Hive version: 2.3.0&lt;/P&gt;&lt;P&gt;Free memory when launching the SQL: 4 GB&lt;/P&gt;&lt;P&gt;My config in hive-site.xml:&lt;/P&gt;&lt;PRE&gt;&amp;lt;property&amp;gt;
  &amp;lt;name&amp;gt;hive.execution.engine&amp;lt;/name&amp;gt;
  &amp;lt;value&amp;gt;spark&amp;lt;/value&amp;gt;
&amp;lt;/property&amp;gt;
&amp;lt;property&amp;gt;
  &amp;lt;name&amp;gt;spark.master&amp;lt;/name&amp;gt;
  &amp;lt;value&amp;gt;local[*]&amp;lt;/value&amp;gt;
&amp;lt;/property&amp;gt;
&amp;lt;property&amp;gt;
  &amp;lt;name&amp;gt;spark.eventLog.enabled&amp;lt;/name&amp;gt;
  &amp;lt;value&amp;gt;true&amp;lt;/value&amp;gt;
&amp;lt;/property&amp;gt;
&amp;lt;property&amp;gt;
  &amp;lt;name&amp;gt;spark.driver.memory&amp;lt;/name&amp;gt;
  &amp;lt;value&amp;gt;12G&amp;lt;/value&amp;gt;
&amp;lt;/property&amp;gt;
&amp;lt;property&amp;gt;
  &amp;lt;name&amp;gt;spark.executor.memory&amp;lt;/name&amp;gt;
  &amp;lt;value&amp;gt;12G&amp;lt;/value&amp;gt;
&amp;lt;/property&amp;gt;
&amp;lt;property&amp;gt;
  &amp;lt;name&amp;gt;spark.serializer&amp;lt;/name&amp;gt;
  &amp;lt;value&amp;gt;org.apache.spark.serializer.KryoSerializer&amp;lt;/value&amp;gt;
&amp;lt;/property&amp;gt;
&amp;lt;property&amp;gt;
  &amp;lt;name&amp;gt;spark.yarn.jars&amp;lt;/name&amp;gt;
  &amp;lt;value&amp;gt;/home/cpu60020-local/Documents/Setup/Java/server/spark/jars/*&amp;lt;/value&amp;gt;
&amp;lt;/property&amp;gt;
&amp;lt;property&amp;gt;
  &amp;lt;name&amp;gt;spark.eventLog.enabled&amp;lt;/name&amp;gt;
  &amp;lt;value&amp;gt;false&amp;lt;/value&amp;gt;
&amp;lt;/property&amp;gt;
&amp;lt;property&amp;gt;
  &amp;lt;name&amp;gt;spark.eventLog.dir&amp;lt;/name&amp;gt;
  &amp;lt;value&amp;gt;/home/cpu60020-local/Documents/Setup/Hive/apache-hive-2.3.0-bin/log/&amp;lt;/value&amp;gt;
&amp;lt;/property&amp;gt;&lt;/PRE&gt;&lt;P&gt;Please tell me if you have any suggestions. Thanks, all!&lt;/P&gt;</description>
      <pubDate>Fri, 11 Jan 2019 14:18:27 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Support-Questions/Using-Spark-in-Hive-error-GC-overhead-limit-exceeded/m-p/241973#M203776</guid>
      <dc:creator>thanhlv93</dc:creator>
      <dc:date>2019-01-11T14:18:27Z</dc:date>
    </item>
    <item>
      <title>Re: Using Spark in Hive error GC overhead limit exceeded</title>
      <link>https://community.cloudera.com/t5/Support-Questions/Using-Spark-in-Hive-error-GC-overhead-limit-exceeded/m-p/241974#M203777</link>
      <description>&lt;P&gt;After increasing the heap size in hive-env.sh to 4 GB, it works perfectly without OOM:&lt;BR /&gt;&lt;BR /&gt;export HADOOP_HEAPSIZE=4096&lt;/P&gt;</description>
      <pubDate>Fri, 11 Jan 2019 17:42:30 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Support-Questions/Using-Spark-in-Hive-error-GC-overhead-limit-exceeded/m-p/241974#M203777</guid>
      <dc:creator>thanhlv93</dc:creator>
      <dc:date>2019-01-11T17:42:30Z</dc:date>
    </item>
  </channel>
</rss>