<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>question Using Spark in Hive error GC overhead limit exceeded in Support Questions</title>
    <link>https://community.cloudera.com/t5/Support-Questions/Using-Spark-in-Hive-error-GC-overhead-limit-exceeded/m-p/241973#M203776</link>
    <description>&lt;P&gt;Hello all,&lt;BR /&gt;I'm trying to configure HiveServer2 to use Spark as its execution engine. It works perfectly with small files, but with a large file (~1.5 GB) it crashes with "GC overhead limit exceeded".&lt;BR /&gt;&lt;BR /&gt;My flow is simple:&lt;BR /&gt;&lt;/P&gt;&lt;P&gt;1. Load data from a text file (~1.5 GB) into table_text.&lt;BR /&gt;SQL: load data local inpath 'home/abc.txt' into table table_text;&lt;BR /&gt;2. Select data from table_text and insert it into table_orc (the crash happens in this step).&lt;/P&gt;&lt;P&gt;SQL: insert into table table_orc select id,time,data,path,size from table_text;&lt;/P&gt;&lt;P&gt;I guess Spark has to load all the data from table_text and hold it in memory before inserting into table_orc. From my research, I know Spark can be configured so that partitions which do not fit in memory are stored on disk and read back when they're needed (RDD persistence).&lt;BR /&gt;&lt;BR /&gt;My environment:&lt;BR /&gt;Ubuntu 16.04&lt;BR /&gt;Hive version: 2.3.0&lt;/P&gt;&lt;P&gt;Free memory when launching the SQL: 4 GB&lt;/P&gt;&lt;P&gt;My config in hive-site.xml:&lt;/P&gt;&lt;PRE&gt;&amp;lt;property&amp;gt;
  &amp;lt;name&amp;gt;hive.execution.engine&amp;lt;/name&amp;gt;
  &amp;lt;value&amp;gt;spark&amp;lt;/value&amp;gt;
&amp;lt;/property&amp;gt;
&amp;lt;property&amp;gt;
  &amp;lt;name&amp;gt;spark.master&amp;lt;/name&amp;gt;
  &amp;lt;value&amp;gt;local[*]&amp;lt;/value&amp;gt;
&amp;lt;/property&amp;gt;
&amp;lt;property&amp;gt;
  &amp;lt;name&amp;gt;spark.eventLog.enabled&amp;lt;/name&amp;gt;
  &amp;lt;value&amp;gt;true&amp;lt;/value&amp;gt;
&amp;lt;/property&amp;gt;
&amp;lt;property&amp;gt;
  &amp;lt;name&amp;gt;spark.driver.memory&amp;lt;/name&amp;gt;
  &amp;lt;value&amp;gt;12G&amp;lt;/value&amp;gt;
&amp;lt;/property&amp;gt;
&amp;lt;property&amp;gt;
  &amp;lt;name&amp;gt;spark.executor.memory&amp;lt;/name&amp;gt;
  &amp;lt;value&amp;gt;12G&amp;lt;/value&amp;gt;
&amp;lt;/property&amp;gt;
&amp;lt;property&amp;gt;
  &amp;lt;name&amp;gt;spark.serializer&amp;lt;/name&amp;gt;
  &amp;lt;value&amp;gt;org.apache.spark.serializer.KryoSerializer&amp;lt;/value&amp;gt;
&amp;lt;/property&amp;gt;
&amp;lt;property&amp;gt;
  &amp;lt;name&amp;gt;spark.yarn.jars&amp;lt;/name&amp;gt;
  &amp;lt;value&amp;gt;/home/cpu60020-local/Documents/Setup/Java/server/spark/jars/*&amp;lt;/value&amp;gt;
&amp;lt;/property&amp;gt;
&amp;lt;property&amp;gt;
  &amp;lt;name&amp;gt;spark.eventLog.enabled&amp;lt;/name&amp;gt;
  &amp;lt;value&amp;gt;false&amp;lt;/value&amp;gt;
&amp;lt;/property&amp;gt;
&amp;lt;property&amp;gt;
  &amp;lt;name&amp;gt;spark.eventLog.dir&amp;lt;/name&amp;gt;
  &amp;lt;value&amp;gt;/home/cpu60020-local/Documents/Setup/Hive/apache-hive-2.3.0-bin/log/&amp;lt;/value&amp;gt;
&amp;lt;/property&amp;gt;&lt;/PRE&gt;&lt;P&gt;Please tell me if you have any suggestions. Thanks, all!&lt;/P&gt;</description>
    <pubDate>Fri, 11 Jan 2019 14:18:27 GMT</pubDate>
    <dc:creator>thanhlv93</dc:creator>
    <dc:date>2019-01-11T14:18:27Z</dc:date>
    <item>
      <title>Using Spark in Hive error GC overhead limit exceeded</title>
      <link>https://community.cloudera.com/t5/Support-Questions/Using-Spark-in-Hive-error-GC-overhead-limit-exceeded/m-p/241973#M203776</link>
      <description>&lt;P&gt;Hello all,&lt;BR /&gt;I'm trying to configure HiveServer2 to use Spark as its execution engine. It works perfectly with small files, but with a large file (~1.5 GB) it crashes with "GC overhead limit exceeded".&lt;BR /&gt;&lt;BR /&gt;My flow is simple:&lt;BR /&gt;&lt;/P&gt;&lt;P&gt;1. Load data from a text file (~1.5 GB) into table_text.&lt;BR /&gt;SQL: load data local inpath 'home/abc.txt' into table table_text;&lt;BR /&gt;2. Select data from table_text and insert it into table_orc (the crash happens in this step).&lt;/P&gt;&lt;P&gt;SQL: insert into table table_orc select id,time,data,path,size from table_text;&lt;/P&gt;&lt;P&gt;I guess Spark has to load all the data from table_text and hold it in memory before inserting into table_orc. From my research, I know Spark can be configured so that partitions which do not fit in memory are stored on disk and read back when they're needed (RDD persistence).&lt;BR /&gt;&lt;BR /&gt;My environment:&lt;BR /&gt;Ubuntu 16.04&lt;BR /&gt;Hive version: 2.3.0&lt;/P&gt;&lt;P&gt;Free memory when launching the SQL: 4 GB&lt;/P&gt;&lt;P&gt;My config in hive-site.xml:&lt;/P&gt;&lt;PRE&gt;&amp;lt;property&amp;gt;
  &amp;lt;name&amp;gt;hive.execution.engine&amp;lt;/name&amp;gt;
  &amp;lt;value&amp;gt;spark&amp;lt;/value&amp;gt;
&amp;lt;/property&amp;gt;
&amp;lt;property&amp;gt;
  &amp;lt;name&amp;gt;spark.master&amp;lt;/name&amp;gt;
  &amp;lt;value&amp;gt;local[*]&amp;lt;/value&amp;gt;
&amp;lt;/property&amp;gt;
&amp;lt;property&amp;gt;
  &amp;lt;name&amp;gt;spark.eventLog.enabled&amp;lt;/name&amp;gt;
  &amp;lt;value&amp;gt;true&amp;lt;/value&amp;gt;
&amp;lt;/property&amp;gt;
&amp;lt;property&amp;gt;
  &amp;lt;name&amp;gt;spark.driver.memory&amp;lt;/name&amp;gt;
  &amp;lt;value&amp;gt;12G&amp;lt;/value&amp;gt;
&amp;lt;/property&amp;gt;
&amp;lt;property&amp;gt;
  &amp;lt;name&amp;gt;spark.executor.memory&amp;lt;/name&amp;gt;
  &amp;lt;value&amp;gt;12G&amp;lt;/value&amp;gt;
&amp;lt;/property&amp;gt;
&amp;lt;property&amp;gt;
  &amp;lt;name&amp;gt;spark.serializer&amp;lt;/name&amp;gt;
  &amp;lt;value&amp;gt;org.apache.spark.serializer.KryoSerializer&amp;lt;/value&amp;gt;
&amp;lt;/property&amp;gt;
&amp;lt;property&amp;gt;
  &amp;lt;name&amp;gt;spark.yarn.jars&amp;lt;/name&amp;gt;
  &amp;lt;value&amp;gt;/home/cpu60020-local/Documents/Setup/Java/server/spark/jars/*&amp;lt;/value&amp;gt;
&amp;lt;/property&amp;gt;
&amp;lt;property&amp;gt;
  &amp;lt;name&amp;gt;spark.eventLog.enabled&amp;lt;/name&amp;gt;
  &amp;lt;value&amp;gt;false&amp;lt;/value&amp;gt;
&amp;lt;/property&amp;gt;
&amp;lt;property&amp;gt;
  &amp;lt;name&amp;gt;spark.eventLog.dir&amp;lt;/name&amp;gt;
  &amp;lt;value&amp;gt;/home/cpu60020-local/Documents/Setup/Hive/apache-hive-2.3.0-bin/log/&amp;lt;/value&amp;gt;
&amp;lt;/property&amp;gt;&lt;/PRE&gt;&lt;P&gt;Please tell me if you have any suggestions. Thanks, all!&lt;/P&gt;</description>
      <pubDate>Fri, 11 Jan 2019 14:18:27 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Support-Questions/Using-Spark-in-Hive-error-GC-overhead-limit-exceeded/m-p/241973#M203776</guid>
      <dc:creator>thanhlv93</dc:creator>
      <dc:date>2019-01-11T14:18:27Z</dc:date>
    </item>
    <item>
      <title>Re: Using Spark in Hive error GC overhead limit exceeded</title>
      <link>https://community.cloudera.com/t5/Support-Questions/Using-Spark-in-Hive-error-GC-overhead-limit-exceeded/m-p/241974#M203777</link>
      <description>&lt;P&gt;After increasing the heap size in hive-env.sh to 4 GB, it works perfectly without OOM:&lt;BR /&gt;&lt;BR /&gt;export HADOOP_HEAPSIZE=4096&lt;/P&gt;</description>
      <pubDate>Fri, 11 Jan 2019 17:42:30 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Support-Questions/Using-Spark-in-Hive-error-GC-overhead-limit-exceeded/m-p/241974#M203777</guid>
      <dc:creator>thanhlv93</dc:creator>
      <dc:date>2019-01-11T17:42:30Z</dc:date>
    </item>
  </channel>
</rss>