
Using Spark in Hive error GC overhead limit exceeded


New Contributor

Hello all,
I'm trying to configure HiveServer2 to use Spark as its execution engine. It works perfectly with small files, but with a large file (~1.5 GB) it crashes with "GC overhead limit exceeded".

My flow is simple:

1. Load data from a text file (~1.5 GB) into table_text
SQL: LOAD DATA LOCAL INPATH 'home/abc.txt' INTO TABLE table_text;

2. Insert the data from table_text into table_orc (this is the step that crashes)
SQL: INSERT INTO TABLE table_orc SELECT id, time, data, path, size FROM table_text;
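
For reference, a minimal sketch of the table definitions this flow assumes; only the column names come from the query above, while the types and the field delimiter are illustrative guesses:

-- Staging table that LOAD DATA fills from the text file (types/delimiter assumed)
CREATE TABLE table_text (
  id     BIGINT,
  `time` STRING,
  `data` STRING,
  `path` STRING,
  `size` BIGINT
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
STORED AS TEXTFILE;

-- Target table with the same columns, stored as ORC
CREATE TABLE table_orc (
  id     BIGINT,
  `time` STRING,
  `data` STRING,
  `path` STRING,
  `size` BIGINT
)
STORED AS ORC;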

I guess Spark has to load all the data from table_text and hold it in memory before inserting into table_orc. From what I've read, Spark can be configured so that partitions that don't fit in memory are stored on disk and read from there when needed (RDD persistence).
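
For reference, Spark properties can also be set per session from the Hive prompt; the lines below are only a sketch of where such tuning would go, using Spark's documented unified-memory settings with their default values, not a confirmed fix:

-- Fraction of the JVM heap (minus a small reserve) shared by Spark execution and storage; 0.6 is Spark's default
set spark.memory.fraction=0.6;
-- Portion of that shared region protected for cached/persisted data; 0.5 is Spark's default
set spark.memory.storageFraction=0.5;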

My environment:
Ubuntu 16.04
Hive version: 2.3.0

Free memory when launching the query: 4 GB

My config in hive-site.xml:

<property>
  <name>hive.execution.engine</name>
  <value>spark</value>
</property>
<property>
  <name>spark.master</name>
  <value>local[*]</value>
</property>
<property>
  <name>spark.eventLog.enabled</name>
  <value>true</value>
</property>
<property>
  <name>spark.driver.memory</name>
  <value>12G</value>
</property>
<property>
  <name>spark.executor.memory</name>
  <value>12G</value>
</property>
<property>
  <name>spark.serializer</name>
  <value>org.apache.spark.serializer.KryoSerializer</value>
</property>
<property>
  <name>spark.yarn.jars</name>
  <value>/home/cpu60020-local/Documents/Setup/Java/server/spark/jars/*</value>
</property>
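<!-- Note: spark.eventLog.enabled is set a second time below with a different value than above. -->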
<property>
  <name>spark.eventLog.enabled</name>
  <value>false</value>
</property>
<property>
  <name>spark.eventLog.dir</name>
  <value>/home/cpu60020-local/Documents/Setup/Hive/apache-hive-2.3.0-bin/log/</value>
</property>

Please let me know if you have any suggestions. Thanks, all!

1 ACCEPTED SOLUTION

Re: Using Spark in Hive error GC overhead limit exceeded

New Contributor

After increasing the heap size in hive-env.sh to 4 GB, it works perfectly with no OOM:

export HADOOP_HEAPSIZE=4096
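
For anyone who hits the same issue, the hive-env.sh line in context, with a comment on why it helps (my understanding, given spark.master=local[*]):

# hive-env.sh
# With spark.master=local[*], Spark runs inside the Hive/HiveServer2 JVM itself,
# so that JVM's heap (HADOOP_HEAPSIZE, in MB) is what actually limits the query,
# not spark.driver.memory from hive-site.xml.
export HADOOP_HEAPSIZE=4096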
