Using Spark in Hive error GC overhead limit exceeded

New Contributor

Hello all,
I'm trying to configure HiveServer2 to use Spark as its execution engine. It works perfectly with small files, but with a large file (~1.5 GB) it crashes with "GC overhead limit exceeded".

My flow is simple (a fuller end-to-end sketch follows below):

1. Load data from a text file (~1.5 GB) into table_text:
   SQL: load data local inpath 'home/abc.txt' into table table_text;
2. Select data from table_text and insert it into table_orc (this is the step that crashes):
   SQL: insert into table table_orc select id, time, data, path, size from table_text;
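For context, here is roughly what the flow looks like end to end. The table definitions below are only illustrative (column types, delimiter and the exact DDL are simplified, not exactly what I use):

-- Illustrative DDL: column types and delimiter are assumptions, only the column names are real.
CREATE TABLE IF NOT EXISTS table_text (
  id BIGINT,
  `time` STRING,
  data STRING,
  path STRING,
  size BIGINT
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
STORED AS TEXTFILE;

CREATE TABLE IF NOT EXISTS table_orc (
  id BIGINT,
  `time` STRING,
  data STRING,
  path STRING,
  size BIGINT
)
STORED AS ORC;

-- Step 1: load the ~1.5 GB local text file into the staging table (path as written above).
LOAD DATA LOCAL INPATH 'home/abc.txt' INTO TABLE table_text;

-- Step 2: rewrite the data as ORC; this is the statement that fails with "GC overhead limit exceeded".
INSERT INTO TABLE table_orc
SELECT id, `time`, data, path, size FROM table_text;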

My guess is that Spark has to load all of the data from table_text and keep it in memory before inserting into table_orc. From what I've read, Spark can be configured so that partitions that don't fit in memory are stored on disk and read back from there when they're needed (RDD persistence).
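For reference, in a standalone Spark job that spill-to-disk behaviour is requested through the RDD persist API with the MEMORY_AND_DISK storage level. I haven't found an equivalent knob for the intermediate data that Hive on Spark produces, so the snippet below (file path and app name are just placeholders) is only to illustrate the mechanism I mean:

# Plain PySpark illustration of RDD persistence; this is not a Hive-on-Spark setting.
from pyspark import SparkContext, StorageLevel

sc = SparkContext(appName="persistence-sketch")

# Read the same ~1.5 GB text file as an RDD of lines (path is a placeholder).
lines = sc.textFile("home/abc.txt")

# MEMORY_AND_DISK keeps as many partitions in memory as fit and spills the rest
# to local disk, re-reading them from disk instead of recomputing when needed.
lines.persist(StorageLevel.MEMORY_AND_DISK)

print(lines.count())

sc.stop()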

My environment:
Ubuntu 16.04
Hive version : 2.3.0

Free memory when launching the query: 4 GB

My config in hive-site.xml:

<property>
  <name>hive.execution.engine</name>
  <value>spark</value>
</property>
<property>
  <name>spark.master</name>
  <value>local[*]</value>
</property>
<property>
  <name>spark.eventLog.enabled</name>
  <value>true</value>
</property>
<property>
  <name>spark.driver.memory</name>
  <value>12G</value>
</property>
<property>
  <name>spark.executor.memory</name>
  <value>12G</value>
</property>
<property>
  <name>spark.serializer</name>
  <value>org.apache.spark.serializer.KryoSerializer</value>
</property>
<property>
  <name>spark.yarn.jars</name>
  <value>/home/cpu60020-local/Documents/Setup/Java/server/spark/jars/*</value>
</property>
<property>
  <name>spark.eventLog.enabled</name>
  <value>false</value>
</property>
<property>
  <name>spark.eventLog.dir</name>
  <value>/home/cpu60020-local/Documents/Setup/Hive/apache-hive-2.3.0-bin/log/</value>
</property>

Please let me know if you have any suggestions. Thanks all!

1 ACCEPTED SOLUTION

New Contributor

After increasing the heap size in hive-env.sh to 4 GB, it works perfectly without OOM.

export HADOOP_HEAPSIZE=4096
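In case it helps anyone else, the line goes into hive-env.sh in Hive's conf directory (copy hive-env.sh.template if the file doesn't exist yet); a minimal sketch, assuming the default apache-hive-2.3.0-bin layout:

# apache-hive-2.3.0-bin/conf/hive-env.sh  (path assumed from the layout in the question)
# HADOOP_HEAPSIZE is the maximum JVM heap, in MB, for Hive processes started
# through the Hadoop scripts (HiveServer2, Hive CLI), so 4096 = 4 GB.
export HADOOP_HEAPSIZE=4096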

