Hello all,
I'm trying to configure HiveServer2 to use Spark as the execution engine. It works perfectly with small files, but with a large file (~1.5 GB) it crashes with "GC overhead limit exceeded".
My flow is simple:
1. Load data from a text file (~1.5 GB) into table_text.
SQL: load data local inpath 'home/abc.txt' into table table_text;
2. Select the data from table_text and insert it into table_orc (this is the step that crashes; both steps are sketched just below this list).
SQL: insert into table table_orc select id, time, data, path, size from table_text;
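To make the flow concrete, here it is end to end. I've simplified the DDL: the real tables have the same columns, but the types and the delimiter below are just examples, not my exact definitions.

-- Staging table backed by plain text (types/delimiter are examples)
create table table_text (
  id bigint,
  time string,
  data string,
  path string,
  size bigint
)
row format delimited fields terminated by '\t'
stored as textfile;

-- Target table, same columns, stored as ORC
create table table_orc (
  id bigint,
  time string,
  data string,
  path string,
  size bigint
)
stored as orc;

-- Step 1: load the ~1.5 GB text file into the staging table (works fine)
load data local inpath 'home/abc.txt' into table table_text;

-- Step 2: rewrite into ORC -- this is where "GC overhead limit exceeded" happens
insert into table table_orc select id, time, data, path, size from table_text;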
My guess is that Spark has to load all the data from table_text and hold it in memory before inserting into table_orc. From what I've researched, Spark can be configured so that when data does not fit in memory, it stores the partitions that don't fit on disk and reads them from there when they're needed (RDD persistence).
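From what I've read, the closest knobs on the Hive-on-Spark side seem to be Spark's memory-management properties, which can also go into hive-site.xml. I have not set these yet and the values below are only examples, so please correct me if this is the wrong direction:

<property>
  <name>spark.memory.fraction</name>
  <!-- share of the heap Spark uses for execution and storage (example value) -->
  <value>0.6</value>
</property>
<property>
  <name>spark.memory.storageFraction</name>
  <!-- part of that share reserved for cached/persisted data (example value) -->
  <value>0.5</value>
</property>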
My environment:
Ubuntu 16.04
Hive version : 2.3.0
Free memory when launching the query: 4 GB
My config in hive-site.xml:
<property>
  <name>hive.execution.engine</name>
  <value>spark</value>
</property>
<property>
  <name>spark.master</name>
  <value>local[*]</value>
</property>
<property>
  <name>spark.eventLog.enabled</name>
  <value>true</value>
</property>
<property>
  <name>spark.driver.memory</name>
  <value>12G</value>
</property>
<property>
  <name>spark.executor.memory</name>
  <value>12G</value>
</property>
<property>
  <name>spark.serializer</name>
  <value>org.apache.spark.serializer.KryoSerializer</value>
</property>
<property>
  <name>spark.yarn.jars</name>
  <value>/home/cpu60020-local/Documents/Setup/Java/server/spark/jars/*</value>
</property>
<property>
  <name>spark.eventLog.enabled</name>
  <value>false</value>
</property>
<property>
  <name>spark.eventLog.dir</name>
  <value>/home/cpu60020-local/Documents/Setup/Hive/apache-hive-2.3.0-bin/log/</value>
</property>
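For what it's worth, this is how I check from Beeline which values the Spark session actually picked up, and how I understand per-session overrides work (example value only; my understanding is that with spark.master=local[*] the job runs inside the HiveServer2 JVM, so its heap is what really matters):

-- print the values the current session sees
set hive.execution.engine;
set spark.driver.memory;
set spark.executor.memory;

-- spark.* properties can also be overridden for the session before running the insert (example value)
set spark.executor.memory=4g;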
Please let me know if you have any suggestions. Thanks, all!