Hello all, I'm trying to configure HiveServer2 to use Spark as its execution engine. It works perfectly with small files, but with a large file (~1.5 GB) it crashes with "GC overhead limit exceeded".
My flow is simple:

1. Load data from a text file (~1.5 GB) into table_text.
   SQL: LOAD DATA LOCAL INPATH 'home/abc.txt' INTO TABLE table_text;
2. Select the data from table_text and insert it into table_orc (this is the step that crashes).
   SQL: INSERT INTO TABLE table_orc SELECT id, time, data, path, size FROM table_text;
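In case it helps, the two tables look roughly like this (the column types and field delimiter below are guesses for illustration, not my exact DDL):

    -- Source table backed by the plain text file
    CREATE TABLE table_text (
      id   BIGINT,
      time STRING,
      data STRING,
      path STRING,
      size BIGINT
    )
    ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
    STORED AS TEXTFILE;

    -- Destination table with the same columns, stored as ORC
    CREATE TABLE table_orc (
      id   BIGINT,
      time STRING,
      data STRING,
      path STRING,
      size BIGINT
    )
    STORED AS ORC;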
I guess Spark has to load all the data from table_text and hold it in memory before inserting into table_orc. From my research, I know Spark can be configured so that, if the data does not fit in memory, it stores the partitions that don't fit on disk and reads them from there when they're needed (RDD persistence, i.e. the MEMORY_AND_DISK storage level).
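From what I've read, these are the kinds of memory settings involved; the property names come from the Spark/Hive documentation, but the values below are arbitrary examples, not something I have verified fixes the problem:

    -- Run Hive queries on the Spark engine
    SET hive.execution.engine=spark;
    -- Heap size for each Spark executor
    SET spark.executor.memory=4g;
    -- Fraction of the executor heap shared by execution and storage (Spark 1.6+)
    SET spark.memory.fraction=0.6;
    -- Portion of that fraction protected for cached/persisted data
    SET spark.memory.storageFraction=0.5;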