I am trying to read a Tomcat log file (around 5 GB) and store the data in Hive using Spark. After reading the log file, my DataFrame size is around 100K. But when I try to insert the data into Hive, I get a "java.lang.OutOfMemoryError: Java heap space" error in the driver. The code is something like this:
spark.sql("insert into table com.pointsData select * from temptable")
where "temptable" is my DataFrame in Spark.
Can anyone help me out with a workaround? For example, can I split the DataFrame and run the insert in small chunks?
Please note that I am already using the maximum of my driver system's memory; I cannot increase it any further, and I am using Kryo.
One way of doing this is to reduce the number of records flowing from Spark to Hive. Use a filter condition to limit the records selected, and split the load into multiple inserts. That should reduce the number of records held in memory at once. Also, when you insert from Spark into Hive, I'm not sure why, but there is a good chance that a lot of data is moving through shuffles. If possible, attach the complete logs. Hope it helps!
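As a rough sketch of the "filter and insert in multiple batches" idea: the snippet below builds one INSERT statement per bucket, using Spark SQL's built-in `hash` and `pmod` functions in the WHERE clause. The column name `id`, the bucket count, and the helper names `build_chunked_inserts` and `run_chunked_inserts` are assumptions for illustration; they are not from the original post. Whether this actually relieves the driver heap depends on what is causing the error, so treat it as something to try rather than a guaranteed fix.

```python
def build_chunked_inserts(target, source, chunk_col, num_chunks):
    """Return one INSERT statement per hash bucket of `chunk_col`.

    `chunk_col` is a hypothetical column; pick any column that
    spreads the rows reasonably evenly across buckets.
    """
    return [
        f"INSERT INTO TABLE {target} SELECT * FROM {source} "
        f"WHERE pmod(hash({chunk_col}), {num_chunks}) = {i}"
        for i in range(num_chunks)
    ]


def run_chunked_inserts(spark, statements):
    """Run the statements one at a time on an existing SparkSession."""
    for stmt in statements:
        spark.sql(stmt)


# Example: split the insert from the question into 4 smaller inserts.
# "id" is an assumed column name in temptable.
statements = build_chunked_inserts("com.pointsData", "temptable", "id", 4)
# run_chunked_inserts(spark, statements)  # needs a live SparkSession
```

Note that each statement scans temptable again, so caching the DataFrame first may be worth it if memory allows.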
Actually, my purpose is to read some data files (server logs), transform them into a proper format,
and prepare a data warehouse (Hive, in my case) for analysis.
So, my project has mainly 3 different activities:
1) Read and transform data from txt/log files (for which I am using Spark; frequency: daily job).
2) Prepare a data warehouse with that daily data (for which I am inserting the Spark DataFrames into a Hive table; frequency: daily job).
3) Show the results (for this I am again using Spark SQL along with Hive, as that is faster than using Hive queries alone, and I will use Zeppelin or Tableau for data visualization; frequency: weekly job or as required).
From my reading and understanding, I guess Spark SQL alone with caching would be
much faster than Spark + Hive, but I don't think I have any other option, as I
have to do analysis on the repository data.
Do you suggest any other approach for this use case?