I am trying to read a Tomcat log file (around 5 GB) and store the data in Hive using Spark. After reading the log file, my DataFrame size is around 100K. But when I try to insert it into Hive, I get a "java.lang.OutOfMemoryError: Java heap space" error in the driver. The code is something like this:
spark.sql("insert into table com.pointsData select * from temptable")
where "temptable" is my DataFrame in Spark.
Can anyone help me with a workaround? For example, can I split the DF and run the insert in small chunks?
Please note that I am already using the maximum of my driver system's memory and cannot increase it any further, and I am using Kryo.
Thanks in advance...
Can you please share some more info? Why is data flowing back to your driver and resulting in the OOM?
Actually, my requirement is something like this:
1) Read the data from the file
2) Do some filter operations on the data
3) Store the results back in Hive for other applications
4) View the data in Zeppelin from Hive
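Steps 1 to 3 could be sketched roughly as follows in PySpark. This is only an illustration under several assumptions: the function name `run_daily_job`, its parameters, and the example filter are hypothetical, the target table is assumed to already exist with columns that match the DataFrame by position, and the actual log parsing is left out. Step 4 (viewing in Zeppelin) is not covered.

```python
def run_daily_job(spark, log_path, target_table, keep_condition):
    """Read a log file, filter it, and append the result to a Hive table.

    spark          -- an existing SparkSession (assumed to have Hive support enabled)
    log_path       -- path of the log file to read (hypothetical parameter)
    target_table   -- name of an existing Hive table (hypothetical parameter)
    keep_condition -- SQL expression string selecting the rows to keep
    """
    # Step 1: one row per log line, in a single string column named "value".
    lines = spark.read.text(log_path)

    # Step 2: keep only the rows that satisfy the SQL condition.
    kept = lines.filter(keep_condition)

    # Step 3: append to the Hive table through the DataFrame writer.
    # insertInto matches columns by position, not by name.
    kept.write.mode("append").insertInto(target_table)
    return kept


# Example call, assuming a live session named `spark`:
# run_daily_job(spark, "/path/to/catalina.log", "com.pointsData",
#               "value LIKE '%ERROR%'")
```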
One way of doing this is to reduce the number of records flowing from Spark to Hive. Use a filter condition to cut down the records, and split the insert into multiple inserts. That should work and reduce the number of records held in memory at once. Also, when you insert from Spark into Hive, I'm not sure why, but there is a high chance that a lot of data is moving through shuffles. If possible, attach the complete logs. Hope it helps!
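As a rough sketch of the "multiple inserts" idea, the helper below builds one INSERT statement per hash bucket, so each statement moves only a slice of the rows. The helper name `chunked_insert_statements`, the key column `request_id`, and the chunk count are hypothetical; `pmod` and `hash` are Spark SQL built-in functions. Whether this removes the driver OOM is not confirmed, since the cause has not been established in this thread.

```python
def chunked_insert_statements(source_view, target_table, key_col, num_chunks):
    """Build one INSERT statement per hash bucket of key_col.

    Every row falls into exactly one bucket, so running all the
    statements inserts each row once.
    """
    if num_chunks < 1:
        raise ValueError("num_chunks must be at least 1")
    statements = []
    for bucket in range(num_chunks):
        statements.append(
            f"INSERT INTO TABLE {target_table} "
            f"SELECT * FROM {source_view} "
            f"WHERE pmod(hash({key_col}), {num_chunks}) = {bucket}"
        )
    return statements


if __name__ == "__main__":
    # "request_id" is a made-up column name; use any column of temptable.
    for stmt in chunked_insert_statements(
        "temptable", "com.pointsData", "request_id", 4
    ):
        print(stmt)  # with a live session: spark.sql(stmt)
```

Running the statements one after another keeps each insert small, at the cost of scanning the source once per chunk unless it is cached.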
@Bala, sorry for the very late response.
Actually, my purpose is to read some data files (server logs), transform them into a proper format, and prepare a data warehouse (in my case, Hive) for later analysis.
So, in my project I have mainly 3 different activities:
1) Read and transform data from txt/log files (for which I am using Spark; frequency: daily job)
2) Prepare a data warehouse with the daily data (for which I am inserting the Spark DF into a Hive table; frequency: daily job)
3) Show the results (for this I am again using Spark SQL along with Hive, as that is faster than using only Hive queries, and I will use Zeppelin or Tableau for data visualization; frequency: weekly job or as required)
From my reading and understanding, I guess Spark SQL alone + cache will be much faster than Spark + Hive, but I think I do not have any other option, as I have to do analysis on repository data.
Do you suggest any other approach for this use case?