I'm trying to run a Spark application on YARN on a single-node instance with 32G of RAM. It works well for a small dataset, but for a bigger table it fails with this error:
Thanks for your reply.
You are right. I saw this in the executor logs:
Exception in thread "qtp1529675476-45" java.lang.OutOfMemoryError: GC overhead limit exceeded
Exception in thread "Spark Context Cleaner" java.lang.OutOfMemoryError: GC overhead limit exceeded
Exception in thread "task-result-getter-2" java.lang.OutOfMemoryError: GC overhead limit exceeded
What can I do to fix this?
I'm using Spark on YARN, and the Spark memory allocation is dynamic.
Also, my Hive table is around 70G. Does that mean I need 70G of memory for Spark to process it?
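To be concrete about what I mean by dynamic allocation, the settings look roughly like this (a sketch with placeholder values, not my exact configuration):

import org.apache.spark.sql.SparkSession

// Illustrative only: standard dynamic-allocation settings for Spark on YARN,
// with placeholder app name and memory value
val spark = SparkSession.builder()
  .appName("hive-table-job")                          // placeholder app name
  .enableHiveSupport()
  .config("spark.dynamicAllocation.enabled", "true")
  .config("spark.shuffle.service.enabled", "true")    // needed for dynamic allocation on YARN
  .config("spark.executor.memory", "4g")              // example value only
  .getOrCreate()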
In your Spark UI, do you see the job running with a large number of partitions (i.e. a large number of tasks)? If you only have a small number of partitions, it could be that you are loading all 70G into memory at once.
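As a rough sketch (assuming you read the table through a SparkSession called spark; the table name and partition count below are placeholders), you can check and raise the partition count like this:

// Read the Hive table and check how many partitions it was split into
val df = spark.table("my_big_table")
println(s"partitions: ${df.rdd.getNumPartitions}")

// If that number is small, repartition so each task only handles a small slice of the 70G
val repartitioned = df.repartition(400)   // 400 is just an example value

More partitions mean smaller tasks, so each executor only has to hold a fraction of the table in memory at a time.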
It could also be that you have one huge partition with 99% of the data and lots of small ones. When Spark processes that huge partition, it will load all of it into memory. This can happen if you are mapping to a tuple, e.g. (x, y), and the key (x) is the same for 99% of the data.
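If you suspect that kind of skew, a quick way to check (again a sketch; the table and column names are placeholders) is to count the rows per key:

import org.apache.spark.sql.functions.desc

// Count rows per key; one key holding ~99% of the rows means one giant partition after the shuffle
val perKey = spark.table("my_big_table")
  .groupBy("x")
  .count()
  .orderBy(desc("count"))

perKey.show(5)   // the heaviest keys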
Have a look at your Spark UI to see the size of the tasks you are running. It's likely that you will see a small number of tasks, or one huge task and a lot of small ones.
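If you want to confirm it outside the UI, one rough way (same placeholder table name as above) is to count the records in each partition directly:

// Count records per partition; one giant entry points to the skewed-partition problem
val partitionSizes = spark.table("my_big_table").rdd
  .mapPartitionsWithIndex { (idx, rows) => Iterator((idx, rows.size)) }
  .collect()

partitionSizes.sortBy(_._2).reverse.take(5).foreach(println)   // the biggest partitions first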