I have a 40-node CDH 5.1 cluster and am attempting to run a simple Spark app that processes about 10-15 GB of raw data, but I keep running into this error:
java.lang.OutOfMemoryError: GC overhead limit exceeded
Each node has 8 cores and 2 GB of memory. I notice the heap size on the executors is set to 512 MB, with the total set to 2 GB. What should the heap size be set to for data of this size?
Thanks for the input!
Where does the out-of-memory exception occur: in your driver, or in an executor? I assume it is an executor. Yes, you are using the default of 512 MB per executor. You can raise that with properties like spark.executor.memory, or flags like --executor-memory if you are using spark-shell.
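For example, with the Spark 1.x that ships in CDH 5.1, any of the following would raise the executor heap (the 2g value here is just an illustration sized to the workers described in this thread, not a recommendation):

```shell
# At launch time, via spark-shell or spark-submit:
spark-shell --executor-memory 2g

# Or as a property in conf/spark-defaults.conf:
#   spark.executor.memory  2g

# Or programmatically, before the SparkContext is created:
#   val conf = new SparkConf().set("spark.executor.memory", "2g")
```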
It sounds like your workers are allocating 2GB for executors, so you could potentially use up to 2GB per executor and your 1 executor per machine would consume all of your Spark cluster memory.
But more memory doesn't necessarily help if you're performing some operation that inherently allocates a great deal of memory. I'm not sure what your operations are. Keep in mind too that if you are caching RDDs in memory, this is taking memory away from what's available for computations.
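To the caching point: in Spark 1.x the slice of each executor heap reserved for cached RDDs is governed by spark.storage.memoryFraction (default 0.6). If you are not caching much, lowering it leaves more of the heap for computation. A sketch, again with an illustrative value rather than a recommendation:

```shell
# conf/spark-defaults.conf -- shrink the RDD cache region from the
# default 60% of the executor heap to 30%, freeing memory for shuffles
# and per-task computation:
#   spark.storage.memoryFraction  0.3
```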
Thanks Sean. I'm currently computing unique visitors per page, running a count distinct using Spark SQL. We also run non-Spark jobs on the cluster, so if we allocate the full 2 GB I'm assuming we can't run any other jobs simultaneously. I'm also looking into how to set the storage levels in CM.
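For what it's worth, a query of the kind described probably looks something like the sketch below (Spark 1.x Spark SQL; the table and column names here are assumptions for illustration, not from the actual job). Note that COUNT(DISTINCT ...) has to hold the set of distinct values per group during the aggregation, which can itself be memory-hungry on heavily skewed pages:

```scala
import org.apache.spark.sql.SQLContext

// sc is an existing SparkContext; "pageviews" is a hypothetical
// registered table with page and visitor_id columns.
val sqlContext = new SQLContext(sc)

// Count distinct visitors per page. The distinct-set per page is
// materialized during aggregation, so skewed pages can dominate memory.
val uniques = sqlContext.sql(
  "SELECT page, COUNT(DISTINCT visitor_id) AS uniques " +
  "FROM pageviews GROUP BY page")

uniques.collect().foreach(println)
```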