
spark.yarn.executor.memoryOverhead...

New Contributor

Does anyone know exactly what spark.yarn.executor.memoryOverhead is used for and why it may be using up so much space? If I could, I would love to have a peek inside this stack. Spark's description is as follows:

The amount of off-heap memory (in megabytes) to be allocated per executor. This is memory that accounts for things like VM overheads, interned strings, other native overheads, etc. This tends to grow with the executor size (typically 6-10%).

The problem I'm having is that when running Spark queries on large datasets (> 5 TB), I have to set the executor memoryOverhead to 8 GB or the job throws an exception and dies. What is being stored in this overhead space that it needs 8 GB per container?

I've also noticed that this error doesn't occur in standalone mode, because standalone mode doesn't use YARN.

Note: my configuration for this job is:

executor memory = 15G
executor cores = 5
yarn.executor.memoryOverhead = 8GB
max executors = 60
offHeap.enabled = false
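
For clarity, this is roughly how those settings translate into Spark properties (a sketch only: spark.yarn.executor.memoryOverhead takes its value in MB, and I'm treating "max executors" as the dynamic allocation cap and "offHeap.enabled" as spark.memory.offHeap.enabled):

import org.apache.spark.SparkConf

// Sketch only: the settings above expressed as Spark properties.
val conf = new SparkConf()
  .set("spark.executor.memory", "15g")                // executor JVM heap
  .set("spark.executor.cores", "5")
  .set("spark.yarn.executor.memoryOverhead", "8192")  // off-heap overhead, in MB
  .set("spark.dynamicAllocation.maxExecutors", "60")  // assuming "max executors" means the dynamic allocation cap
  .set("spark.memory.offHeap.enabled", "false")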

5 REPLIES


Hi Henry,

You may be interested in this article: http://www.wdong.org/wordpress/blog/2015/01/08/spark-on-yarn-where-have-all-my-memory-gone/

The link seems to be dead at the moment (here is a cached version: http://m.blog.csdn.net/article/details?id=50387104)

New Contributor

Thanks. What blows my mind is this statement from the article: OVERHEAD = max(SPECIFIED_MEMORY * 0.07, 384M)

If I'm plugging my 8 GB memoryOverhead into that formula, OVERHEAD = max(8192 MB * 0.07, 384 MB) ≈ 573 MB!!

What is YARN using the other ~7.4 GB for?


@Henry: I think that equation takes the executor memory (in your case, 15G) as its input and outputs the overhead value.

// The calculation below uses executorMemory, not memoryOverhead
math.max((MEMORY_OVERHEAD_FACTOR * executorMemory).toInt, MEMORY_OVERHEAD_MIN)

The goal is to calculate the overhead as a percentage of the real executor memory used by RDDs and DataFrames. I will add that when using Spark on YARN, the YARN configuration settings have to be adjusted and tuned carefully to line up with the Spark properties (as the referenced blog suggests). You might also want to look at tiered storage, e.g. persisting RDDs with MEMORY_AND_DISK, to reduce memory pressure.
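
To make that concrete, here is a minimal sketch of the default calculation quoted above, assuming the 0.07 factor and the 384 MB floor (values in MB):

// Sketch of the default overhead calculation quoted above (values in MB).
val MEMORY_OVERHEAD_FACTOR = 0.07
val MEMORY_OVERHEAD_MIN = 384

def defaultOverhead(executorMemoryMb: Int): Int =
  math.max((MEMORY_OVERHEAD_FACTOR * executorMemoryMb).toInt, MEMORY_OVERHEAD_MIN)

println(defaultOverhead(15 * 1024))   // 1075 -> about 1 GB for a 15 GB executor

So with a 15 GB executor the default overhead would be only about 1 GB; the 8 GB you set is an explicit override of that default, and YARN sizes each container at roughly executor memory plus that overhead.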


Hi Henry:

Since you are requesting 15G for each executor, you may want to increase the size of the Java heap space for the Spark executors, which can be set with this parameter:

spark.executor.extraJavaOptions='-Xmx24g'

New Contributor

Per recent Spark docs, you can't actually set the heap size that way. You need to use `spark.executor.memory` to do so. https://spark.apache.org/docs/2.1.1/configuration.html#runtime-environment


Just an FYI, Spark 2.1.1 doesn't allow setting the heap space in `extraJavaOptions`:

java.lang.Exception: spark.executor.extraJavaOptions is not allowed to specify max heap memory settings (was '-Xmx20g'). Use spark.executor.memory instead.
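
For completeness, a small sketch of the corrected setup (property names as in the docs linked above; the heap value here is just the figure from the earlier suggestion):

import org.apache.spark.SparkConf

// Heap flags in extraJavaOptions are rejected, as the exception above shows:
//   .set("spark.executor.extraJavaOptions", "-Xmx24g")
// Set the executor heap through spark.executor.memory instead:
val conf = new SparkConf()
  .set("spark.executor.memory", "24g")                // executor JVM heap
  .set("spark.yarn.executor.memoryOverhead", "8192")  // overhead remains a separate setting, in MB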