Can someone explain the following statement from the Hortonworks site (https://docs.hortonworks.com/HDPDocuments/HDP2/HDP-2.3.4/bk_spark-guide/content/ch_tuning-spark.html#spark-job-status)?
There are tradeoffs between num-executors and executor-memory. Large executor memory does not imply better performance, due to JVM garbage collection. Sometimes it is better to configure a larger number of small JVMs than a small number of large JVMs.
Let's walk through the anatomy of Spark on YARN:
1. YARN containers are spawned on physical nodes.
2. The Spark Application Master requests a fixed number of YARN containers and runs Spark tasks (map, reduce, filter) within them; these containers are known as executors.
3. Each task runs as a thread within the executor.
4. Container -> Spark executor -> task threads.
5. Depending on the input parameter executor-memory, each executor is spawned as a JVM with the defined heap size.
6. This JVM memory is used both for running the task threads and for keeping data in memory. GC happens as that memory is allocated and de-allocated.
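The knobs described above map directly onto spark-submit flags. A minimal sketch (the application JAR name and the sizes are placeholder values, not recommendations):

```shell
# Hypothetical sizing: 6 executors, each a 4 GB JVM running 2 task threads.
# --num-executors, --executor-memory, and --executor-cores are the real
# spark-submit flags; app.jar and the numbers are placeholders.
spark-submit \
  --master yarn \
  --num-executors 6 \
  --executor-memory 4g \
  --executor-cores 2 \
  app.jar
```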
So overall, executors are JVMs running multiple threads, and the question boils down to: would a JVM of infinite size give the best performance? The answer is no, because JVM performance is largely a function of wise memory allocation that minimizes garbage collection.
A huge JVM performs better initially: when the process starts there is a large amount of memory available for allocation, but eventually memory fills up and more and more time is spent doing GC. It is the classic bell-curve relationship between performance and memory.
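One way to observe this curve in practice is to enable GC logging in the executors. A hedged sketch: spark.executor.extraJavaOptions is the real Spark property, but the JVM flags shown assume a pre-Java-9 HotSpot JVM (they were removed in Java 9's unified logging):

```shell
# Assumed example: watch GC time grow in the executor logs over the job's life.
--conf "spark.executor.extraJavaOptions=-verbose:gc -XX:+PrintGCDetails -XX:+PrintGCTimeStamps"
```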
A huge JVM also tends to fragment over time, and long pause times occur when a full GC happens. The solution is to strike a fine balance between num-executors and executor-memory.
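As a back-of-the-envelope illustration of that balance, here is a toy sizing calculation. The node size and overhead fraction are assumed example values, not Spark defaults:

```python
# Toy sizing arithmetic: split one worker node's RAM evenly across executors.
# node_mem_gb and overhead_frac are assumed values for illustration only.
node_mem_gb = 64        # assumed RAM available to YARN on one node
overhead_frac = 0.10    # rough allowance for off-heap / YARN overhead

def executor_heap_gb(executors_per_node: int) -> float:
    """Heap each executor JVM gets if the node's memory is split evenly."""
    per_executor = node_mem_gb / executors_per_node
    return round(per_executor * (1 - overhead_frac), 1)

# Few large JVMs: big heaps, but long full-GC pauses as they fill up.
print(executor_heap_gb(2))   # -> 28.8
# Many small JVMs: shorter GC pauses, often the better trade-off.
print(executor_heap_gb(8))   # -> 7.2
```

The point is not the exact numbers but that both configurations use the same total memory, while the GC behavior of the resulting JVMs differs sharply.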