
Spark2 Tuning running on Yarn


I have 5 worker nodes in a Hortonworks cluster, with 15 cores per worker node available to YARN and 1 TB of YARN memory in total. The Spark config values are currently the defaults, i.e. 1 GB executor memory, etc.

1. Which parameters should I tune so that each user's Spark jobs make maximum use of the cluster?

2. What are the recommended values for the maximum number of executors, executor memory, and cores per executor for the above hardware configuration?

Note: Dynamic resource allocation is already enabled.

Any tips on tuning Spark jobs are always welcome. @Guilherme Braccialli @Andrew Watson


Hi @Sushant,

Regarding the Spark parameters: the optimal settings depend heavily on the characteristics of each Spark job. You can get some good defaults using this Apache Spark Config Cheat Sheet: http://c2fo.io/c2fo/spark/aws/emr/2016/07/06/apache-spark-config-cheatsheet/
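To make that concrete for the hardware in the question (5 worker nodes, 15 YARN cores per node, roughly 200 GB of YARN memory per node), a cheat-sheet style calculation might look like the sketch below. The 5-cores-per-executor rule, the one core reserved per node, and the ~10% memory overhead are common rules of thumb rather than values from this thread, so treat them as assumptions.

```scala
import org.apache.spark.sql.SparkSession

// Rough cheat-sheet style sizing (assumptions, not verified values):
//   usable cores per node : 15 - 1 reserved for OS/Hadoop daemons        = 14
//   executors per node    : 14 / 5 cores per executor (rounded down)     = 2
//   total executors       : 2 * 5 nodes - 1 reserved for the YARN AM     = 9
//   memory per executor   : (200 GB / 2) * ~0.9 for off-heap overhead    ≈ 90 GB
val spark = SparkSession.builder()
  .appName("sized-job")
  .config("spark.executor.cores", "5")
  .config("spark.executor.memory", "90g")
  // dynamic allocation is already on in this cluster, so cap it rather than fixing --num-executors
  .config("spark.dynamicAllocation.maxExecutors", "9")
  .getOrCreate()
```

The same values can of course be passed on the spark-submit command line instead, and 90 GB is a very large heap, so in practice many people deliberately run more, smaller executors to keep GC pauses down.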

Also take a look at dr-elephant, a performance monitoring and tuning tool for Apache Hadoop.


Hi @Sushant,

There is no single rule of thumb or common practice for how many resources a job should be given; it depends purely on the job type and on resource availability.

The principle is to look into the job's execution plan carefully and allocate resources depending on the scenarios mentioned below.

Number of executors:

If there are parallel steps still waiting for execution even though their predecessor has completed, consider increasing the number of executors so that all non-dependent steps can run in parallel.
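As an illustration, here is a minimal sketch of submitting two non-dependent steps from separate threads so they can run at the same time; the input paths and column names are made up. They only actually overlap if enough executors are available, which is the point of raising the executor count (or the dynamic-allocation cap).

```scala
import scala.concurrent.{Await, Future}
import scala.concurrent.ExecutionContext.Implicits.global
import scala.concurrent.duration.Duration
import org.apache.spark.sql.SparkSession

object ParallelSteps {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("parallel-independent-steps").getOrCreate()

    // Two aggregations with no dependency on each other (hypothetical paths/columns).
    val ordersAgg = Future {
      spark.read.parquet("/data/orders")
        .groupBy("customer_id").count()
        .write.mode("overwrite").parquet("/tmp/orders_agg")
    }
    val clicksAgg = Future {
      spark.read.parquet("/data/clicks")
        .groupBy("page").count()
        .write.mode("overwrite").parquet("/tmp/clicks_agg")
    }

    // Both Spark jobs are submitted concurrently; with enough executors they overlap
    // instead of running one after the other.
    Await.result(Future.sequence(Seq(ordersAgg, clicksAgg)), Duration.Inf)
    spark.stop()
  }
}
```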

Number of cores per executor:

In cases where the compute time of the tasks is long, you may need to increase the executor cores so that more tasks can run concurrently within each executor.
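A small sketch of the same idea in terms of task slots, with assumed numbers (9 executors × 5 cores) rather than values from this thread: each executor runs spark.executor.cores tasks at once, so extra cores only help if there are also enough partitions to fill them.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}

// Illustrative only: the counts below are assumptions, not settings from this thread.
def repartitionForCluster(spark: SparkSession): DataFrame = {
  val executorCount    = 9   // e.g. spark.dynamicAllocation.maxExecutors
  val coresPerExecutor = 5   // spark.executor.cores
  val taskSlots        = executorCount * coresPerExecutor

  // Long, compute-heavy stages finish sooner when there are more task slots
  // AND at least that many partitions to keep every core busy.
  spark.read.parquet("/data/events")   // hypothetical input path
    .repartition(taskSlots * 2)        // ~2 partitions per slot avoids idle cores at the end of a stage
}
```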

Executor memory:

If you see GC errors and tasks being resubmitted (you will find some tasks from the previous run marked as skipped), then you need to look into the executor memory and increase it.

Also be mindful that if you don't unpersist your RDDs or DataFrames, you may see an increase in GC time.
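A sketch of both points together, with made-up paths and sizes: raise the executor memory (and the container overhead) when GC errors or resubmitted tasks show up, and unpersist cached RDDs/DataFrames as soon as they are no longer needed so they stop occupying executor memory.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.storage.StorageLevel

val spark = SparkSession.builder()
  .appName("gc-friendly-job")
  // Bump these if executor logs show long GC pauses or YARN kills containers for exceeding memory.
  .config("spark.executor.memory", "8g")
  .config("spark.executor.memoryOverhead", "1g") // spark.yarn.executor.memoryOverhead on older Spark 2.x
  .getOrCreate()

val enriched = spark.read.parquet("/data/raw")   // hypothetical path
  .filter("status = 'ACTIVE'")
  .persist(StorageLevel.MEMORY_AND_DISK)

enriched.count()                                 // ... actions that reuse the cached data ...

// Release the cached blocks once they are no longer needed,
// otherwise they keep occupying executor memory and drive up GC time.
enriched.unpersist()
```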

The best practice is to look into your execution plan carefully and optimize the resource allocation accordingly.
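For inspecting the plan itself, DataFrame.explain(true) prints the parsed, analyzed, optimized, and physical plans; the table name below is just a placeholder.

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("plan-inspection").getOrCreate()

val report = spark.table("sales")   // hypothetical table
  .groupBy("region")
  .sum("amount")

// Look for Exchange (shuffle) and file-scan nodes in the physical plan
// to see where most of the time and memory will go.
report.explain(true)
```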

You might find the following benchmark presentation helpful for understanding more of this:

https://www.slideshare.net/HadoopSummit/running-spark-in-production-61337353