
Spark2 Tuning running on Yarn

New Contributor

I have 5 worker nodes in a Hortonworks cluster, with 15 cores per worker node available to YARN and 1 TB of YARN memory in total. The Spark configuration currently uses the default values (e.g., 1 GB executor memory).

1. Which parameters should I tune so that each user's Spark jobs make maximum use of the cluster?

2. What are the recommended values for the maximum number of executors, executor memory, and cores per executor for the above hardware configuration?

Note: Dynamic resource allocation is already enabled.

Any tips on tuning Spark jobs are always welcome. @Guilherme Braccialli @Andrew Watson


Re: Spark2 Tuning running on Yarn

Hi @Sushant,

Regarding the Spark parameters: the ideal settings depend heavily on the characteristics of each Spark job. You can get some good defaults from this Apache Spark Config Cheat Sheet: http://c2fo.io/c2fo/spark/aws/emr/2016/07/06/apache-spark-config-cheatsheet/
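
For illustration only, here is a rough sizing sketch (in Python) for the hardware described in the question: 5 nodes, 15 cores each, roughly 200 GB of YARN memory per node. It follows the common "about 5 cores per executor" rule of thumb that the cheat sheet is built around; the exact figures are assumptions, not verified recommendations, and with dynamic allocation enabled the executor count would act more as a ceiling than a fixed number.

# Rough, illustrative sizing for the cluster in the question
# (5 worker nodes, 15 cores each, ~200 GB of YARN memory per node).
# These numbers are assumptions for illustration, not verified recommendations.

nodes = 5
cores_per_node = 15
yarn_mem_per_node_gb = 200            # ~1 TB of YARN memory across 5 nodes

cores_per_executor = 5                # commonly cited sweet spot for HDFS throughput
executors_per_node = cores_per_node // cores_per_executor    # 3
num_executors = nodes * executors_per_node - 1               # leave room for the driver -> 14

# Reserve ~10% of each executor's memory share for YARN overhead.
executor_memory_gb = int(yarn_mem_per_node_gb / executors_per_node * 0.9)    # ~60

print(f"--num-executors {num_executors} "
      f"--executor-cores {cores_per_executor} "
      f"--executor-memory {executor_memory_gb}g")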

Also take a look at Dr. Elephant, a performance monitoring and tuning tool for Apache Hadoop.

Re: Spark2 Tuning running on Yarn

Super Collaborator

Hi @Sushant,

There is no universal rule of thumb for how many resources a job should be given; it depends entirely on the job type and on resource availability. The general principle is to study the job's execution plan carefully and allocate resources according to the scenarios described below.

Number of executors:

If there are non-dependent steps waiting to execute even though their predecessors have finished, consider increasing the number of executors so that all of those independent steps can run in parallel.
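
As a minimal sketch (keeping dynamic allocation enabled, as in the original question), you can widen the dynamic-allocation bounds so that independent stages actually get the chance to run side by side. The property names are standard Spark 2.x settings; the application name and values are placeholders.

from pyspark.sql import SparkSession

# Minimal sketch: widen the dynamic-allocation bounds so non-dependent
# stages can run in parallel. Values here are placeholders.
spark = (SparkSession.builder
         .appName("parallel-stages-sketch")                     # hypothetical app name
         .config("spark.dynamicAllocation.enabled", "true")
         .config("spark.shuffle.service.enabled", "true")       # required for dynamic allocation on YARN
         .config("spark.dynamicAllocation.minExecutors", "2")
         .config("spark.dynamicAllocation.maxExecutors", "14")  # illustrative ceiling
         .getOrCreate())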

Number of cores per executor:

If individual tasks take a long time to compute, you may need to increase the number of cores per executor.
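
Purely as an illustration (5 is a commonly quoted value, not a figure from this thread), the number of cores each executor gets can be set when building the session:

from pyspark.sql import SparkSession

# Sketch only: give each executor more cores when individual tasks are CPU-bound.
spark = (SparkSession.builder
         .appName("cpu-bound-sketch")              # hypothetical app name
         .config("spark.executor.cores", "5")      # illustrative value
         .getOrCreate())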

Executor memory:

If you see GC errors and tasks being resubmitted (you may notice some tasks marked as skipped from a previous run), look at the executor memory and increase it.

Also be mindful that if you don't unpersist your RDDs or DataFrames, you may see GC time increase.
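
A short sketch of both points, with illustrative values only: raise executor memory (and its overhead) when GC-related task failures and resubmissions show up, and unpersist cached RDDs/DataFrames once they are no longer needed so they don't inflate GC time. Note that spark.executor.memoryOverhead is the Spark 2.3+ name; older 2.x releases use spark.yarn.executor.memoryOverhead.

from pyspark.sql import SparkSession

# Sketch only: bump executor memory and overhead when tasks fail with GC/OOM
# errors and get resubmitted. Values are illustrative, not recommendations.
spark = (SparkSession.builder
         .appName("memory-tuning-sketch")                    # hypothetical app name
         .config("spark.executor.memory", "8g")              # illustrative value
         .config("spark.executor.memoryOverhead", "1g")      # spark.yarn.executor.memoryOverhead before Spark 2.3
         .getOrCreate())

df = spark.range(10_000_000).cache()    # cache a DataFrame ...
df.count()                              # ... and materialize it
# ... do the work that needs the cached data ...
df.unpersist()                          # release it so it doesn't add to GC pressure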

The best practice is to examine your execution plan carefully and optimize the resource allocation accordingly.

You might find the following benchmark presentation helpful for understanding this in more depth:

https://www.slideshare.net/HadoopSummit/running-spark-in-production-61337353
