How to decide spark submit configurations


I want to know how I should decide on --executor-cores, --executor-memory, and --num-executors, given a cluster configuration of 40 nodes, 20 cores per node, and 100 GB of memory per node.

I have data in a 2 GB file and am performing filter and aggregation operations on it.

What values should these parameters be given in the spark-submit command, and how will it work?

(I don't want to use dynamic allocation for this particular case.)

1 ACCEPTED SOLUTION

Contributor

Number of cores = the number of concurrent tasks an executor can run.

So we might think that more concurrent tasks per executor means better performance. But in practice, running more than 5 concurrent tasks per executor tends to hurt performance (largely because of degraded HDFS I/O throughput). So stick to 5.

Coming to the next step: with 5 cores per executor, and 19 cores available per node (leaving 1 of the 20 cores for the OS and Hadoop daemons), we get 19 / 5 = ~4 executors per node.

So the memory for each executor is 98 / 4 = ~24 GB (leaving ~2 GB of the node's 100 GB for the OS and daemons).

Now account for the YARN executor memory overhead (spark.yarn.executor.memoryOverhead, here taken as max(384 MB, 7% of executor memory)): 0.07 * 24 GB = 1.68 GB.

Since 1.68 GB > 384 MB, the overhead is 1.68 GB.

Subtract that overhead from the 24 GB per executor: 24 - 1.68 ~ 22 GB for --executor-memory.
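Putting those numbers together for this 40-node cluster, a sketch of the resulting spark-submit call might look like the one below. The figures are illustrative, not prescriptive: reserving one executor's worth of resources for the YARN ApplicationMaster gives roughly 4 * 40 - 1 = 159 executors, and the deploy mode and application file are assumptions/placeholders rather than something fixed by the calculation above.

# Per node: 20 cores / 100 GB, minus ~1 core and ~2 GB for the OS and Hadoop daemons
# 5 cores per executor -> 19 / 5 ~ 4 executors per node -> ~160 in the cluster, minus 1 for the ApplicationMaster
spark-submit \
  --master yarn \
  --deploy-mode cluster \
  --num-executors 159 \
  --executor-cores 5 \
  --executor-memory 22G \
  <your-application-jar-or-py-file> [application arguments]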


4 REPLIES



Thank you @Vikas Srivastava for your inputs, but I would like to know how my input data size affects my configuration. Considering that other jobs will also be running in the cluster, I want to use just enough resources for my 2 GB input.

Contributor

In your case, if you run it on YARN, you can even use the minimum of 1G per executor, like this:

--master yarn-client --executor-memory 1G --executor-cores 2 --num-executors 12

You can increase the number of executors to speed it up further 🙂
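For a sense of scale: assuming the 2 GB file sits in HDFS with the default 128 MB block size, it splits into roughly 2048 / 128 = 16 partitions, so each stage has about 16 tasks; 12 executors with 2 cores each give 24 task slots, enough to run a whole stage in a single wave. A full command along those lines might look like the sketch below (written with the equivalent --master yarn --deploy-mode client form; the application file is a placeholder):

spark-submit \
  --master yarn \
  --deploy-mode client \
  --num-executors 12 \
  --executor-cores 2 \
  --executor-memory 1G \
  <your-application-jar-or-py-file> [application arguments]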

Cloudera Employee

Hi,

 

Hope the links below help in deciding the configurations, in addition to the previous comments:

 

https://blog.cloudera.com/how-to-tune-your-apache-spark-jobs-part-2/

https://blog.cloudera.com/how-to-tune-your-apache-spark-jobs-part-1/

 

Thanks

AKR