Created 03-01-2018 03:24 AM
I want to know how I should decide on --executor-cores, --executor-memory, and --num-executors, given my cluster configuration: 40 nodes, 20 cores each, 100 GB of memory each.
I have a 2 GB input file and am performing filter and aggregation operations on it.
What values should be given to these parameters in the spark-submit command, and how will it work?
(I don't want to use dynamic allocation for this particular case.)
Created 03-01-2018 06:51 PM
Number of cores = number of concurrent tasks an executor can run.
So we might think that more concurrent tasks per executor gives better performance. But experience shows that any application running more than 5 concurrent tasks per executor tends to perform poorly (largely because HDFS client throughput degrades with many concurrent threads), so stick to 5.
Next, leave 1 core per node for the OS and Hadoop daemons; with 5 cores per executor and 19 available cores per node, we come to ~4 executors per node (19 / 5 ≈ 3.8).
Similarly, leaving ~2 GB per node for the OS and daemons gives ~98 GB, so memory for each executor is 98 / 4 ≈ 24 GB.
Now calculate the YARN memory overhead (the larger of 384 MB and 7% of executor memory): 0.07 * 24 GB ≈ 1.68 GB.
Since 1.68 GB > 384 MB, the overhead is 1.68 GB.
Subtracting that overhead from the 24 GB above: 24 - 1.68 ≈ 22 GB for --executor-memory.
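Putting those numbers together, a submission against this cluster might look like the line below. The executor count is an assumption derived from the arithmetic above (4 executors per node * 40 nodes = 160, minus 1 container for the YARN ApplicationMaster), not a measured value:
--master yarn --num-executors 159 --executor-cores 5 --executor-memory 22G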
Created 03-04-2018 04:56 AM
Thank you @Vikas Srivastava for your inputs, but I would like to know how my input data size will affect my configuration. Considering that other jobs will also be running on the cluster, I want to use just enough resources for my 2 GB input.
Created 03-04-2018 03:13 PM
In your case, if you run it on YARN, you can go as low as 1 GB per executor, like this:
--master yarn-client --executor-memory 1G --executor-cores 2 --num-executors 12
You can increase the number of executors to make it faster 🙂
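To relate that to the 2 GB input: assuming the file is in HDFS in a splittable format with the default 128 MB block size, Spark will read it as roughly 2048 / 128 = 16 partitions, so 12 executors * 2 cores = 24 task slots can process every input partition in a single wave. A full command might then look like this (the jar and class name are placeholders):
spark-submit --master yarn-client --executor-memory 1G --executor-cores 2 --num-executors 12 --class com.example.FilterAggJob filter-agg-job.jar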
Created 01-05-2020 06:17 AM
Hi,
Hope the links below help in deciding the configuration, in addition to the previous comments:
https://blog.cloudera.com/how-to-tune-your-apache-spark-jobs-part-2/
https://blog.cloudera.com/how-to-tune-your-apache-spark-jobs-part-1/
Thanks
AKR