How to decide spark submit configurations
Labels: Apache Spark
Created ‎03-01-2018 03:24 AM
I want to know how I should decide on --executor-cores, --executor-memory, and --num-executors, given a cluster of 40 nodes with 20 cores and 100 GB of memory each.
I have a 2 GB file of data and am performing a filter and an aggregation on it.
What values should these parameters be given in the spark-submit command, and how will it work?
(I don't want to use dynamic allocation for this particular case.)
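For context, the three flags in question go into the spark-submit invocation roughly as sketched below; the values and the application file are placeholders, not recommendations:

spark-submit \
  --master yarn \
  --deploy-mode client \
  --executor-cores <cores-per-executor> \
  --executor-memory <memory-per-executor> \
  --num-executors <number-of-executors> \
  filter_and_aggregate.py   # hypothetical application script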
Created ‎03-01-2018 06:51 PM
Number of cores = number of concurrent tasks an executor can run.
So we might think that more concurrent tasks per executor gives better performance. But in practice, applications running more than 5 concurrent tasks per executor tend to perform poorly (HDFS client throughput degrades), so stick to 5.
Next step: with 5 cores per executor and 19 cores available per node (1 of the 20 cores is left for the OS and Hadoop daemons), we arrive at ~4 executors per node.
Memory per executor is then 98 GB / 4 ≈ 24 GB (about 2 GB of the 100 GB per node is left for the OS).
Calculating the memory overhead: 0.07 * 24 GB = 1.68 GB.
Since 1.68 GB > the 384 MB minimum, the overhead is 1.68 GB.
Subtract that overhead from the 24 GB above: 24 - 1.68 ≈ 22 GB per executor.
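Putting those numbers into a command gives a sketch like the one below. The --num-executors figure is an assumption not stated in the post above: 4 executors/node × 40 nodes = 160, minus 1 to leave room for the YARN ApplicationMaster.

spark-submit \
  --master yarn \
  --deploy-mode client \
  --num-executors 159 \
  --executor-cores 5 \
  --executor-memory 22G \
  filter_and_aggregate.py   # hypothetical application script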
Created ‎03-04-2018 04:56 AM
Thank you @Vikas Srivastava for your inputs, but I would like to know how my input data size affects the configuration. Other jobs will also be running on the cluster, and I want to request only enough resources for my 2 GB input.
Created ‎03-04-2018 03:13 PM
In your case, if you run it on YARN, you can use as little as 1 GB per executor, like this:
--master yarn-client --executor-memory 1G --executor-cores 2 --num-executors 12
You can increase the number of executors to improve throughput 🙂
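As a complete command, that suggestion would look roughly like the sketch below; the application script is a placeholder, and on Spark 2.x and later the deprecated yarn-client master is written as --master yarn --deploy-mode client:

spark-submit \
  --master yarn-client \
  --executor-memory 1G \
  --executor-cores 2 \
  --num-executors 12 \
  filter_and_aggregate.py   # hypothetical application script

Regarding the input-size question: assuming the default 128 MB HDFS block size, a 2 GB file splits into roughly 16 partitions, so 12 executors × 2 cores = 24 task slots already provide more than enough parallelism for the scan, filter, and aggregation.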
Created ‎01-05-2020 06:17 AM
Hi,
Hope the links below help in deciding the configuration, in addition to the previous comments:
https://blog.cloudera.com/how-to-tune-your-apache-spark-jobs-part-2/
https://blog.cloudera.com/how-to-tune-your-apache-spark-jobs-part-1/
Thanks
AKR
