Support Questions
Find answers, ask questions, and share your expertise
Announcements

How to decide spark submit configurations

Solved

How to decide spark submit configurations

New Contributor

I want to know how I should decide on --executor-cores, --executor-memory, and --num-executors, considering my cluster configuration is: 40 nodes, 20 cores each, 100 GB each.

I have data in a file of 2 GB size and am performing filter and aggregation operations on it.

What values should I give these parameters in the spark-submit command, and how will it work?

(I don't want to use dynamic memory allocation for this particular case)

1 ACCEPTED SOLUTION

Accepted Solutions

Re: How to decide spark submit configurations

Contributor

Number of cores = number of concurrent tasks an executor can run.

So we might think that more concurrent tasks per executor means better performance. But experience shows that applications running more than 5 concurrent tasks per executor tend to perform poorly (commonly attributed to reduced HDFS I/O throughput), so cap this at 5.

Coming to the next step: reserve 1 core per node for the OS and Hadoop daemons, which leaves 19 usable cores per node. With 5 cores per executor, that comes to roughly 4 executors per node (19 / 5 ≈ 4).

Similarly, reserving about 2 GB per node for the OS and Hadoop daemons leaves 98 GB, so memory per executor is 98 / 4 ≈ 24 GB.

Next, account for the YARN executor memory overhead, which is max(384 MB, 0.07 * executor memory): 0.07 * 24 GB = 1.68 GB (here 24 GB is the per-executor memory calculated above).

Since 1.68 GB > 384 MB, the overhead is 1.68 GB.

Subtracting that overhead from each executor's 24 GB gives 24 - 1.68 ≈ 22 GB, so set --executor-memory to about 22G.
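The arithmetic above can be sketched as a small script. This is only a sketch of this post's rules of thumb: the 5-cores-per-executor cap, the 1 core and ~2 GB reserved per node for the OS and Hadoop daemons, the 0.07 overhead fraction, and subtracting one executor for the YARN ApplicationMaster are all assumptions taken from this thread, not fixed Spark rules.

```python
def size_executors(nodes, cores_per_node, mem_per_node_gb,
                   cores_per_executor=5,   # rule of thumb from this post
                   reserved_cores=1,       # per node, for OS/Hadoop daemons
                   reserved_mem_gb=2,      # per node, for OS/Hadoop daemons
                   overhead_fraction=0.07, # YARN executor memory overhead fraction
                   min_overhead_gb=0.384): # YARN minimum overhead (384 MB)
    usable_cores = cores_per_node - reserved_cores            # 20 - 1 = 19
    executors_per_node = round(usable_cores / cores_per_executor)  # 19 / 5 -> ~4
    mem_per_executor = (mem_per_node_gb - reserved_mem_gb) // executors_per_node  # 98 / 4 -> 24
    overhead = max(min_overhead_gb, overhead_fraction * mem_per_executor)  # 1.68 GB
    executor_memory_gb = round(mem_per_executor - overhead)   # 24 - 1.68 -> ~22
    num_executors = nodes * executors_per_node - 1            # leave one slot for the YARN AM
    return executors_per_node, executor_memory_gb, num_executors

print(size_executors(nodes=40, cores_per_node=20, mem_per_node_gb=100))
# -> (4, 22, 159)
```

With those numbers you would pass --executor-cores 5 --executor-memory 22G --num-executors 159 to spark-submit.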

3 REPLIES


Re: How to decide spark submit configurations

New Contributor

Thank you @Vikas Srivastava for your input, but I would like to know how my input data size affects the configuration. Considering that other jobs will also be running on the cluster, I want to use only enough resources for my 2 GB input.

Re: How to decide spark submit configurations

Contributor

In your case, if you run it on YARN, you can use as little as 1 GB per executor, for example:

--master yarn-client --executor-memory 1G --executor-cores 2 --num-executors 12

You can increase the number of executors to improve performance. :)
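As a rough sanity check that such a small configuration covers a 2 GB input, you can compare the number of input tasks to the number of task slots. This sketch assumes the common 128 MB HDFS block / input split size, which may differ on your cluster:

```python
import math

input_size_mb = 2 * 1024   # the 2 GB input file from the question
split_size_mb = 128        # assumed HDFS block / Spark input split size

num_tasks = math.ceil(input_size_mb / split_size_mb)  # 16 input tasks
task_slots = 12 * 2        # --num-executors 12 * --executor-cores 2 = 24 slots

print(num_tasks, task_slots)  # -> 16 24
```

Since 24 slots exceed 16 tasks, all input partitions can run in a single wave with this configuration; a larger input or smaller split size would raise the task count and benefit from more executors.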