06-22-2017 06:22 AM
We have a Spark job whose open-files limit we need to increase from 300 to 1200. What is the impact on the cores and memory configured for the job?
Is there an estimation formula relating the number of open files to the number of cores?
06-23-2017 06:21 AM
I think what you are looking for is the number of tasks each executor can handle. Tasks won't correspond directly with files, but increasing the number of tasks per executor or increasing the number of executors will boost parallelism.
The number of cores per executor determines how many tasks an executor can run at a time; it waits until a task has finished before taking on a new one. So you need to find the number of cores that gives the right amount of tasks running per executor, with enough memory per task to handle your data. Let's say your job launches 1200 tasks: you could configure it with 240 executors and 5 cores per executor, which allows all tasks to run in parallel at the same time. If 240 is too many, drop it down, but expect a slower run time for the job.
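The arithmetic above can be sketched as a quick back-of-the-envelope check (the task and executor counts are just the numbers from the example, not measured values):

```python
import math

def waves(total_tasks, num_executors, cores_per_executor):
    """Number of sequential 'waves' needed to run all tasks,
    given the cluster's concurrent task slots."""
    slots = num_executors * cores_per_executor
    return math.ceil(total_tasks / slots)

# 240 executors x 5 cores = 1200 slots, so all 1200 tasks fit in one wave
print(waves(1200, 240, 5))  # -> 1
# Halving the executor count doubles the waves (roughly double the run time)
print(waves(1200, 120, 5))  # -> 2
```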
Tip: I used 5 cores in the example because others have shown that more than five provides diminishing returns on HDFS throughput.
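As a sketch, that 240 x 5 layout could be expressed on the command line like this (the memory value, class name, and jar are placeholders, not recommendations):

```shell
# Hypothetical spark-submit layout for 1200 concurrent tasks:
# 240 executors x 5 cores each = 1200 task slots, one wave.
spark-submit \
  --num-executors 240 \
  --executor-cores 5 \
  --executor-memory 8G \
  --class com.example.MyJob \
  my-job.jar
```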
06-24-2017 08:18 PM
@mbigelow I read the best practices and fine-tuned my Spark job. Now I'm fine with the executors, cores per executor, and memory. I had a Spark job that ran with 50x1x8G (executors x cores x memory) and 600 open files, and it was running fine; when I increased the open files to 1200, the job started to fail. I'm constantly trying to find the right number of executors and cores per executor.
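When failures start right after raising the open-file count, it's worth checking the OS file-descriptor limits on the worker nodes; a quick sketch (1200 is the target from this thread, and your nodes' hard limit may differ):

```shell
# Show the current soft limit on open file descriptors for this shell
ulimit -n
# Show the hard limit (the soft limit cannot be raised above this without root)
ulimit -Hn
# To raise the soft limit for the session (1200 is the target from this thread):
#   ulimit -n 1200
# A permanent change usually goes in /etc/security/limits.conf (distribution-specific)
```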
BTW, finding the right configuration isn't a simple task, and sometimes I find myself resorting to trial and error.