Support Questions
Find answers, ask questions, and share your expertise
Alert: Welcome to the Unified Cloudera Community. Former HCC members be sure to read and learn how to activate your account here.

For Pyspark Sql Query delay in picking up tasks for spark executors


For Pyspark Sql Query delay in picking up tasks for spark executors


Dear All,


I am using pyspark2.3.0 . With below syntax

        pyspark2 --driver-memory 4G --executor-cores 3 --driver-cores 3 --num-executors 16 --executor-memory 4G


On execution of above statement making I am reserving the above cores mentioned


Now the problem is I am firing below query, Below mentioned table is partitioned and it is of 450GB

        spark.sql("select * from tabl_name limit 20").show()


        On firing above query we have ,There are 10000 tasks created and execution starts at end of 4th min. 


Is there any help so that I could reduce the time to wait till 4 min. Please let me know if any parameter I should be setting



Thanks and Regards,

Naveen Srikanth D



Re: For Pyspark Sql Query delay in picking up tasks for spark executors

Expert Contributor

Hi @GokuNaveen


By design, Number of tasks = Number of partitions. 

Typically, there is a partition for each HDFS block being read.


This means: 1 Spark task <--> 1 Spark partition <--> 1 HDFS block


When a job is run, Spark makes a determination of where to execute the task based on certain factors such as available memory or cores on a node, where the data is located in a cluster, or available executors.


By default, spark waits for 3ms to prefer launching a task on a node where the actual data resides. This parameter is referred to as spark.locality.wait. When you've several tasks this can be a bottleneck and can increase the overall startup time, however, remember that having a data local to the task can actually reduce the time a task takes to complete. Please explore this value (try with 0ms) and keep in mind the pros and cons of this. For a detailed discussion see this link

Also note that spark.locality.wait is only relevant when dynamic allocation is enabled (with static allocation, it has no idea what data you're even going to read when it requests containers, so it can't use any locality info). 


On a side note, you should also look at the number of partitions. With too many partitions, the task scheduling may take more time than the actual execution time. Ideally, a ratio (partitions to cores) of 2X (or 3X) should be a good place to start with. 



Don't have an account?
Coming from Hortonworks? Activate your account here