Support Questions

Find answers, ask questions, and share your expertise
Announcements
Check out our newest addition to the community, the Cloudera Data Analytics (CDA) group hub.

For Pyspark Sql Query delay in picking up tasks for spark executors

Explorer

Dear All,

 

I am using pyspark2.3.0 . With below syntax

        pyspark2 --driver-memory 4G --executor-cores 3 --driver-cores 3 --num-executors 16 --executor-memory 4G

 

On execution of above statement making I am reserving the above cores mentioned

 

Now the problem is I am firing below query, Below mentioned table is partitioned and it is of 450GB

        spark.sql("select * from tabl_name limit 20").show()

 

        On firing above query we have ,There are 10000 tasks created and execution starts at end of 4th min. 

 

Is there any help so that I could reduce the time to wait till 4 min. Please let me know if any parameter I should be setting

 

 

Thanks and Regards,

Naveen Srikanth D

 

1 REPLY 1

Expert Contributor

Hi @GokuNaveen

 

By design, Number of tasks = Number of partitions. 

Typically, there is a partition for each HDFS block being read.

 

This means: 1 Spark task <--> 1 Spark partition <--> 1 HDFS block

 

When a job is run, Spark makes a determination of where to execute the task based on certain factors such as available memory or cores on a node, where the data is located in a cluster, or available executors.

 

By default, spark waits for 3ms to prefer launching a task on a node where the actual data resides. This parameter is referred to as spark.locality.wait. When you've several tasks this can be a bottleneck and can increase the overall startup time, however, remember that having a data local to the task can actually reduce the time a task takes to complete. Please explore this value (try with 0ms) and keep in mind the pros and cons of this. For a detailed discussion see this link

Also note that spark.locality.wait is only relevant when dynamic allocation is enabled (with static allocation, it has no idea what data you're even going to read when it requests containers, so it can't use any locality info). 

 

On a side note, you should also look at the number of partitions. With too many partitions, the task scheduling may take more time than the actual execution time. Ideally, a ratio (partitions to cores) of 2X (or 3X) should be a good place to start with. 

 

Amit

Take a Tour of the Community
Don't have an account?
Your experience may be limited. Sign in to explore more.