Explorer
Posts: 15
Registered: 03-03-2017

PySpark SQL query delay in picking up tasks for Spark executors

Dear All,

 

I am using PySpark 2.3.0, launched with the following command:

        pyspark2 --driver-memory 4G --executor-cores 3 --driver-cores 3 --num-executors 16 --executor-memory 4G

 

Executing the above statement reserves the cores and executors mentioned.

 

Now the problem is that I am firing the query below. The table is partitioned and is about 450 GB:

        spark.sql("select * from tabl_name limit 20").show()

 

On firing the above query, 10,000 tasks are created and execution only starts at the end of the 4th minute.

 

Is there any way to reduce this 4-minute wait? Please let me know if there is any parameter I should be setting.

Thanks and Regards,

Naveen Srikanth D

 

Cloudera Employee
Posts: 75
Registered: 11-16-2015

Re: PySpark SQL query delay in picking up tasks for Spark executors


Hi @GokuNaveen

 

By design, number of tasks = number of partitions.

Typically, there is a partition for each HDFS block being read.

 

This means: 1 Spark task <--> 1 Spark partition <--> 1 HDFS block
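
As a quick sanity check, you can see how many partitions (and therefore tasks) a scan of the table will create. A minimal sketch, assuming the table name from your query:

        df = spark.table("tabl_name")
        print(df.rdd.getNumPartitions())  # roughly one partition per HDFS block read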

 

When a job runs, Spark decides where to execute each task based on factors such as the memory and cores available on a node, where the data is located in the cluster, and which executors are available.

 

By default, Spark waits up to 3 seconds to launch a task on a node where its data resides before falling back to a less-local node; this is controlled by spark.locality.wait (default 3s). When you have many tasks, this wait can become a bottleneck and increase the overall startup time. However, remember that running a task local to its data can actually reduce the time the task takes to complete. Please experiment with this value (try 0s) and keep the pros and cons in mind. For a detailed discussion see this link.

Also note that locality-aware executor placement only happens when dynamic allocation is enabled: with static allocation, Spark has no idea what data you are going to read when it requests containers, so it cannot use any locality information to place them. The spark.locality.wait delay itself still applies at task-scheduling time in either mode.
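
The setting must be in place before the SparkContext starts, so pass it at launch time, e.g. add --conf spark.locality.wait=0s to your pyspark2 command, or set it when building a session in a standalone script. A minimal sketch (the app name is just an illustration):

        from pyspark.sql import SparkSession

        # Set the locality wait before the context is created; "0s" disables the wait.
        spark = (SparkSession.builder
                 .appName("locality-wait-test")
                 .config("spark.locality.wait", "0s")
                 .getOrCreate())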

 

On a side note, you should also look at the number of partitions. With too many partitions, task scheduling can take more time than the actual execution. A partitions-to-cores ratio of 2x (or 3x) is a good place to start; see the sketch below.
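
For example, a minimal sketch (the 2x multiplier is the suggested starting point, not a fixed rule, and repartition does trigger a shuffle):

        # Aim for roughly 2 partitions per available core before heavy processing.
        cores = spark.sparkContext.defaultParallelism  # ~ total cores across executors
        df = spark.table("tabl_name").repartition(cores * 2)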

 

Amit