I am using pyspark2.3.0 . With below syntax
pyspark2 --driver-memory 4G --executor-cores 3 --driver-cores 3 --num-executors 16 --executor-memory 4G
On execution of above statement making I am reserving the above cores mentioned
Now the problem is I am firing below query, Below mentioned table is partitioned and it is of 450GB
spark.sql("select * from tabl_name limit 20").show()
On firing above query we have ,There are 10000 tasks created and execution starts at end of 4th min.
Is there any help so that I could reduce the time to wait till 4 min. Please let me know if any parameter I should be setting
Thanks and Regards,
Naveen Srikanth D
By design, Number of tasks = Number of partitions.
Typically, there is a partition for each HDFS block being read.
This means: 1 Spark task <--> 1 Spark partition <--> 1 HDFS block
When a job is run, Spark makes a determination of where to execute the task based on certain factors such as available memory or cores on a node, where the data is located in a cluster, or available executors.
By default, spark waits for 3ms to prefer launching a task on a node where the actual data resides. This parameter is referred to as spark.locality.wait. When you've several tasks this can be a bottleneck and can increase the overall startup time, however, remember that having a data local to the task can actually reduce the time a task takes to complete. Please explore this value (try with 0ms) and keep in mind the pros and cons of this. For a detailed discussion see this link
Also note that spark.locality.wait is only relevant when dynamic allocation is enabled (with static allocation, it has no idea what data you're even going to read when it requests containers, so it can't use any locality info).
On a side note, you should also look at the number of partitions. With too many partitions, the task scheduling may take more time than the actual execution time. Ideally, a ratio (partitions to cores) of 2X (or 3X) should be a good place to start with.