Number of intermediate files with Sort shuffle in Spark

fil — Fri, 16 Sep 2022 09:34:43 GMT

Hi everyone!

I am trying to understand sort shuffle in Spark and would very much appreciate it if someone could answer a simple question. Let's imagine:

1) I have 600 partitions (HDFS blocks, for simplicity).

2) They are stored on a 6-node cluster.

3) I run Spark with the following parameters:

--executor-memory 13G --executor-cores 6 --num-executors 12 --driver-memory 1G --properties-file my-config.conf

That means each server will have 2 executors with 6 cores each.

4) According to my config, the reduce phase has only 3 reducers.

So my question is: how many files will there be on each server after the sort shuffle?

- 12, the number of active map tasks

- 2, the number of executors on each server

- 100, the number of partitions stored on that server (for simplicity, I just divide 600 by 6)

My second question: what is the name of the buffer that stores intermediate data on the map stage before it is spilled to disk?

Thanks!

Re: Number of intermediate files with Sort shuffle in Spark

bjorn.jonsson — Tue, 11 Aug 2015 00:07:38 GMT

Hi,

As described in the sort-based shuffle design doc (https://issues.apache.org/jira/secure/attachment/12655884/Sort-basedshuffledesign.pdf), each map task should generate 1 shuffle data file and 1 index file.

Regarding your second question, the property that sizes the buffer for shuffle data is "spark.shuffle.memoryFraction". This is discussed in more detail in the following Cloudera blog: http://blog.cloudera.com/blog/2015/03/how-to-tune-your-apache-spark-jobs-part-2/

Regards,
Bjorn
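To make the arithmetic behind this answer concrete, here is a minimal sketch in Python. It assumes one map task per input partition, an even spread of tasks and executors over the 6 nodes, one core per task, and exactly one data file plus one index file per map task once the task has finished. The function name and its parameters are hypothetical, made up for illustration; they are not part of any Spark API.

```python
def sort_shuffle_files_per_node(num_partitions, num_nodes, num_executors, executor_cores):
    """Estimate the per-node map-side output of a sort-based shuffle.

    Assumptions (not guaranteed by Spark): one map task per partition,
    tasks and executors spread evenly over the nodes, one core per task,
    and one data file plus one index file per finished map task.
    """
    map_tasks_per_node = num_partitions // num_nodes
    executors_per_node = num_executors // num_nodes
    return {
        # How many map tasks can run at the same time on one node.
        "concurrent_map_tasks": executors_per_node * executor_cores,
        # Each finished map task leaves one data file and one index file.
        "data_files": map_tasks_per_node,
        "index_files": map_tasks_per_node,
        "total_files": 2 * map_tasks_per_node,
    }


# Numbers from the question: 600 partitions, 6 nodes, 12 executors, 6 cores each.
estimate = sort_shuffle_files_per_node(600, 6, 12, 6)
print(estimate)
```

Under these assumptions the closest of the three options is 100: each server ends up with about 100 data files and 100 index files, while only 12 map tasks run there at any one time. The number of reducers (3) does not change the file count; it only determines how many segments each data file is divided into.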