Expert Contributor
Posts: 87
Registered: ‎09-17-2014
Accepted Solution

Number of intermediate files with Sort shuffle in Spark

Hi everyone!


i trying to understand Sort shuffle in spark and will very appreciate if someone could answer on simple question, let's imagine:

1) i have 600 partitions (HDFS blocks, for simplicity)

2) it place in 6 node cluster

3) i run spark with follow parameters: 

--executor-memory 13G --executor-cores 6 --num-executors 12 --driver-memory 1G --properties-file my-config.conf

that's  mean that on each server i will have 2 executor with 6 core each.


4) according my config reduce phase has only 3 reducers.


so, ny question is how many files on each servers will be after Sort Shuffle:

- 12 like a active map task 

- 2 like a number of executors on each server

- 100 like a number of partitions that place on this server (for simplicity i just devide 600 on 6)


and the second question is how names buffer for storing intermediate data before spill it on disk on the map stage?



Cloudera Employee
Posts: 26
Registered: ‎07-20-2014

Re: Number of intermediate files with Sort shuffle in Spark

Hi, As described in the sort based shuffle design doc (, each map task should generate 1 shuffle data file 1 index file. Regarding your second question, the property to specify the buffer for shuffle data is "spark.shuffle.memoryFraction". This is discussed in more detail in the following Cloudera blog: Regards, Bjorn