Support Questions

Find answers, ask questions, and share your expertise

Number of intermediate files with Sort shuffle in Spark

avatar
Rising Star

Hi everyone!

 

i trying to understand Sort shuffle in spark and will very appreciate if someone could answer on simple question, let's imagine:

1) i have 600 partitions (HDFS blocks, for simplicity)

2) it place in 6 node cluster

3) i run spark with follow parameters: 

--executor-memory 13G --executor-cores 6 --num-executors 12 --driver-memory 1G --properties-file my-config.conf

that's  mean that on each server i will have 2 executor with 6 core each.

 

4) according my config reduce phase has only 3 reducers.

 

so, ny question is how many files on each servers will be after Sort Shuffle:

- 12 like a active map task 

- 2 like a number of executors on each server

- 100 like a number of partitions that place on this server (for simplicity i just devide 600 on 6)

 

and the second question is how names buffer for storing intermediate data before spill it on disk on the map stage?

 

thanks!

1 ACCEPTED SOLUTION

avatar
Rising Star
Hi, As described in the sort based shuffle design doc (https://issues.apache.org/jira/secure/attachment/12655884/Sort-basedshuffledesign.pdf), each map task should generate 1 shuffle data file 1 index file. Regarding your second question, the property to specify the buffer for shuffle data is "spark.shuffle.memoryFraction". This is discussed in more detail in the following Cloudera blog: http://blog.cloudera.com/blog/2015/03/how-to-tune-your-apache-spark-jobs-part-2/ Regards, Bjorn

View solution in original post

1 REPLY 1

avatar
Rising Star
Hi, As described in the sort based shuffle design doc (https://issues.apache.org/jira/secure/attachment/12655884/Sort-basedshuffledesign.pdf), each map task should generate 1 shuffle data file 1 index file. Regarding your second question, the property to specify the buffer for shuffle data is "spark.shuffle.memoryFraction". This is discussed in more detail in the following Cloudera blog: http://blog.cloudera.com/blog/2015/03/how-to-tune-your-apache-spark-jobs-part-2/ Regards, Bjorn