Archives of Support Questions (Read Only)

This is an archived board for historical reference. Information and links may no longer be available or relevant
Announcements
This board is archived and read-only for historical reference. To ask a new question, please post a new topic on the appropriate active board.

Number of intermediate files with Sort shuffle in Spark

Rising Star

Hi everyone!

 

I'm trying to understand sort shuffle in Spark, and I'd really appreciate it if someone could answer a simple question. Let's imagine:

1) I have 600 partitions (HDFS blocks, for simplicity)

2) they are placed on a 6-node cluster

3) I run Spark with the following parameters:

--executor-memory 13G --executor-cores 6 --num-executors 12 --driver-memory 1G --properties-file my-config.conf

That means that on each server I will have 2 executors with 6 cores each.

 

4) according to my config, the reduce phase has only 3 reducers.

 

So, my question is: how many files will there be on each server after sort shuffle?

- 12, like the number of active map tasks

- 2, like the number of executors on each server

- 100, like the number of partitions placed on this server (for simplicity I just divide 600 by 6)

 

And my second question: what is the name of the buffer that stores intermediate data before it is spilled to disk during the map stage?

 

Thanks!

1 ACCEPTED SOLUTION

Rising Star
Hi,

As described in the sort-based shuffle design doc (https://issues.apache.org/jira/secure/attachment/12655884/Sort-basedshuffledesign.pdf), each map task should generate 1 shuffle data file and 1 index file.

Regarding your second question, the property to specify the buffer for shuffle data is "spark.shuffle.memoryFraction". This is discussed in more detail in the following Cloudera blog: http://blog.cloudera.com/blog/2015/03/how-to-tune-your-apache-spark-jobs-part-2/

Regards,
Bjorn
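To make the arithmetic behind this answer concrete, here is a minimal sketch (the function name is made up for illustration). It assumes one map task per input HDFS block and an even spread of blocks across the nodes, both simplifications taken from the question; under sort-based shuffle, each map task writes 1 data file plus 1 index file, independent of the number of reducers:

```python
def shuffle_files_per_node(total_partitions, num_nodes):
    """Estimate sort-shuffle output files per node (illustrative only)."""
    # One map task per input partition, spread evenly across nodes.
    map_tasks_per_node = total_partitions // num_nodes
    # Each map task writes 1 shuffle data file + 1 index file,
    # regardless of reducer count (3 in the example above).
    return map_tasks_per_node * 2

# For the scenario in the question: 600 partitions on 6 nodes.
print(shuffle_files_per_node(600, 6))  # -> 200 files per node
```

So with 100 map tasks landing on each server, you would expect about 200 shuffle files per server, not 12, 2, or 100.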
