
Number of intermediate files with Sort shuffle in Spark



Hi everyone!

I am trying to understand sort shuffle in Spark and would really appreciate it if someone could answer a simple question. Let's imagine:

1) I have 600 partitions (HDFS blocks, for simplicity).

2) The data is stored on a 6-node cluster.

3) I run Spark with the following parameters:

--executor-memory 13G --executor-cores 6 --num-executors 12 --driver-memory 1G --properties-file my-config.conf

That means each server will have 2 executors with 6 cores each.

4) According to my config, the reduce phase has only 3 reducers.

So my question is: how many files will there be on each server after the sort shuffle?

- 12, the number of active map tasks per server

- 2, the number of executors on each server

- 100, the number of partitions located on this server (for simplicity, I just divide 600 by 6)

My second question: what is the name of the buffer that stores intermediate data on the map side before it is spilled to disk?

Thanks!

Accepted Solutions

Re: Number of intermediate files with Sort shuffle in Spark

Cloudera Employee
Hi,

As described in the sort-based shuffle design doc (https://issues.apache.org/jira/secure/attachment/12655884/Sort-basedshuffledesign.pdf), each map task should generate 1 shuffle data file and 1 index file.

Regarding your second question, the property to specify the buffer for shuffle data is "spark.shuffle.memoryFraction". This is discussed in more detail in the following Cloudera blog: http://blog.cloudera.com/blog/2015/03/how-to-tune-your-apache-spark-jobs-part-2/

Regards,
Bjorn
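To make the file count concrete, here is a small back-of-the-envelope sketch in Python. It only applies the rule stated above (one data file plus one index file per map task) to the numbers from the question. It assumes, as the question does, one map task per partition and an even spread of the 600 partitions over the 6 nodes; the function and variable names are made up for illustration and are not part of any Spark API.

```python
# Rough estimate of sort-shuffle output files per node.
# Assumptions (from the question, not measured): one map task per
# partition, and partitions spread evenly across the nodes.

def sort_shuffle_files(map_tasks: int) -> dict:
    """Each map task writes 1 shuffle data file and 1 index file."""
    return {
        "data_files": map_tasks,
        "index_files": map_tasks,
        "total": 2 * map_tasks,
    }

total_partitions = 600   # HDFS blocks, so 600 map tasks
nodes = 6
reducers = 3             # deliberately unused: it does not affect the count

map_tasks_per_node = total_partitions // nodes   # 100 under the even-spread assumption

per_node = sort_shuffle_files(map_tasks_per_node)
cluster_wide = sort_shuffle_files(total_partitions)

print("per node:", per_node)
print("cluster-wide:", cluster_wide)
```

Under these assumptions the closest option from the question is the third one: the count follows the number of map tasks that ran on the server (about 100 data files, each with an index file), not the number of executors or concurrently active tasks. In a real job the scheduler may not place tasks perfectly evenly, so the per-node figure is only approximate.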