Support Questions

fil · ‎07-18-2015

Hi everyone!

i trying to understand Sort shuffle in spark and will very appreciate if someone could answer on simple question, let's imagine:

1) i have 600 partitions (HDFS blocks, for simplicity)

2) it place in 6 node cluster

3) i run spark with follow parameters:

--executor-memory 13G --executor-cores 6 --num-executors 12 --driver-memory 1G --properties-file my-config.conf

that's mean that on each server i will have 2 executor with 6 core each.

4) according my config reduce phase has only 3 reducers.

so, ny question is how many files on each servers will be after Sort Shuffle:

- 12 like a active map task

- 2 like a number of executors on each server

- 100 like a number of partitions that place on this server (for simplicity i just devide 600 on 6)

and the second question is how names buffer for storing intermediate data before spill it on disk on the map stage?

thanks!

bjorn.jonsson · ‎08-10-2015

Hi, As described in the sort based shuffle design doc (https://issues.apache.org/jira/secure/attachment/12655884/Sort-basedshuffledesign.pdf), each map task should generate 1 shuffle data file 1 index file. Regarding your second question, the property to specify the buffer for shuffle data is "spark.shuffle.memoryFraction". This is discussed in more detail in the following Cloudera blog: http://blog.cloudera.com/blog/2015/03/how-to-tune-your-apache-spark-jobs-part-2/ Regards, Bjorn

bjorn.jonsson · ‎08-10-2015

Hi, As described in the sort based shuffle design doc (https://issues.apache.org/jira/secure/attachment/12655884/Sort-basedshuffledesign.pdf), each map task should generate 1 shuffle data file 1 index file. Regarding your second question, the property to specify the buffer for shuffle data is "spark.shuffle.memoryFraction". This is discussed in more detail in the following Cloudera blog: http://blog.cloudera.com/blog/2015/03/how-to-tune-your-apache-spark-jobs-part-2/ Regards, Bjorn

Number of intermediate files with Sort shuffle in Spark