I am trying to understand sort shuffle in Spark and would really appreciate it if someone could answer a simple question. Let's imagine:
1) I have 600 partitions (HDFS blocks, for simplicity).
2) They are stored on a 6-node cluster.
3) I run Spark with the following parameters:
--executor-memory 13G --executor-cores 6 --num-executors 12 --driver-memory 1G --properties-file my-config.conf
This means that each server will run 2 executors with 6 cores each.
4) According to my config, the reduce phase has only 3 reducers.
So my question is: how many files will there be on each server after the sort shuffle?
- 12, the number of map tasks active at once on a server (2 executors x 6 cores)
- 2, the number of executors on each server
- 100, the number of partitions stored on this server (for simplicity, I just divide 600 by 6)
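
To make the three candidate numbers easier to check, here is a minimal sketch of the arithmetic behind them. It only restates the setup from this question and does not say which candidate is correct. It assumes that executors and HDFS blocks are spread evenly across the nodes, which a real cluster does not guarantee; the variable names are mine, not Spark settings.

```python
# Cluster layout taken from the question.
nodes = 6
num_executors = 12       # --num-executors
executor_cores = 6       # --executor-cores
input_partitions = 600   # HDFS blocks, one map task per block

# Assumption: executors are distributed evenly across the nodes.
executors_per_node = num_executors // nodes

# Each core runs one task at a time, so this is the number of
# map tasks that can be active at once on one server.
active_map_tasks_per_node = executors_per_node * executor_cores

# Assumption: blocks are distributed evenly and every map task
# runs on the node that holds its block.
map_tasks_per_node = input_partitions // nodes

print("executors per node:       ", executors_per_node)
print("active map tasks per node:", active_map_tasks_per_node)
print("map tasks per node:       ", map_tasks_per_node)
```

These are the values 2, 12 and 100 used in the list above.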
And the second question: what is the name of the buffer that stores intermediate data on the map stage before it is spilled to disk?