Created 02-04-2016 07:35 AM
What is the difference between the Partitioner, the Combiner, and the Shuffle and Sort phases in MapReduce? What is the order of execution of these phases? My understanding of the process flow is as follows:
1) Each map task's output is partitioned and sorted in memory, and the Combiner function runs on it. This output is written to local disk as intermediate data.
2) The intermediate data from all the DataNodes goes through a phase called Shuffle and Sort, which is handled by the Hadoop framework.
3) The sorted output is given as input to the Reducers.
Please verify whether this process flow is correct and share your input.
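To make step 1 concrete, here is a rough sketch in plain Python of what happens to the output of a single map task. This is only a toy simulation, not Hadoop code: the names `partition` and `map_side`, the byte-sum hash, and the summing combiner are assumptions chosen for illustration.

```python
from collections import defaultdict
from itertools import groupby
from operator import itemgetter


def partition(key, num_reducers):
    # Toy stand-in for a hash partitioner (assumption): a simple,
    # reproducible hash of the key modulo the number of reducers.
    return sum(key.encode("utf-8")) % num_reducers


def map_side(records, num_reducers):
    """Partition, sort, and combine the output of ONE map task."""
    # 1. Partition: assign each (key, value) pair to a reducer.
    buckets = defaultdict(list)
    for key, value in records:
        buckets[partition(key, num_reducers)].append((key, value))

    spill = {}
    for part, pairs in buckets.items():
        # 2. Sort by key within each partition.
        pairs.sort(key=itemgetter(0))
        # 3. Combine: pre-aggregate values that share a key (here, a sum).
        spill[part] = [
            (key, sum(v for _, v in group))
            for key, group in groupby(pairs, key=itemgetter(0))
        ]
    # This dict plays the role of the intermediate data on local disk.
    return spill


map_output = [("b", 1), ("a", 1), ("b", 1), ("c", 1), ("a", 1)]
intermediate = map_side(map_output, num_reducers=2)
print(intermediate)
```

The combiner here only shrinks the data before it leaves the node; it relies on the sort that ran just before it and does not replace the reducer.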
Created 02-04-2016 09:20 AM
https://developer.yahoo.com/hadoop/tutorial/module4.html
Map -> Combiner -> Partitioner -> Sort -> Shuffle -> Sort -> Reduce
https://farm3.static.flickr.com/2374/3529959828_0b689d1d5c_o.png
https://farm3.static.flickr.com/2275/3529146683_c8247ff6db_o.png
Created 02-07-2016 08:12 PM
May I ask why you care? Is there a specific performance problem, or is it just curiosity?
Created 03-28-2017 02:24 PM
@ Benjamin Leonhardi Why is sorting listed before shuffling? I think sorting always happens after shuffling, since there is already a combiner to combine (sort) the output on a single node. I think that once all the intermediate data has been collected by the shuffle, sorting is used to produce one single input file, which is then used by the reducer.
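One way to see why a sort can appear on both sides of the shuffle: if each map task's output is already sorted by key, the reducer only needs to merge the sorted runs it fetched rather than sort everything from scratch. The sketch below is a plain-Python illustration of that idea, not Hadoop code; `reduce_side` and the sample runs are hypothetical, and the reducer is assumed to be a simple sum.

```python
import heapq
from itertools import groupby
from operator import itemgetter


def reduce_side(sorted_runs):
    """Merge already-sorted runs fetched from several map tasks, then reduce."""
    # Shuffle: the reducer has fetched one sorted run per map task.
    # Sort (merge): every run is already sorted by key, so a k-way merge
    # is enough and no full re-sort of the data is needed.
    merged = heapq.merge(*sorted_runs, key=itemgetter(0))
    # Reduce: each key arrives once, with all of its values grouped together.
    return [
        (key, sum(v for _, v in group))
        for key, group in groupby(merged, key=itemgetter(0))
    ]


# Hypothetical sorted output for ONE partition from three map tasks.
run_from_map_1 = [("a", 2), ("c", 1)]
run_from_map_2 = [("a", 1), ("e", 4)]
run_from_map_3 = [("c", 3)]
result = reduce_side([run_from_map_1, run_from_map_2, run_from_map_3])
print(result)
```

Note that the combiner itself does not sort anything; it only aggregates values for keys that the map-side sort has already placed next to each other.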