In Hadoop: The Definitive Guide, it is mentioned that:
"When all the map outputs have been copied, the reduce task moves into the sort phase (which should properly be called the merge phase, as the sorting was carried out on the map side)".
Does this mean that no sorting is done during the sort phase? The reducer receives map outputs from different mappers, and taken together these are not completely sorted; each one is sorted only within its own partition.
But the statement above sounds as if nothing is sorted in the merge (or sort) phase of the reduce-side shuffle and sort.
I believe what it's saying is that during the map phase, each mapper sorts its output within each partition, and during the 'sort' (or 'merge') phase, the reducer merges those already-sorted pieces so that its whole input ends up sorted by key. This is like the mergesort algorithm, which sorts small chunks of the data and then merges those small chunks into larger chunks of sorted data.
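To make the merge idea concrete, here is a minimal sketch in Python rather than actual Hadoop code. It assumes three hypothetical map outputs for a single reduce partition, each already sorted by key, and uses the standard library's `heapq.merge` to combine them without re-sorting. The names `map_outputs` and `merge_sorted_runs` are made up for illustration.

```python
import heapq

# Hypothetical map outputs destined for ONE reducer (one partition).
# Each list comes from a different mapper and is already sorted by key.
map_outputs = [
    [("apple", 1), ("cat", 1), ("dog", 1)],
    [("apple", 1), ("bird", 1)],
    [("cat", 1), ("emu", 1)],
]


def merge_sorted_runs(runs):
    """Merge already-sorted runs into one sorted list of (key, value) pairs.

    No full sort happens here: heapq.merge only compares the current head
    of each run, which is why the inputs must be sorted beforehand.
    """
    return list(heapq.merge(*runs, key=lambda kv: kv[0]))


merged = merge_sorted_runs(map_outputs)
for key, value in merged:
    print(key, value)
```

The merged stream has equal keys next to each other (both "apple" records, then "bird", and so on), which is what lets the reducer see all values for a key in one group. If the inputs were not already sorted, this merge would not produce sorted output, which matches the book's point that the real sorting work happened on the map side.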