New Contributor
Posts: 3
Registered: 06-19-2015

Shuffle and sort question

Hadoop: The Definitive Guide says:

"When all the map outputs have been copied, the reduce task moves into the sort phase (which should properly be called the merge phase, as the sorting was carried out on the map side)".

Does this mean that no sorting is done during the sort phase? The reducer receives map output partitions from different mappers, and these are not sorted as a whole; each one is only sorted within its own partition.

Yet the statement above sounds as if nothing is sorted during the merge (or sort) phase of the reduce-side shuffle and sort.

Cloudera Employee
Posts: 435
Registered: 07-12-2013

Re: Shuffle and sort question

I believe what it's saying is that during the map phase, each mapper sorts
its output within each partition, and during the 'sort' or 'merge' phase,
the reducer merges those sorted partitions so that its whole input is
sorted by key. No fresh sort is needed, only a merge: just like the
mergesort algorithm sorts small chunks of the data, and then merges the
small chunks to create larger chunks of sorted data.
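
To make the merge idea concrete, here is a rough sketch in plain Python rather than Hadoop code. The data and the names `map_outputs` and `merge_sorted_runs` are made up for illustration, and Python's standard `heapq.merge` stands in for the reduce-side merge, which in Hadoop works on on-disk and in-memory segments and is more involved. It only shows that runs already sorted on the map side can be combined into one sorted stream without sorting again.

```python
import heapq

# Hypothetical map outputs destined for ONE reducer (one partition).
# Each mapper has already sorted its own segment by key.
map_outputs = [
    [("apple", 1), ("cherry", 1), ("kiwi", 1)],   # from mapper 0
    [("banana", 1), ("cherry", 1)],               # from mapper 1
    [("apple", 1), ("banana", 1), ("melon", 1)],  # from mapper 2
]

def merge_sorted_runs(runs):
    """Merge already-sorted runs into one sorted list.

    heapq.merge repeatedly picks the smallest head element among the
    runs, so it never re-sorts the data inside a run.
    """
    return list(heapq.merge(*runs, key=lambda kv: kv[0]))

merged = merge_sorted_runs(map_outputs)
for key, value in merged:
    print(key, value)
```

After the merge, equal keys from different mappers sit next to each other, which is what lets the reducer see all values for a key together.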