
Shuffle and sort question

New Contributor

In Hadoop: The Definitive Guide, it is stated that:

"When all the map outputs have been copied, the reduce task moves into the sort phase (which should properly be called the merge phase, as the sorting was carried out on the map side)".

Does this mean that no sorting is done during the sort phase? The reducer receives map output from many different mappers, and that output is not sorted as a whole; each mapper's output is only sorted within its own partition.

Yet the statement above makes it sound as though nothing is sorted during the merge (or sort) phase of the reduce-side shuffle and sort.


Re: Shuffle and sort question

Master Collaborator
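I believe what it's saying is that during the map phase, each mapper sorts
its output within each partition, and during the 'sort' (or 'merge') phase,
the reducer merges those already-sorted pieces so that all of the input to
that reducer ends up sorted by key. Nothing has to be sorted from scratch on
the reduce side; it only has to be merged. It's just like the way the
mergesort algorithm sorts small chunks of the data and then merges those
small chunks into larger chunks of sorted data.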
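
To make the merge step concrete, here is a rough sketch in Python rather than Hadoop's own Java code. It assumes three map tasks have each produced a key-sorted run for the same reducer partition; the data and the name merge_sorted_runs are made up for illustration. It uses the standard library's heapq.merge, which does a k-way merge of inputs that are already sorted and never re-sorts them.

```python
import heapq


def merge_sorted_runs(runs):
    """Merge already-sorted lists of (key, value) pairs into one sorted list.

    Illustrative helper, not part of Hadoop. Each run must already be
    sorted by key; heapq.merge only interleaves them.
    """
    return list(heapq.merge(*runs, key=lambda kv: kv[0]))


# Hypothetical outputs of three map tasks for ONE reducer's partition.
# Each list is sorted by key, but the three lists overlap each other.
map_outputs = [
    [("apple", 1), ("cherry", 1), ("pear", 1)],
    [("apple", 1), ("banana", 1)],
    [("banana", 1), ("pear", 1), ("plum", 1)],
]

merged = merge_sorted_runs(map_outputs)
keys = [k for k, _ in merged]
print(keys)
```

After the merge, equal keys sit next to each other, which is what lets the reducer see all the values for one key in a single call.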
