Support Questions


Why does MapReduce perform the sorting operation multiple times during execution?

Contributor

1. As I understand it, each mapper's output is initially written into an in-memory buffer. When the buffer reaches 80% of its capacity, it is spilled to disk, and partitioning and sorting happen during the spill. My question: if we are already partitioning, why do we also need to sort?

2. Sorting is also done in the "Shuffle and Sort" phase. Why is sorting needed here?

3. Sorting is done again in the reducer phase. Is the reason simply that we want to give sorted output to the end user?

So it looks like we sort nearly three times, and sorting is a costly operation.

Can someone help me understand why sorting is done this many times?

1 ACCEPTED SOLUTION

1+2) It's simply the way Hadoop works. MapReduce guarantees that the input to each reducer is sorted by key. There are two reasons for this:

a) By definition, a reducer is an operation on a key and ALL values that belong to that key, regardless of which mapper they came from. A reducer could read the full input set and build a big hashmap in memory, but that would be prohibitively expensive, so the other option is to sort the input dataset. The reducer then simply reads all values for key1, and the moment it sees key2 it knows there will be no more values for key1. So the reducer input has to be sorted for the reducer function to work at all.

b) Sorting the keys also brings nice side benefits, such as the ability to do a global sort more or less for free.

3) Reducers only merge-sort the already-sorted outputs of the different mappers so that they end up with a single sorted input list. This is low effort precisely because each input set is already sorted.

4) "I have seen like nearly 3 times we are doing sorting and Sorting is too costly operation."

No, you only sort once. The output of each mapper is sorted locally, and the reducers merge-sort the inputs from the mappers; together that is a single global sort operation, split into a local sort on the map side and a merge on the reduce side. And, as explained above, the reducer input HAS to be sorted for the reducer to work.
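Putting the pieces together, here is a minimal end-to-end sketch of that single global sort (plain Python, with hypothetical helper names, not Hadoop code): each mapper partitions its output by key hash and locally sorts each partition (the spill step), and each reducer merges the sorted runs it receives before grouping. Note that partitioning alone would not make values for a key adjacent, which is why the spill also sorts.

```python
import heapq
from itertools import groupby
from operator import itemgetter

NUM_REDUCERS = 2

def map_side(records):
    """Partition by key hash, then locally sort each partition
    (this models the sort done during the spill to disk)."""
    partitions = [[] for _ in range(NUM_REDUCERS)]
    for key, value in records:
        partitions[hash(key) % NUM_REDUCERS].append((key, value))
    return [sorted(p) for p in partitions]

def reduce_side(runs):
    """Merge the already-sorted runs from every mapper, then reduce
    each key's now-adjacent values (here: sum them)."""
    merged = heapq.merge(*runs)
    return {k: sum(v for _, v in g)
            for k, g in groupby(merged, key=itemgetter(0))}

# Two mappers, each locally partitioning and sorting its own output.
m1 = map_side([("b", 1), ("a", 2), ("c", 3)])
m2 = map_side([("a", 4), ("c", 5), ("b", 6)])

# Reducer r merges the r-th run of every mapper. Only the mappers ever
# sorted; the reducers just merged, yet every key is fully grouped.
totals = {}
for r in range(NUM_REDUCERS):
    totals.update(reduce_side([m1[r], m2[r]]))
# totals == {"a": 6, "b": 7, "c": 8}
```

Each key hashes to exactly one reducer, so every reducer sees all values for its keys, already adjacent after the merge.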


