Support Questions


Why does MapReduce perform the sorting operation multiple times during execution?

Contributor

1. As I understand it, each mapper's output is initially written into an in-memory buffer. When the buffer reaches 80% of its capacity, it is spilled to disk, and partitioning and sorting happen during the spill. My question: if we are already partitioning, why do we also need to sort?

2. Sorting is also done in the "Shuffle and Sort" phase. Why is sorting needed here?

3. Sorting is done again in the reducer phase. Is the reason simply that we want to give sorted output to the end user?

So it looks like we sort nearly three times, and sorting is a costly operation.

Can someone help me understand why sorting is done this many times?

1 ACCEPTED SOLUTION

1+2) It's simply the way Hadoop works. MapReduce guarantees that the input to each reducer is sorted by key. There are two reasons for this:

a) By definition, a reducer is an operation on a key and ALL values that belong to that key, regardless of which mapper they came from. A reducer could read the full input set and build a big hashmap in memory, but that would be prohibitively expensive, so the other option is to sort the input dataset. The reducer then simply reads all values for key1, and the moment it sees key2 it knows there will be no more values for key1. So the reducer input has to be sorted for the reducer function to work at all.

b) Sorting the keys also brings nice side benefits, such as the ability to do a global sort more or less for free.

3) Reducers only merge-sort the already-sorted outputs of the different mappers so that they end up with a single sorted input list. This is low effort precisely because each input set is already sorted.

4) "I have seen like nearly 3 times we are doing sorting and Sorting is too costly operation."

No, you only sort once. The output of each mapper is sorted locally, and the reducers merge-sort the inputs from the mappers; together that is a single global sort operation, split into a local sort on the map side and a merge on the reduce side. And, as explained above, the reducer input HAS to be sorted for the reducer to work.
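Putting the pieces together, here is a minimal end-to-end sketch of that single global sort (plain Python, with hypothetical helper names, not Hadoop code): each mapper partitions its output by key hash and locally sorts each partition (the spill step), and each reducer merges the sorted runs it receives before grouping. Note that partitioning alone would not make values for a key adjacent, which is why the spill also sorts.

```python
import heapq
from itertools import groupby
from operator import itemgetter

NUM_REDUCERS = 2

def map_side(records):
    """Partition by key hash, then locally sort each partition
    (this models the sort done during the spill to disk)."""
    partitions = [[] for _ in range(NUM_REDUCERS)]
    for key, value in records:
        partitions[hash(key) % NUM_REDUCERS].append((key, value))
    return [sorted(p) for p in partitions]

def reduce_side(runs):
    """Merge the already-sorted runs from every mapper, then reduce
    each key's now-adjacent values (here: sum them)."""
    merged = heapq.merge(*runs)
    return {k: sum(v for _, v in g)
            for k, g in groupby(merged, key=itemgetter(0))}

# Two mappers, each locally partitioning and sorting its own output.
m1 = map_side([("b", 1), ("a", 2), ("c", 3)])
m2 = map_side([("a", 4), ("c", 5), ("b", 6)])

# Reducer r merges the r-th run of every mapper. Only the mappers ever
# sorted; the reducers just merged, yet every key is fully grouped.
totals = {}
for r in range(NUM_REDUCERS):
    totals.update(reduce_side([m1[r], m2[r]]))
# totals == {"a": 6, "b": 7, "c": 8}
```

Each key hashes to exactly one reducer, so every reducer sees all values for its keys, already adjacent after the merge.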


