Sort the output of a MapReduce job

Explorer

@Rich Raposa

One of the objectives in the HDPCD:Java exam is to sort the "output" of an MR job using http://hadoop.apache.org/docs/r2.6.0/api/org/apache/hadoop/mapreduce/Job.html#setGroupingComparatorC... My understanding is that a grouping comparator is for grouping records from multiple partitions. How can it be used for sorting? Do you mean using setSortComparatorClass?

Thanks for your help!

1 ACCEPTED SOLUTION

Guru

Sorting in MR applies to two areas:

  1. Sort output by keys: this is done "naturally" in the sense that the keys are sorted as they come into the reducer. By default, the compareTo method in the key class determines this natural sorting.
  2. Secondary sort: both the keys and values are sorted. That involves writing a grouping comparator class and then registering that class with the MR Job using the setGroupingComparatorClass method.
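
As a rough illustration of the first item, here is a minimal plain-Java sketch of a key class whose compareTo defines the natural order. It deliberately avoids the Hadoop libraries so it runs on its own: StockKey, its symbol and timestamp fields, and NaturalSortDemo are hypothetical names, and a real key class would implement WritableComparable (with write and readFields) rather than plain Comparable.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

class NaturalSortDemo {
    // Stands in for the framework sorting map output keys before the
    // reduce phase; here it is just a sort using compareTo.
    static List<StockKey> sortKeys(List<StockKey> keys) {
        List<StockKey> sorted = new ArrayList<>(keys);
        Collections.sort(sorted);
        return sorted;
    }

    public static void main(String[] args) {
        List<StockKey> keys = Arrays.asList(
                new StockKey("IBM", 2L),
                new StockKey("AAPL", 5L),
                new StockKey("IBM", 1L));
        System.out.println(sortKeys(keys));
    }
}

// Hypothetical key class. In a real job this would implement
// org.apache.hadoop.io.WritableComparable<StockKey> and also provide
// write() and readFields(), which are omitted in this sketch.
final class StockKey implements Comparable<StockKey> {
    final String symbol;
    final long timestamp;

    StockKey(String symbol, long timestamp) {
        this.symbol = symbol;
        this.timestamp = timestamp;
    }

    // Natural order: by symbol first, then by timestamp.
    @Override
    public int compareTo(StockKey other) {
        int bySymbol = symbol.compareTo(other.symbol);
        if (bySymbol != 0) {
            return bySymbol;
        }
        return Long.compare(timestamp, other.timestamp);
    }

    @Override
    public String toString() {
        return symbol + "@" + timestamp;
    }
}
```

With the sample keys above, the sorted order is AAPL@5, IBM@1, IBM@2, which is the order in which a reducer would see them under these assumptions.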

The exam objective you listed above is referring to both. The first one is fairly straightforward - you implement the compareTo method in your key class. The secondary sort involves a bit more work. There is a nice blog here that has an example of how to implement a secondary sort:

https://vangjee.wordpress.com/2012/03/20/secondary-sorting-aka-sorting-values-in-hadoops-mapreduce-p...
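
To make the roles of the two comparators concrete, here is a small plain-Java simulation of what happens between map and reduce in a secondary sort. It is a sketch under stated assumptions, not Hadoop code: CompositeKey, SORT, GROUP, and shuffle are hypothetical names, and plain java.util.Comparator instances stand in for the comparator classes that a real job would register with setSortComparatorClass and setGroupingComparatorClass.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

class SecondarySortDemo {
    // Hypothetical composite key: the natural key (symbol) plus the
    // field of the value that should arrive sorted (timestamp).
    static final class CompositeKey {
        final String symbol;
        final long timestamp;

        CompositeKey(String symbol, long timestamp) {
            this.symbol = symbol;
            this.timestamp = timestamp;
        }
    }

    // Plays the role of the sort comparator: symbol, then timestamp.
    static final Comparator<CompositeKey> SORT =
            Comparator.comparing((CompositeKey k) -> k.symbol)
                      .thenComparingLong(k -> k.timestamp);

    // Plays the role of the grouping comparator: symbol only, so keys
    // that differ only in timestamp land in the same reduce call.
    static final Comparator<CompositeKey> GROUP =
            Comparator.comparing((CompositeKey k) -> k.symbol);

    // Sorts with SORT, then starts a new group whenever GROUP reports
    // that the current key differs from the previous one.
    static Map<String, List<Long>> shuffle(List<CompositeKey> mapOutput) {
        List<CompositeKey> sorted = new ArrayList<>(mapOutput);
        sorted.sort(SORT);

        Map<String, List<Long>> reduceCalls = new LinkedHashMap<>();
        CompositeKey previous = null;
        List<Long> current = null;
        for (CompositeKey key : sorted) {
            if (previous == null || GROUP.compare(previous, key) != 0) {
                current = new ArrayList<>();
                reduceCalls.put(key.symbol, current);
            }
            current.add(key.timestamp);
            previous = key;
        }
        return reduceCalls;
    }

    public static void main(String[] args) {
        List<CompositeKey> mapOutput = Arrays.asList(
                new CompositeKey("IBM", 3L),
                new CompositeKey("AAPL", 9L),
                new CompositeKey("IBM", 1L),
                new CompositeKey("AAPL", 2L));
        System.out.println(shuffle(mapOutput));
    }
}
```

For the sample input this prints {AAPL=[2, 9], IBM=[1, 3]}: one group per symbol, with the timestamps already in order inside each group. In an actual job the composite key would also need to be partitioned on the natural key alone (for example with a custom partitioner registered through setPartitionerClass), so that all records for one symbol reach the same reducer.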

REPLIES

Explorer

Thanks @Rich Raposa. It was actually a little confusing to see ONLY setGroupingComparatorClass mentioned in the objective, since a secondary sort involves writing comparator classes for both sorting and grouping, and using both the setSortComparatorClass and setGroupingComparatorClass methods.