Sort the output of a MapReduce job

ungarala — Fri, 16 Sep 2022 10:09:47 GMT

One of the objectives in the HDPCD:Java exam is to sort the "output" of an MR job using setGroupingComparatorClass:

http://hadoop.apache.org/docs/r2.6.0/api/org/apache/hadoop/mapreduce/Job.html#setGroupingComparatorClass(java.lang.Class)

My understanding is that a grouping comparator is for grouping records from multiple partitions. How can it be used for sorting? Do you mean setSortComparatorClass?

Thanks for your help!

Re: Sort the output of a MapReduce job

rich1 — Wed, 16 Mar 2016 21:42:38 GMT

Sorting in MR applies to two areas:

Sort output by keys: this is done "naturally" in the sense that the keys are already sorted when they reach the reducer. The compareTo method in the key class determines this natural ordering.
Secondary sort: both the keys and the values are sorted. That involves writing a grouping comparator class and registering it with the MR Job using the setGroupingComparatorClass method.

The exam objective you listed above refers to both. The first one is fairly straightforward: you implement the compareTo method in your key class. The secondary sort involves a bit more work. This blog post has an example of how to implement one:

https://vangjee.wordpress.com/2012/03/20/secondary-sorting-aka-sorting-values-in-hadoops-mapreduce-programming-paradigm/
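
To illustrate the first case, here is a minimal sketch of a key whose compareTo method defines the natural order. It is plain Java so it runs without a cluster: StationKey is a hypothetical key type, and Collections.sort only stands in for the sort that the framework performs before keys reach the reducer. In a real job the key would implement org.apache.hadoop.io.WritableComparable and also provide write and readFields.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

public class NaturalKeyOrder {

    // Hypothetical key type (an assumption for illustration). A real Hadoop key
    // would implement WritableComparable and add write(), readFields(),
    // hashCode(), and equals().
    static final class StationKey implements Comparable<StationKey> {
        final String station;

        StationKey(String station) {
            this.station = station;
        }

        // Defines the "natural" order in which keys are sorted.
        @Override
        public int compareTo(StationKey other) {
            return station.compareTo(other.station);
        }

        @Override
        public String toString() {
            return station;
        }
    }

    // Sorts the keys as compareTo dictates; a stand-in for the shuffle sort.
    static List<String> sortedStations(String... stations) {
        List<StationKey> keys = new ArrayList<>();
        for (String s : stations) {
            keys.add(new StationKey(s));
        }
        Collections.sort(keys);
        List<String> out = new ArrayList<>();
        for (StationKey k : keys) {
            out.add(k.toString());
        }
        return out;
    }

    public static void main(String[] args) {
        // prints [bergen, oslo, tromso]
        System.out.println(sortedStations("oslo", "bergen", "tromso"));
    }
}
```

Reversing the comparison inside compareTo would make the keys arrive in descending order instead.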

Re: Sort the output of a MapReduce job

ungarala — Wed, 16 Mar 2016 22:38:15 GMT

Thanks @Rich Raposa. It was actually a little confusing to see ONLY setGroupingComparatorClass mentioned in the objective, since a secondary sort involves writing comparator classes for both sorting and grouping and registering them with both setSortComparatorClass and setGroupingComparatorClass.
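
The division of labor between the two comparators can be sketched in plain Java. This is a simulation under stated assumptions, not Hadoop code: CompositeKey, SORT, GROUP, and reduceGroups are hypothetical names. SORT plays the role of the class passed to setSortComparatorClass, and GROUP plays the role of the class passed to setGroupingComparatorClass. In a real job both would typically extend WritableComparator, and a partitioner on the natural key would also be needed so that all records for one symbol reach the same reducer.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

public class SecondarySortSketch {

    // Hypothetical composite key: the natural key (symbol) plus the field
    // that the values should be ordered by (timestamp).
    static final class CompositeKey {
        final String symbol;
        final long timestamp;

        CompositeKey(String symbol, long timestamp) {
            this.symbol = symbol;
            this.timestamp = timestamp;
        }
    }

    // Sort comparator role: order by symbol, then by timestamp.
    static final Comparator<CompositeKey> SORT =
            Comparator.comparing((CompositeKey k) -> k.symbol)
                      .thenComparingLong(k -> k.timestamp);

    // Grouping comparator role: keys with the same symbol count as equal.
    static final Comparator<CompositeKey> GROUP =
            Comparator.comparing((CompositeKey k) -> k.symbol);

    // Sorts all keys with SORT, then starts a new group (one simulated
    // reduce call) each time GROUP reports that the key has changed.
    static List<String> reduceGroups(List<CompositeKey> keys) {
        List<CompositeKey> sorted = new ArrayList<>(keys);
        sorted.sort(SORT);
        List<String> groups = new ArrayList<>();
        StringBuilder current = null;
        CompositeKey previous = null;
        for (CompositeKey k : sorted) {
            if (previous == null || GROUP.compare(previous, k) != 0) {
                if (current != null) {
                    groups.add(current.toString());
                }
                current = new StringBuilder(k.symbol + ":");
            }
            current.append(' ').append(k.timestamp);
            previous = k;
        }
        if (current != null) {
            groups.add(current.toString());
        }
        return groups;
    }

    public static void main(String[] args) {
        List<CompositeKey> keys = new ArrayList<>();
        keys.add(new CompositeKey("b", 3));
        keys.add(new CompositeKey("a", 2));
        keys.add(new CompositeKey("b", 1));
        keys.add(new CompositeKey("a", 1));
        // prints [a: 1 2, b: 1 3]
        System.out.println(reduceGroups(keys));
    }
}
```

The sort comparator alone decides the order in which records arrive. The grouping comparator only decides where one reduce call ends and the next begins, which is why it cannot sort anything by itself.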