
Hadoop Sorting: Custom Sort and Sort Optimization


Here is my understanding of the way sorting works in Hadoop, along with a few questions. As I understand it, Hadoop decides how to sort keys in the following order:

  1. First, a custom sort comparator is looked for, which is set using job.setSortComparatorClass(). Is this understanding right?

    1.1 The job.setSortComparatorClass() method takes a RawComparator implementation, which declares the "raw" compare(byte[] b1, int s1, int l1, byte[] b2, int s2, int l2) method.

    1.2 This means that the class used for custom sorting needs to implement the compare(byte[] b1, int s1, int l1, byte[] b2, int s2, int l2) method.

    1.3 This also means that it is this compare(byte[] b1, int s1, int l1, byte[] b2, int s2, int l2) method that will ultimately be called for my custom sort.

    1.4 Now, if I create a custom class MyWritableComparator that implements the RawComparator interface and provides the compare(byte[] b1, int s1, int l1, byte[] b2, int s2, int l2) method, then it is as good as the optimized sorting from point 2.1.1, since this is also a byte-level comparison. Is my understanding right? (A sketch of both approaches follows this list.)

    1.5 If I instead have my custom class MyWritableComparator extend WritableComparator (which already provides a default implementation of compare(byte[] b1, int s1, int l1, byte[] b2, int s2, int l2) as well as a compare(WritableComparable a, WritableComparable b) method), and have it override compare(WritableComparable a, WritableComparable b), then during sorting: my MyWritableComparator instance is used ===> the inherited compare(byte[] b1, int s1, int l1, byte[] b2, int s2, int l2) from WritableComparator is called, which deserializes the bytes into WritableComparable objects ===> it then calls the compare(WritableComparable a, WritableComparable b) that my MyWritableComparator has overridden ===> and this is how the custom sort is achieved.

  2. If a custom sort comparator is not set as above, an optimized comparator is looked for, which is registered using WritableComparator.define().

    2.1 WritableComparator.define() takes a WritableComparator, which has two crucial methods:

      2.1.1 compare(byte[] b1, int s1, int l1, byte[] b2, int s2, int l2), which according to the documentation is the hook for optimization because it allows byte-level comparisons.

      2.1.2 compare(WritableComparable a, WritableComparable b), which if overridden is as good as a custom sort order. If this is true, then how is it different from Job.setSortComparatorClass(), which is ideally what is used for custom sorting as mentioned in point 1?

  3. If an optimized comparator is also not registered via the above method, the default natural sort order of the key is used, relying on the default WritableComparator implementation (which falls back to the key's compareTo()).
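
To make points 1.4, 1.5 and 2 concrete, here is a minimal sketch of the two comparator styles. MyKey, MyRawComparator and MyKeySortComparator are hypothetical names, not part of the Hadoop API; MyKey is assumed to be a WritableComparable whose serialized form starts with a 4-byte int.

// Two separate source files; imports shown once for brevity.
// MyKey is a hypothetical WritableComparable whose serialized form starts with a 4-byte int.
import org.apache.hadoop.io.WritableComparable;
import org.apache.hadoop.io.WritableComparator;

// Point 1.4 / 2.1.1: byte-level comparison. Because the raw compare() is overridden,
// the keys are never deserialized; implementing RawComparator directly instead of
// extending WritableComparator would behave the same way.
public class MyRawComparator extends WritableComparator {
    protected MyRawComparator() {
        super(MyKey.class);
    }

    @Override
    public int compare(byte[] b1, int s1, int l1, byte[] b2, int s2, int l2) {
        // Compare the leading int field of each key directly in the serialized bytes.
        int i1 = readInt(b1, s1);
        int i2 = readInt(b2, s2);
        return Integer.compare(i1, i2);
    }
}

// Point 1.5 / 2.1.2: object-level comparison. The inherited raw compare() deserializes
// both keys and then delegates to this method; passing true to the super constructor
// lets WritableComparator allocate the key instances it needs for that deserialization.
public class MyKeySortComparator extends WritableComparator {
    protected MyKeySortComparator() {
        super(MyKey.class, true);
    }

    @Override
    @SuppressWarnings("rawtypes")
    public int compare(WritableComparable a, WritableComparable b) {
        return ((MyKey) b).compareTo((MyKey) a);   // e.g. reverse of the natural order
    }
}

The byte-level comparator would typically be registered globally from a static initializer in the key class, while either comparator can also be set per job, and the job-level setting takes precedence:

static { WritableComparator.define(MyKey.class, new MyRawComparator()); }   // inside MyKey

job.setSortComparatorClass(MyKeySortComparator.class);                      // per job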

So basically:

  • As far as sort optimization is concerned, points 1.4 and 2.1.1 both achieve the same thing, but the Hadoop Definitive Guide suggests only the approach in point 2.1.1.

  • As far as custom sorting is concerned, points 1.5 and 2.1.2 achieve the same thing, but the Hadoop Definitive Guide suggests only the approach in point 1.5.

  • Also, why does job.setSortComparatorClass() require a RawComparator implementation? Sorting is only ever needed on keys in MapReduce, right? Keys have to be WritableComparable, so the job.setSortComparatorClass() method could instead have taken a WritableComparator implementation. Any sorting needed on values is handled by the developer anyway, since there is no API for sorting values the way there is for keys, I believe. Is my understanding right? (See the wiring sketch below.)
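
On that last point: WritableComparator itself implements RawComparator, so a WritableComparator subclass is a perfectly valid argument; declaring the parameter as RawComparator just keeps the contract broad enough to also accept purely byte-level comparators that never deserialize the keys. Here is a small sketch of the wiring, reusing the hypothetical classes above (the class and job names are made up):

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;

public class SortComparatorWiring {
    public static void main(String[] args) throws IOException {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "custom-sort-example");
        // Accepted because WritableComparator (and hence MyKeySortComparator) implements RawComparator:
        job.setSortComparatorClass(MyKeySortComparator.class);
        // The byte-level MyRawComparator would be accepted just as well:
        // job.setSortComparatorClass(MyRawComparator.class);
    }
}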

Any insights would be very helpful.