Support Questions

missing-link · ‎03-06-2014

Greetings,
I was wondering if we can combine multiple "calculations" within a single job that operate on the same data.

For example, let's go back to the basic wordcount. Given a large document, along with
a) the total number of words, we want to count
b) the total number of sentences and
c) the total number of paragraphs in it.

All three tasks operate on the same data and differ only in the delimiter used for tokenization. Is it valid to combine all three within the same job in the same MapReduce program, or do we have to write three individual programs and run each on the whole dataset?

So to generalize: can we "combine" different calculations on the same data? What if one of the calculations needed to emit different <key, value> data types (both on the mapper and the reducer)? What are the pros and cons of such an approach? Is it safe? Will it be faster than running three jobs separately?

Can the reducer emit different output files for each calculation? What is the best implementation for the whole approach?

Thanks in advance!

Harsh J · ‎03-07-2014

Wherever possible, it is always good to combine multiple goals into the same job to avoid extra I/O and additional job bootup/cleanup overheads. Optimisers in Hive and Pig try to do this as well.
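To make the idea concrete, here is a minimal plain-Java sketch of the word/sentence/paragraph example from the question. It is not Hadoop code: the class `CombinedCounts`, the tag names, and the delimiter rules (blank line ends a paragraph, `.`, `!` or `?` ends a sentence, whitespace separates words) are assumptions made for illustration. It only shows the shape of the approach, where one pass over the input emits a tagged count for each calculation and the sums are kept per tag.

```java
import java.util.Map;
import java.util.TreeMap;

public class CombinedCounts {

    // Stand-in for the map step: one pass over a paragraph emits a
    // tagged count for each of the three calculations.
    static void map(String paragraph, Map<String, Long> out) {
        String trimmed = paragraph.trim();
        if (trimmed.isEmpty()) {
            return;
        }
        emit(out, "paragraphs", 1);
        // Assumed sentence delimiter: one or more of . ! ?
        for (String sentence : trimmed.split("[.!?]+")) {
            if (!sentence.trim().isEmpty()) {
                emit(out, "sentences", 1);
            }
        }
        // Assumed word delimiter: any run of whitespace
        emit(out, "words", trimmed.split("\\s+").length);
    }

    // Stand-in for the reduce step: sum the counts that share a tag.
    static void emit(Map<String, Long> out, String tag, long n) {
        out.merge(tag, n, Long::sum);
    }

    // Reads the document once and returns all three totals.
    static Map<String, Long> count(String document) {
        Map<String, Long> totals = new TreeMap<>();
        // Assumed paragraph delimiter: a blank line
        for (String paragraph : document.split("\\n\\s*\\n")) {
            map(paragraph, totals);
        }
        return totals;
    }

    public static void main(String[] args) {
        String doc = "One job. Three counts!\n\nIs it faster? Usually.";
        System.out.println(count(doc)); // {paragraphs=2, sentences=4, words=8}
    }
}
```

In a real job the tag would be the map output key, so the input is still read only once while each calculation is summed independently on the reduce side.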

The reducer can emit different outputs via the MultipleOutputs classes. For map outputs, you'll need to build some form of "general" class and hierarchy that can cater to different styles of output and partition them appropriately. It's definitely possible, but needs some additional work.
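As a rough illustration of the "general" class idea, the plain-Java sketch below wraps each map output in a tagged value so that calculations with different value types can travel through the same job. Everything in it is hypothetical: the `Tagged` class, the two calculations (`wordCount` with a `Long` payload, `longestWord` with a `String` payload), and the in-memory map that stands in for one named output per calculation. In Hadoop itself, `GenericWritable` plays a similar wrapper role for values, and `MultipleOutputs` is what would write each calculation to its own output.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class TaggedOutputs {

    // Hypothetical "general" value: a tag naming the calculation
    // plus a payload whose type depends on that calculation.
    static final class Tagged {
        final String tag;
        final Object payload;

        Tagged(String tag, Object payload) {
            this.tag = tag;
            this.payload = payload;
        }
    }

    // Partition on the tag so every record of one calculation
    // lands on the same reducer.
    static int partition(Tagged value, int numReducers) {
        return (value.tag.hashCode() & Integer.MAX_VALUE) % numReducers;
    }

    // Map step: each word feeds two calculations with different value types.
    static List<Tagged> map(String line) {
        List<Tagged> out = new ArrayList<>();
        for (String word : line.trim().split("\\s+")) {
            if (word.isEmpty()) {
                continue;
            }
            out.add(new Tagged("wordCount", 1L));
            out.add(new Tagged("longestWord", word));
        }
        return out;
    }

    // Reduce step: aggregate per tag according to the payload type and
    // return one result per tag (a stand-in for one named output each).
    static Map<String, String> reduce(List<Tagged> records) {
        Map<String, Object> state = new LinkedHashMap<>();
        for (Tagged r : records) {
            if (r.payload instanceof Long) {
                state.merge(r.tag, r.payload,
                        (a, b) -> Long.valueOf((Long) a + (Long) b));
            } else if (r.payload instanceof String) {
                state.merge(r.tag, r.payload,
                        (a, b) -> ((String) b).length() > ((String) a).length() ? b : a);
            } else {
                throw new IllegalArgumentException("Unsupported payload for " + r.tag);
            }
        }
        Map<String, String> outputs = new LinkedHashMap<>();
        for (Map.Entry<String, Object> e : state.entrySet()) {
            outputs.put(e.getKey(), String.valueOf(e.getValue()));
        }
        return outputs;
    }

    public static void main(String[] args) {
        System.out.println(reduce(map("combine several calculations")));
    }
}
```

The extra work mentioned above shows up in the type checks inside `reduce`: every new calculation adds a branch, which is the price paid for reading the input only once.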

Cloudera Community

Combining calculations within the same job