Posts: 19
Registered: ‎06-23-2014

Best practices for "Default Number of Reduce Tasks per Job" (mapreduce.job.reduces)?

I've read conflicting advice about the correct value for the "Default Number of Reduce Tasks per Job" (mapreduce.job.reduces) parameter in YARN.

Cloudera Manager's default is listed as "1" - but other documentation claims this value should be set to "99% of reduce capacity" - which, in the case of a 100-node cluster, might be 99.
What is the recommended value for this parameter, on a busy cluster with many jobs running?
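For context, this is the cluster-wide default that individual jobs inherit. A minimal sketch of what that setting looks like as a mapred-site.xml override (shown here with Cloudera Manager's default of 1; adjust the value to your cluster):

```xml
<!-- mapred-site.xml: default number of reduce tasks per job -->
<property>
  <name>mapreduce.job.reduces</name>
  <value>1</value>
</property>
```

Note that any job can override this default, e.g. with `-D mapreduce.job.reduces=N` on the command line or `Job.setNumReduceTasks(n)` in the driver code, so the cluster-wide value mostly matters for jobs that never set it explicitly.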

Re: Best practices for "Default Number of Reduce Tasks per Job" (mapreduce.job.reduces)?

I think the best answer to this question comes from Allen Wittenauer of LinkedIn. He writes:

At LinkedIn, I tend to tell users that their ideal reducer count is the value that gets them closest to:
- A multiple of the block size
- A task time between 5 and 15 minutes
- The fewest output files possible
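As a rough illustration of the first and third rules of thumb (this sketch is not from the thread; `estimate_reducers` is a hypothetical helper, and it assumes you can estimate the job's total reduce output size up front):

```python
import math

def estimate_reducers(output_bytes, block_size=128 * 1024 * 1024):
    """Pick a reducer count so each reducer writes roughly one
    HDFS block: output files come out block-sized, and the job
    creates as few files as that constraint allows."""
    return max(1, math.ceil(output_bytes / block_size))

# e.g. ~10 GB of reduce output with 128 MB blocks -> 80 reducers
print(estimate_reducers(10 * 1024**3))
```

The 5-15 minute task-time rule is then a sanity check on the result: if 80 reducers would each finish in under a minute, the job is probably better off with fewer, larger reduce tasks.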



