Support Questions

Find answers, ask questions, and share your expertise
Announcements
Celebrating as our community reaches 100,000 members! Thank you!

Number of Tasks created for each reducer

avatar
Super Collaborator

For every Reducer certain number of tasks are created.

Can someone explain what is the factor which decides number of tasks to be created for each reducer

1 ACCEPTED SOLUTION

avatar

@Viswa

According to official apache document by default number of reducers is set to 1

You can override this by using the following properties:

For MR1 set mapred.reduce.tasks=N

For MR2 set mapreduce.job.reduces=N

The right number of reduces seems to be 0.95 or 1.75 multiplied by (<no. of nodes> * <no. of maximum containers per node>).

With 0.95 all of the reduces can launch immediately and start transferring map outputs as the maps finish. With 1.75 the faster nodes will finish their first round of reduces and launch a second wave of reduces doing a much better job of load balancing.

Increasing the number of reduces increases the framework overhead, but increases load balancing and lowers the cost of failures.

The scaling factors above are slightly less than whole numbers to reserve a few reduce slots in the framework for speculative-tasks and failed tasks.

Now to understand the number of tasks spawned I would point you to this blog

In MR1, the number of tasks launched per node was specified via the settings mapred.map.tasks.maximum and mapred.reduce.tasks.maximum.

In MR2, one can determine how many concurrent tasks are launched per node by dividing the resources allocated to YARN by the resources allocated to each MapReduce task, and taking the minimum of the two types of resources (memory and CPU).

Specifically, you take the minimum of yarn.nodemanager.resource.memory-mb divided by mapreduce.[map|reduce].memory.mb and yarn.nodemanager.resource.cpu-vcores divided by mapreduce.[map|reduce].cpu.vcores. This will give you the number of tasks that will be spawned per node.

View solution in original post

1 REPLY 1

avatar

@Viswa

According to official apache document by default number of reducers is set to 1

You can override this by using the following properties:

For MR1 set mapred.reduce.tasks=N

For MR2 set mapreduce.job.reduces=N

The right number of reduces seems to be 0.95 or 1.75 multiplied by (<no. of nodes> * <no. of maximum containers per node>).

With 0.95 all of the reduces can launch immediately and start transferring map outputs as the maps finish. With 1.75 the faster nodes will finish their first round of reduces and launch a second wave of reduces doing a much better job of load balancing.

Increasing the number of reduces increases the framework overhead, but increases load balancing and lowers the cost of failures.

The scaling factors above are slightly less than whole numbers to reserve a few reduce slots in the framework for speculative-tasks and failed tasks.

Now to understand the number of tasks spawned I would point you to this blog

In MR1, the number of tasks launched per node was specified via the settings mapred.map.tasks.maximum and mapred.reduce.tasks.maximum.

In MR2, one can determine how many concurrent tasks are launched per node by dividing the resources allocated to YARN by the resources allocated to each MapReduce task, and taking the minimum of the two types of resources (memory and CPU).

Specifically, you take the minimum of yarn.nodemanager.resource.memory-mb divided by mapreduce.[map|reduce].memory.mb and yarn.nodemanager.resource.cpu-vcores divided by mapreduce.[map|reduce].cpu.vcores. This will give you the number of tasks that will be spawned per node.