Created 07-14-2017 06:19 PM
For every job, a certain number of reduce tasks are created.
Can someone explain what factor decides the number of reduce tasks created for each job?
Created 07-14-2017 06:56 PM
According to the official Apache documentation, the number of reducers is set to 1 by default.
You can override this by using the following properties:
For MR1 set mapred.reduce.tasks=N
For MR2 set mapreduce.job.reduces=N
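As a sketch, the MR2 property is typically passed with `-D` on the `hadoop jar` command line. The jar name, driver class, and paths below are placeholders, not from this thread:

```python
# Build a hadoop command line that overrides the reducer count (MR2 syntax).
# "my-job.jar", "MyDriver", and the input/output paths are hypothetical.
num_reduces = 10
cmd = [
    "hadoop", "jar", "my-job.jar", "MyDriver",
    "-D", f"mapreduce.job.reduces={num_reduces}",
    "input/", "output/",
]
print(" ".join(cmd))
```

For MR1, the same pattern applies with `-D mapred.reduce.tasks=N`.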
The right number of reduces seems to be 0.95 or 1.75 multiplied by (<no. of nodes> * <no. of maximum containers per node>).
With 0.95 all of the reduces can launch immediately and start transferring map outputs as the maps finish. With 1.75 the faster nodes will finish their first round of reduces and launch a second wave of reduces doing a much better job of load balancing.
Increasing the number of reduces increases the framework overhead, but improves load balancing and lowers the cost of failures.
The scaling factors above are slightly less than whole numbers to reserve a few reduce slots in the framework for speculative-tasks and failed tasks.
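To make the rule of thumb above concrete, here is the arithmetic for an example cluster (the node and container counts are assumptions, not from the thread):

```python
# Rule-of-thumb reducer counts from the Apache guideline above.
# Assumed cluster: 10 nodes, 8 maximum containers per node.
nodes = 10
max_containers_per_node = 8
slots = nodes * max_containers_per_node  # 80 total reduce slots

# 0.95: all reduces launch immediately in a single wave.
single_wave = int(0.95 * slots)  # -> 76

# 1.75: faster nodes finish early and run a second wave.
two_waves = int(1.75 * slots)    # -> 140

print(single_wave, two_waves)    # prints: 76 140
```

The fractional slack (76 rather than 80, 140 rather than 160) is what leaves room for the speculative and failed-task retries mentioned above.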
Now, to understand the number of tasks spawned per node, I would point you to this blog:
In MR1, the number of tasks launched per node was specified via the settings mapred.tasktracker.map.tasks.maximum and mapred.tasktracker.reduce.tasks.maximum.
In MR2, one can determine how many concurrent tasks are launched per node by dividing the resources allocated to YARN by the resources allocated to each MapReduce task, and taking the minimum of the two types of resources (memory and CPU).
Specifically, you take the minimum of yarn.nodemanager.resource.memory-mb divided by mapreduce.[map|reduce].memory.mb and yarn.nodemanager.resource.cpu-vcores divided by mapreduce.[map|reduce].cpu.vcores. This will give you the number of tasks that will be spawned per node.
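That calculation can be sketched as follows; the NodeManager and task sizes are example values (assumptions), not defaults:

```python
# Concurrent MR2 map tasks per node = min(memory ratio, vcore ratio).
# Example NodeManager resources (assumed values):
yarn_memory_mb = 24576  # yarn.nodemanager.resource.memory-mb
yarn_vcores = 8         # yarn.nodemanager.resource.cpu-vcores

# Example per-task sizes (assumed values):
map_memory_mb = 2048    # mapreduce.map.memory.mb
map_vcores = 1          # mapreduce.map.cpu.vcores

tasks_by_memory = yarn_memory_mb // map_memory_mb  # 12 tasks fit by memory
tasks_by_cpu = yarn_vcores // map_vcores           # 8 tasks fit by CPU

# The scarcer resource (CPU here) caps concurrency on the node.
concurrent_map_tasks = min(tasks_by_memory, tasks_by_cpu)
print(concurrent_map_tasks)  # prints: 8
```

The same formula applies to reduce tasks using mapreduce.reduce.memory.mb and mapreduce.reduce.cpu.vcores.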