Created on 08-13-2018 12:00 PM - edited 09-16-2022 06:35 AM
How can we change the number of Mappers for a MapReduce job?
Created 08-13-2018 12:06 PM
No. The number of map tasks for a given job is driven by the number of input splits: for each input split, a map task is spawned. So we cannot directly change the number of mappers with a configuration setting; we can only change the number of input splits.
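As a rough illustration of that relationship, a driver can ask the input format for its splits before submitting the job; the size of that list is exactly the number of map tasks that will be launched. This is a minimal sketch assuming the new (org.apache.hadoop.mapreduce) API, TextInputFormat, and a placeholder input path /data/input:

import java.util.List;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.InputSplit;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;

public class SplitCount {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "split-count");
        // Placeholder path; point this at your own input directory.
        FileInputFormat.addInputPath(job, new Path("/data/input"));
        // Ask the input format how it would split the input.
        List<InputSplit> splits = new TextInputFormat().getSplits(job);
        // One map task is spawned per split, so this is the mapper count.
        System.out.println("Input splits (= map tasks): " + splits.size());
    }
}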
Created 08-13-2018 12:16 PM
Directly we cannot change the number of mappers for a MapReduce job, but by changing the block size we can increase or decrease the number of mappers.
As we know
Number of input splits = Number of mappers
Example
If we have a 1GB input file and the HDFS block size is 128MB, then the number of input splits is 1024/128 = 8, so 8 mappers are allotted for the job.
If we reduce the block size from 128MB to 64MB, then the 1GB input file will be divided into 1024/64 = 16 input splits, and the number of mappers will also be 16.
The block size can be changed in hdfs-site.xml by changing the value of dfs.block.size:
<property>
<name>dfs.block.size</name>
<value>134217728</value>
</property>
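Note that dfs.block.size (named dfs.blocksize in newer Hadoop releases) only applies to files written after the change; existing files keep the block size they were written with. If only one input file needs a different block size, it can also be set per file when the file is written. A minimal sketch, with /data/input/part-0000 as a placeholder path:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class WriteWithBlockSize {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        // Write this one file with a 64MB block size instead of the cluster default.
        // Arguments: path, overwrite, buffer size, replication factor, block size.
        long blockSize = 64L * 1024 * 1024;
        FSDataOutputStream out = fs.create(
                new Path("/data/input/part-0000"), true, 4096, (short) 3, blockSize);
        out.writeBytes("example record\n");
        out.close();
    }
}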
Created 08-13-2018 12:17 PM
If you want a fixed number of reducers at runtime, you can set it when submitting the MapReduce job on the command line: using “-D mapred.reduce.tasks” with the desired number will spawn that many reducers at runtime. The number of mappers for a MapReduce job is driven by the number of input splits, and input splits depend on the block size. For example, if we have 500MB of data and the HDFS block size is 128MB, then the number of mappers will be approximately 4.
When you run a Hadoop job on the CLI, you can use the -D switch to change the default number of mappers and reducers with settings like (5 mappers, 2 reducers):
-D mapred.map.tasks=5 -D mapred.reduce.tasks=2
Example
bin/hadoop jar yourapp.jar -Dmapreduce.job.maps=5
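For the -D options to reach the job configuration, the driver has to run them through GenericOptionsParser, which is what ToolRunner does for you. A minimal driver sketch assuming the new API (MyDriver and the commented-out mapper/reducer classes are placeholders):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;

public class MyDriver extends Configured implements Tool {
    @Override
    public int run(String[] args) throws Exception {
        // getConf() already contains any -D key=value options from the command line.
        Job job = Job.getInstance(getConf(), "my job");
        job.setJarByClass(MyDriver.class);
        // job.setMapperClass(MyMapper.class);    // your mapper class here
        // job.setReducerClass(MyReducer.class);  // your reducer class here
        // job.setNumReduceTasks(2);              // in-code equivalent of -D mapreduce.job.reduces;
        //                                        // setting it here overrides the CLI value
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        return job.waitForCompletion(true) ? 0 : 1;
    }

    public static void main(String[] args) throws Exception {
        // ToolRunner strips the generic -D options before handing the rest to run().
        System.exit(ToolRunner.run(new Configuration(), new MyDriver(), args));
    }
}

With a driver like that, the generic options go before the application arguments, e.g. bin/hadoop jar yourapp.jar MyDriver -D mapreduce.job.reduces=2 /input /output.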
HTH
Created 08-13-2018 12:59 PM
The split size is calculated by the formula:
max(mapred.min.split.size, min(mapred.max.split.size, dfs.block.size))
Say the HDFS block size is 64MB, mapred.min.split.size is set to 128MB, and mapred.max.split.size is 256MB; then the split size will be 128MB. To read 256MB of data there will be two mappers. To increase the number of mappers, you could decrease mapred.min.split.size down to the HDFS block size.
split size = max(128, min(256, 64)) = 128MB
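To make the arithmetic concrete, here is a minimal sketch of that formula in Java; in the new API the corresponding knobs are mapreduce.input.fileinputformat.split.minsize / split.maxsize, or FileInputFormat.setMinInputSplitSize / setMaxInputSplitSize on the Job:

public class SplitSizeFormula {
    // split size = max(minSplitSize, min(maxSplitSize, blockSize))
    static long computeSplitSize(long minSplitSize, long maxSplitSize, long blockSize) {
        return Math.max(minSplitSize, Math.min(maxSplitSize, blockSize));
    }

    public static void main(String[] args) {
        long mb = 1024L * 1024;
        long min = 128 * mb;   // mapred.min.split.size
        long max = 256 * mb;   // mapred.max.split.size
        long block = 64 * mb;  // dfs.block.size
        long split = computeSplitSize(min, max, block);          // 128MB
        long data = 256 * mb;                                    // total input
        System.out.println("Split size: " + split / mb + "MB");
        System.out.println("Mappers for 256MB: " + (data + split - 1) / split);  // 2
    }
}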