Created 01-28-2016 10:35 AM
While running a MapReduce job in YARN, does each physical block run one map or reduce task? Apologies if my question is not phrased correctly, as I am a newbie.
Created 01-28-2016 10:44 AM
Take a look at this https://hadoop.apache.org/docs/current/hadoop-mapreduce-client/hadoop-mapreduce-client-core/MapReduc...
How Many Maps?
The number of maps is usually driven by the total size of the inputs, that is, the total number of blocks of the input files.
The right level of parallelism for maps seems to be around 10-100 maps per-node, although it has been set up to 300 maps for very cpu-light map tasks. Task setup takes a while, so it is best if the maps take at least a minute to execute.
Thus, if you expect 10TB of input data and have a blocksize of 128MB, you’ll end up with 82,000 maps, unless Configuration.set(MRJobConfig.NUM_MAPS, int) (which only provides a hint to the framework) is used to set it even higher.
Reducer
Reducer reduces a set of intermediate values which share a key to a smaller set of values. The number of reduces for the job is set by the user via Job.setNumReduceTasks(int).
Overall, Reducer implementations are passed the Job for the job via the Job.setReducerClass(Class) method and can override it to initialize themselves. The framework then calls reduce(WritableComparable, Iterable<Writable>, Context) method for each <key, (list of values)> pair in the grouped inputs. Applications can then override the cleanup(Context) method to perform any required cleanup.
Reducer has 3 primary phases: shuffle, sort and reduce.
So yes, you usually get one map task for every block (unless configured differently). The number of reducers is set by the user when the job is submitted.
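To make the API calls quoted above concrete, here is a minimal word-count style driver sketch. The class, job and path names are just illustrative and the choice of 4 reducers is arbitrary; the point is simply where Job.setReducerClass and Job.setNumReduceTasks fit.

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCountDriver {

  // Mapper: emits (word, 1) for every token in a line.
  public static class TokenizerMapper
      extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(LongWritable key, Text value, Context context)
        throws IOException, InterruptedException {
      for (String token : value.toString().split("\\s+")) {
        if (!token.isEmpty()) {
          word.set(token);
          context.write(word, ONE);
        }
      }
    }
  }

  // Reducer: the framework calls reduce() once per key with all grouped values.
  public static class SumReducer
      extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable v : values) {
        sum += v.get();
      }
      context.write(key, new IntWritable(sum));
    }
  }

  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "word count");
    job.setJarByClass(WordCountDriver.class);

    // One map task per input split; by default there is one split per HDFS block.
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));

    job.setMapperClass(TokenizerMapper.class);

    // The reducer class and the number of reduce tasks are chosen by the user.
    job.setReducerClass(SumReducer.class);
    job.setNumReduceTasks(4);

    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);

    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}

Note that nothing here sets the number of map tasks; the framework derives that from the input splits, while the reduce count is fixed explicitly.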
This might also be helpful https://wiki.apache.org/hadoop/HowManyMapsAndReduces
Created 01-28-2016 04:58 PM
Thanks Jonas
Created 01-28-2016 10:47 AM
I suppose you mean HDFS blocks. And the answer is it depends:
Normally MapReduce will start one map task for each block. However, that is not always the case: MapReduce provides CombineFileInputFormat, which merges small files into single tasks. In Pig you can, for example, enable/disable this with pig.splitCombination.
Tez, an alternative framework to MapReduce that is used under Hive and optionally Pig, also merges small files into map tasks; this is controlled by tez.grouping.min-size and tez.grouping.max-size.
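For a plain MapReduce job (outside Pig or Tez), a rough sketch of the same small-file merging idea is to switch the job to CombineTextInputFormat and cap the split size. The class and method names below, other than the Hadoop ones, are made up for illustration, and the exact split-size keys honored can vary by Hadoop version:

import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.CombineTextInputFormat;

public class SmallFileJobSetup {

  // Configure a job so that many small files are packed into fewer map tasks.
  public static void useCombinedSplits(Job job, long maxSplitBytes) {
    // CombineTextInputFormat packs several small files into a single split,
    // so one map task processes many files instead of one per block/file.
    job.setInputFormatClass(CombineTextInputFormat.class);

    // Upper bound on the data packed into one split; this writes
    // mapreduce.input.fileinputformat.split.maxsize into the job configuration.
    CombineTextInputFormat.setMaxInputSplitSize(job, maxSplitBytes);
  }
}

For example, useCombinedSplits(job, 256L * 1024 * 1024) would cap splits at roughly 256 MB.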
For reducers it is more complicated, since they work on map output and not on files. In MapReduce the number of reducers needs to be specified by the user, e.g. -Dmapred.reduce.tasks=x.
However, frameworks like Hive estimate an optimal number of reducers and set it depending on data sizes. For example, Hive (on Tez) uses hive.exec.reducers.bytes.per.reducer to estimate an optimal number of reducers. Pig does something similar; here the parameter is pig.exec.reducers.bytes.per.reducer.
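One caveat: a -D option like that is only picked up if the driver parses Hadoop's generic options, which usually means implementing Tool and launching through ToolRunner. A minimal sketch, with MyDriver as a placeholder name:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;

public class MyDriver extends Configured implements Tool {

  @Override
  public int run(String[] args) throws Exception {
    // getConf() already contains any -D options parsed by ToolRunner,
    // e.g. -Dmapreduce.job.reduces=10 given on the command line.
    Job job = Job.getInstance(getConf(), "my job");
    job.setJarByClass(MyDriver.class);
    // ... mapper, reducer, input/output paths go here ...
    return job.waitForCompletion(true) ? 0 : 1;
  }

  public static void main(String[] args) throws Exception {
    // ToolRunner strips the generic -D options out of args and applies them
    // to the Configuration before calling run().
    System.exit(ToolRunner.run(new Configuration(), new MyDriver(), args));
  }
}

Launched as hadoop jar myjob.jar MyDriver -Dmapreduce.job.reduces=10 <in> <out> (the jar name is just an example), the reduce count lands in the job configuration without any code change.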
Hope that helps.
Created 01-28-2016 04:59 PM
Thanks Benjamin for the valuable input.
Created 01-28-2016 06:19 PM
By default, the number of map tasks in MapReduce is driven by the number of blocks: you get one mapper per block. This can be configured so that a single map task takes in more or less data, using the configs below; in this case we are taking ~1 GB - 1.5 GB into each map task rather than the default block of 128 MB. Reducers can be configured as described in some of the other comments, but there is no benefit in having more reducers than distinct keys emitted from the maps (the extra reducers would receive no data).
mapreduce.input.fileinputformat.split.minsize=1000000000
mapreduce.input.fileinputformat.split.maxsize=1500000000
mapreduce.job.reduces=10
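If you would rather set these from the Java driver than pass raw properties, the equivalent calls look roughly like this (SplitSizeSetup is just a placeholder class, and the Job is assumed to be already created):

import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;

public class SplitSizeSetup {

  // Driver-side equivalents of the three properties listed above.
  public static void applySplitAndReduceSettings(Job job) {
    FileInputFormat.setMinInputSplitSize(job, 1000000000L);  // split.minsize
    FileInputFormat.setMaxInputSplitSize(job, 1500000000L);  // split.maxsize
    job.setNumReduceTasks(10);                               // mapreduce.job.reduces
  }
}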
Created 01-28-2016 06:30 PM
Ah cool, didn't know that one. I suppose this is only honored if you use CombineFileInputFormat and its child classes? I am pretty sure the standard FileInputFormat puts at most one file into a split.
Created 01-30-2016 01:45 PM
Ah that's a good catch. Thanks Joseph
Created 01-29-2016 01:43 AM
no need to apologize, that's what the HCC is for -- besides, you're NOT a newb anymore!!