Created 01-28-2016 10:35 AM
While running a MapReduce job in YARN, does each physical block run one map or reduce task? Apologies if my question is not phrased correctly, as I am a newbie.
Created 01-28-2016 10:44 AM
Take a look at this https://hadoop.apache.org/docs/current/hadoop-mapreduce-client/hadoop-mapreduce-client-core/MapReduc...
How Many Maps?
The number of maps is usually driven by the total size of the inputs, that is, the total number of blocks of the input files.
The right level of parallelism for maps seems to be around 10-100 maps per-node, although it has been set up to 300 maps for very cpu-light map tasks. Task setup takes a while, so it is best if the maps take at least a minute to execute.
Thus, if you expect 10TB of input data and have a blocksize of 128MB, you’ll end up with 82,000 maps, unless Configuration.set(MRJobConfig.NUM_MAPS, int) (which only provides a hint to the framework) is used to set it even higher.
Reducer
Reducer reduces a set of intermediate values which share a key to a smaller set of values. The number of reduces for the job is set by the user via Job.setNumReduceTasks(int).
Overall, Reducer implementations are passed the Job for the job via the Job.setReducerClass(Class) method and can override it to initialize themselves. The framework then calls reduce(WritableComparable, Iterable<Writable>, Context) method for each <key, (list of values)> pair in the grouped inputs. Applications can then override the cleanup(Context) method to perform any required cleanup.
Reducer has 3 primary phases: shuffle, sort and reduce.
So yes, you usually get one map task for every block (unless configured differently). The number of reducers is set by the user when the job is submitted.
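To make the API calls quoted above concrete, here is a minimal word-count style driver sketch. The class, job and path names are just illustrative and the choice of 4 reducers is arbitrary; the point is simply where Job.setReducerClass and Job.setNumReduceTasks fit.

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCountDriver {

  // Mapper: emits (word, 1) for every token in a line.
  public static class TokenizerMapper
      extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(LongWritable key, Text value, Context context)
        throws IOException, InterruptedException {
      for (String token : value.toString().split("\\s+")) {
        if (!token.isEmpty()) {
          word.set(token);
          context.write(word, ONE);
        }
      }
    }
  }

  // Reducer: the framework calls reduce() once per key with all grouped values.
  public static class SumReducer
      extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable v : values) {
        sum += v.get();
      }
      context.write(key, new IntWritable(sum));
    }
  }

  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "word count");
    job.setJarByClass(WordCountDriver.class);

    // One map task per input split; by default there is one split per HDFS block.
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));

    job.setMapperClass(TokenizerMapper.class);

    // The reducer class and the number of reduce tasks are chosen by the user.
    job.setReducerClass(SumReducer.class);
    job.setNumReduceTasks(4);

    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);

    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}

Note that nothing here sets the number of map tasks; the framework derives that from the input splits, while the reduce count is fixed explicitly.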
This might also be helpful https://wiki.apache.org/hadoop/HowManyMapsAndReduces
Created 01-28-2016 04:58 PM
Thanks Jonas
Created 01-28-2016 10:47 AM
I suppose you mean HDFS blocks. And the answer is it depends:
Normally MapReduce will start one map task for each block. However, that is not always the case: MapReduce provides CombineFileInputFormat, which merges small files into single tasks. In Pig you can, for example, enable/disable this with pig.splitCombination.
Tez, an alternative framework to MapReduce that is used under Hive and optionally Pig, also merges small files into map tasks; this is controlled by tez.grouping.min-size and tez.grouping.max-size.
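For a plain MapReduce job (outside Pig or Tez), a rough sketch of the same small-file merging idea is to switch the job to CombineTextInputFormat and cap the split size. The class and method names below, other than the Hadoop ones, are made up for illustration, and the exact split-size keys honored can vary by Hadoop version:

import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.CombineTextInputFormat;

public class SmallFileJobSetup {

  // Configure a job so that many small files are packed into fewer map tasks.
  public static void useCombinedSplits(Job job, long maxSplitBytes) {
    // CombineTextInputFormat packs several small files into a single split,
    // so one map task processes many files instead of one per block/file.
    job.setInputFormatClass(CombineTextInputFormat.class);

    // Upper bound on the data packed into one split; this writes
    // mapreduce.input.fileinputformat.split.maxsize into the job configuration.
    CombineTextInputFormat.setMaxInputSplitSize(job, maxSplitBytes);
  }
}

For example, useCombinedSplits(job, 256L * 1024 * 1024) would cap splits at roughly 256 MB.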
For reducers it is more complicated, since they work on map output and not on files. In MapReduce the number of reducers needs to be specified by the user, e.g. -Dmapred.reduce.tasks=x.
However, frameworks like Hive estimate an optimal number of reducers and set it depending on data sizes. For example, Hive (on Tez) uses hive.exec.reducers.bytes.per.reducer to estimate an optimal number of reducers. Pig does something similar; here the parameter is pig.exec.reducers.bytes.per.reducer.
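One caveat: a -D option like that is only picked up if the driver parses Hadoop's generic options, which usually means implementing Tool and launching through ToolRunner. A minimal sketch, with MyDriver as a placeholder name:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;

public class MyDriver extends Configured implements Tool {

  @Override
  public int run(String[] args) throws Exception {
    // getConf() already contains any -D options parsed by ToolRunner,
    // e.g. -Dmapreduce.job.reduces=10 given on the command line.
    Job job = Job.getInstance(getConf(), "my job");
    job.setJarByClass(MyDriver.class);
    // ... mapper, reducer, input/output paths go here ...
    return job.waitForCompletion(true) ? 0 : 1;
  }

  public static void main(String[] args) throws Exception {
    // ToolRunner strips the generic -D options out of args and applies them
    // to the Configuration before calling run().
    System.exit(ToolRunner.run(new Configuration(), new MyDriver(), args));
  }
}

Launched as hadoop jar myjob.jar MyDriver -Dmapreduce.job.reduces=10 <in> <out> (the jar name is just an example), the reduce count lands in the job configuration without any code change.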
Hope that helps.
Created 01-28-2016 04:59 PM
Thanks Benjamin for the valuable input.
Created 01-28-2016 06:19 PM
By default, the number of map tasks in MapReduce is driven by the number of blocks: you get one mapper per block. This can be configured so that a single map task takes in more or less data, using the configs below; in this case we are taking ~1 GB - 1.5 GB into each map task rather than the default block of 128 MB. Reducers can be configured as described in some of the other comments, but there is no benefit in having more reducers than distinct keys emitted from the maps (the extra reducers would receive no data).
mapreduce.input.fileinputformat.split.minsize=1000000000
mapreduce.input.fileinputformat.split.maxsize=1500000000
mapreduce.job.reduces=10
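If you would rather set these from the Java driver than pass raw properties, the equivalent calls look roughly like this (SplitSizeSetup is just a placeholder class, and the Job is assumed to be already created):

import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;

public class SplitSizeSetup {

  // Driver-side equivalents of the three properties listed above.
  public static void applySplitAndReduceSettings(Job job) {
    FileInputFormat.setMinInputSplitSize(job, 1000000000L);  // split.minsize
    FileInputFormat.setMaxInputSplitSize(job, 1500000000L);  // split.maxsize
    job.setNumReduceTasks(10);                               // mapreduce.job.reduces
  }
}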
Created 01-28-2016 06:30 PM
Ah cool, didn't know that one. I suppose this is only honored if you use CombineFileInputFormat and its child classes? I am pretty sure the standard FileInputFormat puts at most one file into a split.
Created 01-30-2016 01:45 PM
Ah that's a good catch. Thanks Joseph
Created 01-29-2016 01:43 AM
no need to apologize, that's what the HCC is for -- besides, you're NOT a newb anymore!!