MapReduce FAQ

by Community Manager on ‎08-20-2015 01:22 PM

Q: When a Mapper processes the intermediate output data, how are the number and size of the partitions (reducers) calculated?
A: The number of partitions comes from the user-defined value in the job configuration, which the partitioner receives as numPartitions. One reducer is used for each partition. In the 1.x code, see the constructors of the internal classes OldOutputCollector (Stable API) and NewOutputCollector (New API). The amount of data estimated to go into a partition, which is used for limit and scheduling checks, is currently a naive computation that sums the estimated output sizes of each map.



  • Note: See ResourceEstimator#getEstimatedReduceInputSize for the overall estimation across maps, and see Task#calculateOutputSize for the per-map estimation code.
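
As a rough illustration of the two ideas above, the following sketch assigns keys to partitions by hashing modulo the configured reducer count and adds up per-map output estimates. It is a toy model in Python, not Hadoop code: the function names `partition_for` and `estimate_total_map_output`, and the choice of CRC32 as the hash, are assumptions made for this example. The real logic lives in the classes named in the note.

```python
import zlib
from typing import List


def partition_for(key: str, num_partitions: int) -> int:
    """Map a key to a partition index in the range [0, num_partitions)."""
    if num_partitions <= 0:
        raise ValueError("num_partitions must be positive")
    # Assumption: CRC32 is used here only because it is stable across runs.
    # It is not the hash function that Hadoop's partitioners use.
    return zlib.crc32(key.encode("utf-8")) % num_partitions


def estimate_total_map_output(per_map_estimates: List[int]) -> int:
    """Naive estimate: sum the estimated output size (bytes) of every map."""
    return sum(per_map_estimates)


if __name__ == "__main__":
    num_reducers = 4  # stands in for the user-defined reducer count
    for key in ["apple", "banana", "cherry"]:
        print(key, "-> partition", partition_for(key, num_reducers))
    print("estimated map output:", estimate_total_map_output([120, 80, 100]))
```

Every key lands in exactly one partition, and the same key always lands in the same partition, which is what lets a single reducer see all values for that key.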

Q: When a JobTracker assigns a task to a reducer, does it also specify the location of the intermediate data?
A: No. The JobTracker does not send location information when a reduce is scheduled. When the reducers begin the shuffle phase, they query the TaskTracker for the map completion events via the TaskTracker#getMapCompletionEvents protocol call. The TaskTracker in turn obtains this information through the JobTracker#getTaskCompletionEvents protocol call. The returned structure carries the host on which each map completed successfully, and the reducer's copier relies on it to fetch the data from the TaskTracker on the correct host.

Q: How does a reducer determine which portion of the remote intermediate output data to retrieve?
A: The reducer asks each TaskTracker that holds completed map output for the portion of that output assigned to it.



  • Note: A reducer's task ID is also its partition ID, so the reducer asks each TaskTracker for the intermediate data marked with its own task ID number, which is then sent over HTTP.
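
The shuffle lookup described in the last two answers can be sketched as follows. This is a simplified Python model under stated assumptions: `MapCompletionEvent` and `build_fetch_plan` are hypothetical names invented for this example, and the actual HTTP request format is deliberately left out. The sketch only shows that a reducer keeps the successful completion events and asks each host for the output belonging to its own partition ID.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass(frozen=True)
class MapCompletionEvent:
    """Hypothetical stand-in for one map completion event."""
    map_task_id: str
    host: str
    succeeded: bool


def build_fetch_plan(
    events: List[MapCompletionEvent], partition_id: int
) -> Dict[str, List[Tuple[str, int]]]:
    """Group (map_task_id, partition_id) fetch requests by host.

    Only maps that completed successfully are included.
    """
    plan: Dict[str, List[Tuple[str, int]]] = {}
    for event in events:
        if not event.succeeded:
            continue  # nothing to fetch from a failed attempt
        plan.setdefault(event.host, []).append((event.map_task_id, partition_id))
    return plan


if __name__ == "__main__":
    events = [
        MapCompletionEvent("map_0", "node1", True),
        MapCompletionEvent("map_1", "node2", True),
        MapCompletionEvent("map_2", "node1", True),
        MapCompletionEvent("map_3", "node3", False),
    ]
    # The reducer's task ID doubles as its partition ID (2 in this example).
    for host, requests in build_fetch_plan(events, partition_id=2).items():
        print(host, requests)
```

Grouping by host reflects that the reducer contacts each TaskTracker separately. The failed attempt on node3 produces no request.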

Q: How many reducers should I use for my job?
A: The answer depends on the job. In general, too many reducers create many small output files, which increases the time spent in I/O, while fewer reducers create fewer but larger output files. A good rule of thumb is to tune the number of reducers so that each output file is at least half a block in size. The right number of reducers seems to be 0.95 or 1.75 multiplied by (nodes * mapred.tasktracker.tasks.maximum). With a multiplier of 0.95, all of the reducers can launch immediately and start transferring map outputs as the maps finish. With 1.75, the faster nodes finish their first round of reduces and launch a second round, which does a much better job of load balancing.
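
The two rules of thumb above reduce to simple arithmetic, sketched here in Python. The function names and the example cluster figures (10 nodes, 2 task slots per node, 10 GiB of output, 128 MiB blocks) are assumptions chosen for illustration. Treat the results as starting points for tuning rather than fixed answers.

```python
import math
from fractions import Fraction


def suggested_reducers(nodes: int, task_slots_per_node: int, multiplier: float) -> int:
    """Return multiplier * (nodes * task slots per node), rounded down.

    task_slots_per_node stands in for the per-TaskTracker task maximum
    used in the formula above. Fraction avoids float rounding surprises.
    """
    if nodes <= 0 or task_slots_per_node <= 0:
        raise ValueError("nodes and task_slots_per_node must be positive")
    total_slots = nodes * task_slots_per_node
    return max(1, math.floor(Fraction(str(multiplier)) * total_slots))


def max_reducers_for_output(total_output_bytes: int, block_size_bytes: int) -> int:
    """Largest reducer count that keeps each output file >= half a block,
    assuming output is spread evenly across reducers."""
    half_block = block_size_bytes // 2
    return max(1, total_output_bytes // half_block)


if __name__ == "__main__":
    print(suggested_reducers(10, 2, 0.95))  # 19: a single wave of reduces
    print(suggested_reducers(10, 2, 1.75))  # 35: roughly two waves
    gib, mib = 1024 ** 3, 1024 ** 2
    print(max_reducers_for_output(10 * gib, 128 * mib))  # 160
```

If the slot-based figure exceeds the output-size ceiling, the job is likely to produce files smaller than half a block, so the lower of the two numbers is the safer choice.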

Q: How can more resources be allocated to a running MapReduce job?
A: There isn't a systematic or elegant way to add resources to a running job.
The inelegant (and untested) alternative is to shut down the TaskTrackers one at a time, update their configuration files to add slots, and start them again. This process will cause some tasks to fail, and it could cause the JobTracker to decide that map outputs on a given TaskTracker can't be fetched and to re-run those maps elsewhere.

Q: How do I manage MapReduce and HDFS on nodes with different disk sizes?
A: In clusters containing nodes with different disk sizes, MapReduce can spill to disk without bounds.
If MapReduce jobs are not tuned correctly, they can consume all of the space that was capacity-planned for HDFS.
The solution is to bound MapReduce to a fixed volume of disk space. Using separate partitions for MapReduce ensures that spills to disk do not overrun the HDFS disks. One approach is to configure MapReduce to use a dedicated MapReduce file system with a fixed size from each disk pool. If the fixed size is too big, reset the quota and return that space to HDFS, so that HDFS remains safe even if user jobs compete for disk space.

  • Note: Non-pooled storage lacks flexibility, but the concept is the same.

Q: What are the steps to enable MapReduce HA in CDH 5.x?
A: Please see

Q: What is the role of an MR gateway?
A: Gateway roles are typically used in clusters to identify specific nodes that should receive a particular service's client configuration. For example, a node that has a MapReduce gateway will have the MapReduce client configuration settings deployed in its /etc/hadoop/conf location. Note, however, that nodes that already have a MapReduce service role configured (i.e. JT or TT) will automatically get their /etc/hadoop/conf locations updated, regardless of whether they also have a gateway role.

Gateway roles matter when a node does not run any MapReduce service but still needs the MapReduce client configuration. A typical case is an "edge node" from which jobs are submitted.

Q: If an MR gateway is not configured in the MapReduce service, will there be any issues?
A: Please see the above response. If the node submitting MapReduce jobs already has a JT or TT role running on it, then a MapReduce gateway role is not needed. However, if no MapReduce services are running on that node, then a MapReduce gateway role will need to be added to it.

Q: Our cluster is configured with JT (MR) HA. Will there be any issues with upgrading to a newer version of CDH?
A: As long as our update documentation is followed, there should be no problems when upgrading.
NOTE: This article was taken from our internal Knowledge Base.  To access the original article please use the following link (customer login required):