Q: When a Mapper processes the intermediate output data, how are the number and size of the partitions (reducers) calculated?
A: The partitioner uses the number of partitions the user defines in the job configuration (numPartitions); one reducer runs per partition. In the 1.x code, see MapTask.java, in the constructors of the internal classes OldOutputCollector (stable API) and NewOutputCollector (new API). The amount of data expected in each partition, used for limit/scheduling checks, is currently a naive computation: the estimated output sizes of the individual maps are summed.
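The default partitioner in both APIs is HashPartitioner, which routes each key with the formula below; the formula matches Hadoop's HashPartitioner, while the demo class and its sample keys are illustrative:

```java
// Sketch of Hadoop's default HashPartitioner logic: a key goes to partition
// (hashCode & Integer.MAX_VALUE) % numReduceTasks, so every mapper sends the
// same key to the same reducer. The demo harness here is not Hadoop code.
public class PartitionDemo {
    // Same formula as org.apache.hadoop.mapreduce.lib.partition.HashPartitioner
    static int getPartition(Object key, int numReduceTasks) {
        // Mask off the sign bit so the result is never negative
        return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
    }

    public static void main(String[] args) {
        int numReduceTasks = 4; // the user-defined partition count
        String[] keys = {"hadoop", "hdfs", "mapreduce", "hadoop"};
        for (String k : keys) {
            System.out.println(k + " -> partition " + getPartition(k, numReduceTasks));
        }
    }
}
```

Because the partition depends only on the key's hash, identical keys always land in the same partition regardless of which map task emitted them.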
Q: When a JobTracker assigns a task to a reducer does it also specify the location of the intermediate data?
A: The JobTracker does not send location information when a reduce task is scheduled. When the reducers begin the shuffle phase, they query their TaskTracker for map completion events via the TaskTracker#getMapCompletionEvents protocol call. Underneath, the TaskTracker itself calls JobTracker#getTaskCompletionEvents to obtain this information. The returned structure carries the host that successfully completed each map, which the reducer's copier uses to fetch the data from the correct host's TaskTracker.
Q: How does a reducer determine which portion of the remote intermediate output data to retrieve?
A: Each reduce task is assigned a partition number. The reducer asks the TaskTracker of each completed map for the map output segment belonging to its own partition; it never fetches the segments destined for other reducers.
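The fetch pattern can be modeled in a few lines; this is a toy simulation (the class and names are ours, not Hadoop's), where every map produces R segments and reducer r collects only segment r from each map:

```java
import java.util.ArrayList;
import java.util.List;

// Toy model of the shuffle's fetch pattern: each map writes one segment per
// partition, and reducer r pulls only segment r from every completed map --
// never a whole map output.
public class ShuffleFetchDemo {
    // mapOutputs[m][r] = the segment map m wrote for partition r
    static List<String> fetchForReducer(String[][] mapOutputs, int r) {
        List<String> fetched = new ArrayList<>();
        for (String[] segments : mapOutputs) {
            fetched.add(segments[r]); // only this reducer's partition
        }
        return fetched;
    }

    public static void main(String[] args) {
        String[][] mapOutputs = {
            {"m0/part-0", "m0/part-1"},  // map 0, R = 2 partitions
            {"m1/part-0", "m1/part-1"},  // map 1
        };
        // Reducer 1 fetches m0/part-1 and m1/part-1
        System.out.println(fetchForReducer(mapOutputs, 1));
    }
}
```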
Q: How many reducers should I use for my job?
A: The answer depends on the job, but in general: too many reducers create many small output files and more time spent in I/O, while a lower number of reducers creates fewer, larger output files. A good rule of thumb is to tune the number of reducers so that each output file is at least half a block in size. The right number of reducers seems to be either 0.95 or 1.75 multiplied by (nodes * mapred.tasktracker.reduce.tasks.maximum). At a multiplier of 0.95, all of the reducers can launch immediately and start transferring map outputs as the maps finish. At 1.75, the faster nodes finish their first round of reduces and launch a second round, which does a much better job of load balancing.
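As a worked example of the rule of thumb (the cluster size and slot count below are made up for illustration):

```java
// Worked example of the reducer-count rule of thumb. Assume a hypothetical
// cluster of 10 nodes with 2 reduce slots per TaskTracker
// (mapred.tasktracker.reduce.tasks.maximum = 2), i.e. 20 reduce slots total.
public class ReducerCountDemo {
    static int suggested(int nodes, int slotsPerNode, double factor) {
        return (int) (nodes * slotsPerNode * factor);
    }

    public static void main(String[] args) {
        int nodes = 10, slotsPerNode = 2;
        // 0.95 * 20 = 19 reducers: one wave, all launch immediately
        System.out.println("one wave:  " + suggested(nodes, slotsPerNode, 0.95));
        // 1.75 * 20 = 35 reducers: faster nodes run a second wave
        System.out.println("two waves: " + suggested(nodes, slotsPerNode, 1.75));
    }
}
```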
Q: How can more resources be allocated to a running MapReduce job?
A: There isn't a systematic or elegant way to add resources to a running job.
The inelegant (and untested) alternative is to shut down the TaskTrackers one at a time, update the configuration files to add slots, and start them again. This process will cause some tasks to fail, and could cause the JobTracker to decide that map outputs on a given TaskTracker can't be fetched and to re-run those maps elsewhere.
Q: How do I manage MapReduce and HDFS on nodes with different disk sizes?
A: On clusters whose nodes have different disk sizes, MapReduce can spill to disk without bounds.
If MapReduce jobs are not tuned correctly, they can consume all the space that was capacity-planned for HDFS.
The solution is to bound MapReduce to a fixed amount of disk space. Using separate partitions for MapReduce ensures spills to disk cannot overrun the HDFS disks: configure MapReduce to use a dedicated file system with a fixed size carved out of each disk pool. If the fixed size turns out to be too big, shrink it and return that space to HDFS, so that even when user jobs compete for disk space, HDFS remains safe.
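One way to express this in Hadoop 1.x is to point mapred.local.dir (a real 1.x property holding a comma-separated list of spill directories) at the dedicated partitions, so MapReduce writes never land on the HDFS data volumes; the mount points below are illustrative:

```xml
<!-- mapred-site.xml: confine spill space to dedicated, fixed-size partitions -->
<property>
  <name>mapred.local.dir</name>
  <value>/mnt/mr-local-1,/mnt/mr-local-2</value>
</property>
```

If those partitions fill up, MapReduce jobs fail with out-of-space errors, but the HDFS volumes are untouched.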