Support Questions


Newbie Question on MapReduce

Expert Contributor

I am reading this article 

 

http://hadoop.apache.org/docs/r1.2.1/mapred_tutorial.html

 

I am having problems in visualizing how this code will execute in a distributed environment.

 

So suppose I package this code as a JAR and execute it on a Hadoop cluster. Below is my understanding of things, along with my doubts and questions:

 

1. First the run method will be called, which will set up the JobConf object and submit the job. (Which machine will the main method execute on? The JobTracker node? A TaskTracker node?)

 

2. Now suppose a machine is randomly chosen to run the main method. My understanding is that this JAR file will be serialized and sent to a few machines running TaskTrackers, where the map function will run first. For this, the input file will be split and the fragments serialized to the nodes running the map tasks. (Question: does Hadoop persist these splits on HDFS as well, or are the splits kept in memory?)

 

3. The map function will create key-value pairs and sort them as well. (Question: does Hadoop persist the output of the map functions to HDFS before handing it off to the reduce processes?)

 

4. Now Hadoop will start reduce processes across the cluster to run the reduce code. This code will be given the output of the map tasks.

 

5. My biggest confusion is what happens after each reduce has run and we have output from each reduce process: how do we then merge those outputs into the final output?

 

So, for example, if we were calculating the value of pi (there is a sample for that), how is the final value calculated from the output of the different reduce tasks?

 

Sorry if this question is very basic or very broad... I am just trying to learn stuff.

1 ACCEPTED SOLUTION

Mentor
(1) The "driver" part, the run/main code that sets up and submits a job, executes on whatever machine you invoke it from. It does not execute remotely.

(2) See (1), since it invalidates the supposition that a machine is randomly chosen to run the main method. For the actual Map and Reduce code execution, however, the point is true.

(3) This is true as well.

(4) This is incorrect. All data received by the output "collector" is stored to disk (in MR-provided storage termed 'intermediate storage', which lives on local disk rather than HDFS) after it runs through the partitioner (which divides the records into individual local files, one per target reducer) and the sorter (which quick-sorts each individual partition segment).
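The partitioning step can be sketched in plain Java. This standalone sketch mirrors the logic of Hadoop's default HashPartitioner (it is not the actual Hadoop class); the bitmask keeps the index non-negative even for keys whose hashCode is negative:

```java
// Standalone sketch: which reducer (partition) does a map output key go to?
// Mirrors the logic of Hadoop's default HashPartitioner.
public class PartitionSketch {
    static int getPartition(String key, int numReduceTasks) {
        // Mask the sign bit so negative hashCodes still give a valid index.
        return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
    }

    public static void main(String[] args) {
        int reducers = 4;
        for (String key : new String[] {"apple", "banana", "cherry"}) {
            System.out.println(key + " -> reducer " + getPartition(key, reducers));
        }
    }
}
```

Every map task applies the same function, so all records with the same key land in the same reducer's partition file no matter which map produced them.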

(5) Functionally true, but it is actually the reduce that "pulls" the map outputs stored across the cluster, rather than something pushing the data to the reducers. Each reducer fetches its specific partition file from every executed map that produced one, and merge-sorts all these segments before invoking the user's reduce(…) function. The merge sorter does not require the entire set of segments to fit into memory at once; it does the work in phases if it does not have adequate memory.
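The reduce-side merge can be illustrated with a small standalone sketch (plain Java, not Hadoop's actual merger). Each map has already produced a sorted segment for this reducer; a k-way merge over the segment heads yields one globally sorted stream, so reduce(…) sees all values for a key together:

```java
import java.util.*;

// Standalone sketch of the reduce-side merge: each map task produced a
// sorted segment for this reducer; a k-way merge (priority queue holding
// the head of each segment) emits one globally sorted stream.
public class MergeSketch {
    static List<String> mergeSegments(List<List<String>> segments) {
        // Queue entries are {segment index, position within that segment},
        // ordered by the key at that position.
        PriorityQueue<int[]> heads = new PriorityQueue<>(
            Comparator.comparing((int[] h) -> segments.get(h[0]).get(h[1])));
        for (int i = 0; i < segments.size(); i++) {
            if (!segments.get(i).isEmpty()) heads.add(new int[] {i, 0});
        }
        List<String> merged = new ArrayList<>();
        while (!heads.isEmpty()) {
            int[] h = heads.poll();
            merged.add(segments.get(h[0]).get(h[1]));
            if (h[1] + 1 < segments.get(h[0]).size()) {
                heads.add(new int[] {h[0], h[1] + 1});  // advance that segment
            }
        }
        return merged;
    }

    public static void main(String[] args) {
        // Two sorted segments, as if fetched from two map tasks.
        List<List<String>> segs = Arrays.asList(
            Arrays.asList("apple", "cherry"),
            Arrays.asList("banana", "date"));
        System.out.println(mergeSegments(segs));
        // prints [apple, banana, cherry, date]
    }
}
```

Hadoop's real merger streams segments from disk in phases rather than holding full lists in memory, which is why the segments need not all fit at once.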

However, if the entire fetched output does not fit on the allotted disk of the reduce task's host, the reduce task will fail. We try, approximately, not to schedule reduces on such a host, but if no host can fit the aggregate data, the natural solution is to increase the number of reducers (partitions) so as to divide up the amount of data received per reduce task.

