Created on 07-26-2014 09:44 PM - edited 09-16-2022 02:03 AM
I am reading this article: http://hadoop.apache.org/docs/r1.2.1/mapred_tutorial.html
I am having trouble visualizing how this code executes in a distributed environment.
Suppose I package this code into a jar and run it on a Hadoop cluster. Below is my understanding of what happens, along with my doubts and questions:
1. First the main method will be called, which will set up the JobConf object and submit the job. (Which machine does the main method execute on? The JobTracker node? A TaskTracker node?) The driver I mean is the one in the WordCount sketch after this list.
2. Now suppose a machine is chosen to run the main method. My understanding is that the jar file will be serialized and sent to the machines running TaskTrackers, where the map function will run first. For this, the input file will be split and the fragments assigned to the nodes running the map tasks. (Question: does Hadoop persist these split files on HDFS as well, or do the splits exist only in memory? My guess at what a split really is appears in the FileSplit sketch after this list.)
3. The map function will create key/value pairs, and the output will be sorted as well. (Question: does Hadoop persist the output of the map functions to HDFS before handing it off to the reduce processes?)
4. Now Hadoop will start reduce processes across the cluster to run the reduce code. This code will be given the output of the map tasks.
5. My biggest confusion is what happens after each reduce task has run and we have output from each reduce process: how do we then merge those outputs into the final output? (My current guess is in the last sketch after this list.)
So for example, if we were calculating the value of pi (there is a sample job for that), how is the final value calculated from the output of the different reduce tasks?
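For reference, here is the driver plus Map and Reduce I am talking about, adapted from the tutorial's WordCount v1.0 example. The comments marking my list points are mine:

import java.io.IOException;
import java.util.*;

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.conf.*;
import org.apache.hadoop.io.*;
import org.apache.hadoop.mapred.*;

public class WordCount {

  // Points 2/3 of my list: runs as map tasks on the TaskTracker
  // nodes, one task per input split; emits (word, 1) pairs.
  public static class Map extends MapReduceBase
      implements Mapper<LongWritable, Text, Text, IntWritable> {
    private final static IntWritable one = new IntWritable(1);
    private Text word = new Text();

    public void map(LongWritable key, Text value,
        OutputCollector<Text, IntWritable> output, Reporter reporter)
        throws IOException {
      StringTokenizer tokenizer = new StringTokenizer(value.toString());
      while (tokenizer.hasMoreTokens()) {
        word.set(tokenizer.nextToken());
        output.collect(word, one);
      }
    }
  }

  // Point 4: runs as reduce tasks; receives all values for one key.
  public static class Reduce extends MapReduceBase
      implements Reducer<Text, IntWritable, Text, IntWritable> {
    public void reduce(Text key, Iterator<IntWritable> values,
        OutputCollector<Text, IntWritable> output, Reporter reporter)
        throws IOException {
      int sum = 0;
      while (values.hasNext()) {
        sum += values.next().get();
      }
      output.collect(key, new IntWritable(sum));
    }
  }

  // Point 1: the driver's main method, which sets up the JobConf
  // and submits the job.
  public static void main(String[] args) throws Exception {
    JobConf conf = new JobConf(WordCount.class);
    conf.setJobName("wordcount");

    conf.setOutputKeyClass(Text.class);
    conf.setOutputValueClass(IntWritable.class);

    conf.setMapperClass(Map.class);
    conf.setCombinerClass(Reduce.class);
    conf.setReducerClass(Reduce.class);

    conf.setInputFormat(TextInputFormat.class);
    conf.setOutputFormat(TextOutputFormat.class);

    FileInputFormat.setInputPaths(conf, new Path(args[0]));
    FileOutputFormat.setOutputPath(conf, new Path(args[1]));

    JobClient.runJob(conf); // submits the job and waits for completion
  }
}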
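And here is my guess at what an input split actually is for point 2: just metadata pointing at a byte range of a file already in HDFS, not a copy of the data. The path and hostname below are made-up placeholders:

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapred.FileSplit;

public class SplitSketch {
  public static void main(String[] args) {
    // My mental model: a split is only (file, offset, length, hosts),
    // i.e. a pointer into data that already sits in HDFS blocks.
    FileSplit split = new FileSplit(
        new Path("/user/me/input/file.txt"),
        0L,                                 // start offset in the file
        64L * 1024 * 1024,                  // length, e.g. one 64 MB block
        new String[] { "datanode1.example.com" });
    System.out.println(split);              // prints file:start+length
  }
}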
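On point 5, my current guess is that each reduce task writes its own part-NNNNN file into the job's output directory, and any final combining has to happen outside the reduce phase, e.g. the driver reads the part files back (or something like hadoop fs -getmerge concatenates them). A sketch of what I mean, with the output path taken from the command line and the aggregation left as a placeholder:

import java.io.BufferedReader;
import java.io.InputStreamReader;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class CollectParts {
  public static void main(String[] args) throws Exception {
    FileSystem fs = FileSystem.get(new Configuration());

    // Each reduce task writes its own file (part-00000, part-00001, ...)
    // into the job's output directory; nothing merges them automatically.
    for (FileStatus status : fs.listStatus(new Path(args[0]))) {
      if (!status.getPath().getName().startsWith("part-")) {
        continue;
      }
      BufferedReader reader = new BufferedReader(
          new InputStreamReader(fs.open(status.getPath())));
      String line;
      while ((line = reader.readLine()) != null) {
        // Placeholder: this is where the driver would aggregate the
        // per-reducer results into one final value, which is what I
        // assume the pi sample does.
        System.out.println(line);
      }
      reader.close();
    }
  }
}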
Sorry if this question is very basic or very broad... I am just trying to learn this stuff.