<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>question Re: NewBee Question on Map reduce in Archives of Support Questions (Read Only)</title>
    <link>https://community.cloudera.com/t5/Archives-of-Support-Questions/NewBee-Question-on-Map-reduce/m-p/16040#M2383</link>
    <description>(1) The "driver" part of the run/main code that sets up and submits a job&lt;BR /&gt;executes where you invoke it. It does not execute remotely.&lt;BR /&gt;&lt;BR /&gt;(2) See (1), as it invalidates the supposition. But for the actual&lt;BR /&gt;Map and Reduce code execution, the point is true.&lt;BR /&gt;&lt;BR /&gt;(3) This is true as well.&lt;BR /&gt;&lt;BR /&gt;(4) This is incorrect. All data received by the output "collector" is stored&lt;BR /&gt;to disk (in MR-provided storage termed 'intermediate storage')&lt;BR /&gt;after it runs through the partitioner (which divides it into&lt;BR /&gt;individual local files, one per target reducer) and the&lt;BR /&gt;sorter (which runs quick sorts over the individual partition&lt;BR /&gt;segments).&lt;BR /&gt;&lt;BR /&gt;(5) Functionally true, but it is actually the Reduce that "pulls" the&lt;BR /&gt;map outputs stored across the cluster, rather than something sending&lt;BR /&gt;the data to the reducers (i.e. push). The reducer fetches its specific&lt;BR /&gt;partition file from every executed map that produced one, and&lt;BR /&gt;merge-sorts all these segments before invoking the user's&lt;BR /&gt;reduce(…) function. The merge sorter does not require the entire&lt;BR /&gt;set of segments to fit into memory at once - it does the work in phases&lt;BR /&gt;if it does not have adequate memory.&lt;BR /&gt;&lt;BR /&gt;However, if the entire fetched output does not fit into the allotted&lt;BR /&gt;disk of the reduce task host, the reduce task will fail. We try&lt;BR /&gt;to estimate this and avoid scheduling reduces on such a host, but if no host&lt;BR /&gt;can fit the aggregate data, then you will likely want to increase the&lt;BR /&gt;number of reducers (partitions) to divide up the amount of data&lt;BR /&gt;received per reduce task as a natural solution.&lt;BR /&gt;&lt;BR /&gt;</description>
    <pubDate>Sun, 27 Jul 2014 13:43:10 GMT</pubDate>
    <dc:creator>Harsh J</dc:creator>
    <dc:date>2014-07-27T13:43:10Z</dc:date>
    <item>
      <title>NewBee Question on Map reduce</title>
      <link>https://community.cloudera.com/t5/Archives-of-Support-Questions/NewBee-Question-on-Map-reduce/m-p/16026#M2382</link>
      <description>&lt;P&gt;I am reading this article&amp;nbsp;&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;&lt;A target="_blank" href="http://hadoop.apache.org/docs/r1.2.1/mapred_tutorial.html"&gt;http://hadoop.apache.org/docs/r1.2.1/mapred_tutorial.html&lt;/A&gt;&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;I am having trouble visualizing how this code will execute in a distributed environment.&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;So suppose I package this code into a jar and execute it on a hadoop cluster. Below is my understanding of things, along with my doubts and questions.&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;1. First the run method will be called, which will set up the JobConf object and run the code. (Which machine will the main method execute on? The job tracker node? A task tracker node?)&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;2. Now suppose a machine is randomly chosen to run the main method. My understanding is that this JAR file will be serialized and sent to a few machines running task trackers, where the map function will be run first. For this, the input file will be split and the fragments serialized to the nodes running the map tasks. (Question here: does hadoop persist these split files on HDFS as well... or are the splits in memory?)&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;3. The map function will create key-value pairs and will sort them as well. (Question here: does hadoop persist the output of the map functions to HDFS before handing it off to the reduce processes?)&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;&lt;SPAN style="line-height: 14px;"&gt;4. Now hadoop will start reduce processes across the cluster to run the reduce code. This code will be given the output of the map tasks.&lt;/SPAN&gt;&lt;/P&gt;&lt;P&gt;&lt;SPAN style="line-height: 14px;"&gt;&amp;nbsp;&lt;/SPAN&gt;&lt;/P&gt;&lt;P&gt;&lt;SPAN style="line-height: 14px;"&gt;5. My biggest confusion is that after each reduce has run and we have output from each reduce process, how do we then merge those outputs into the final output?&lt;/SPAN&gt;&lt;/P&gt;&lt;P&gt;&lt;SPAN style="line-height: 14px;"&gt;&amp;nbsp;&lt;/SPAN&gt;&lt;/P&gt;&lt;P&gt;&lt;SPAN style="line-height: 14px;"&gt;So for example, if we were calculating the value of pi (there is a sample for that)... how is the final value calculated from the output of the different reduce tasks?&lt;/SPAN&gt;&lt;/P&gt;&lt;P&gt;&lt;SPAN style="line-height: 14px;"&gt;&amp;nbsp;&lt;/SPAN&gt;&lt;/P&gt;&lt;P&gt;&lt;SPAN style="line-height: 14px;"&gt;Sorry if this question is very basic or very broad... I am just trying to learn.&lt;/SPAN&gt;&lt;/P&gt;</description>
      <pubDate>Fri, 16 Sep 2022 09:03:19 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Archives-of-Support-Questions/NewBee-Question-on-Map-reduce/m-p/16026#M2382</guid>
      <dc:creator>abhishes</dc:creator>
      <dc:date>2022-09-16T09:03:19Z</dc:date>
    </item>
    <item>
      <title>Re: NewBee Question on Map reduce</title>
      <link>https://community.cloudera.com/t5/Archives-of-Support-Questions/NewBee-Question-on-Map-reduce/m-p/16040#M2383</link>
      <description>(1) The "driver" part of the run/main code that sets up and submits a job&lt;BR /&gt;executes where you invoke it. It does not execute remotely.&lt;BR /&gt;&lt;BR /&gt;(2) See (1), as it invalidates the supposition. But for the actual&lt;BR /&gt;Map and Reduce code execution, the point is true.&lt;BR /&gt;&lt;BR /&gt;(3) This is true as well.&lt;BR /&gt;&lt;BR /&gt;(4) This is incorrect. All data received by the output "collector" is stored&lt;BR /&gt;to disk (in MR-provided storage termed 'intermediate storage')&lt;BR /&gt;after it runs through the partitioner (which divides it into&lt;BR /&gt;individual local files, one per target reducer) and the&lt;BR /&gt;sorter (which runs quick sorts over the individual partition&lt;BR /&gt;segments).&lt;BR /&gt;&lt;BR /&gt;(5) Functionally true, but it is actually the Reduce that "pulls" the&lt;BR /&gt;map outputs stored across the cluster, rather than something sending&lt;BR /&gt;the data to the reducers (i.e. push). The reducer fetches its specific&lt;BR /&gt;partition file from every executed map that produced one, and&lt;BR /&gt;merge-sorts all these segments before invoking the user's&lt;BR /&gt;reduce(…) function. The merge sorter does not require the entire&lt;BR /&gt;set of segments to fit into memory at once - it does the work in phases&lt;BR /&gt;if it does not have adequate memory.&lt;BR /&gt;&lt;BR /&gt;However, if the entire fetched output does not fit into the allotted&lt;BR /&gt;disk of the reduce task host, the reduce task will fail. We try&lt;BR /&gt;to estimate this and avoid scheduling reduces on such a host, but if no host&lt;BR /&gt;can fit the aggregate data, then you will likely want to increase the&lt;BR /&gt;number of reducers (partitions) to divide up the amount of data&lt;BR /&gt;received per reduce task as a natural solution.&lt;BR /&gt;&lt;BR /&gt;</description>
      <pubDate>Sun, 27 Jul 2014 13:43:10 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Archives-of-Support-Questions/NewBee-Question-on-Map-reduce/m-p/16040#M2383</guid>
      <dc:creator>Harsh J</dc:creator>
      <dc:date>2014-07-27T13:43:10Z</dc:date>
    </item>
  </channel>
</rss>