<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>question Re: Map reduce Flow clarification in Archives of Support Questions (Read Only)</title>
    <link>https://community.cloudera.com/t5/Archives-of-Support-Questions/Map-reduce-Flow-clarification/m-p/236717#M85156</link>
    <description>&lt;P&gt;&lt;B&gt;&lt;EM&gt;@&lt;/EM&gt;&lt;/B&gt;&lt;A href="https://community.hortonworks.com/users/9789/vamsivalivetiedu.html" rel="nofollow noopener noreferrer" target="_blank"&gt;vamsi valiveti&lt;/A&gt;&lt;/P&gt;&lt;P&gt;&lt;B&gt;&lt;EM&gt;Shuffling &lt;/EM&gt;&lt;/B&gt;is the process of transferring data from the mappers to the reducers, so it is clearly necessary for the reducers: without it, they would have no input (or would not receive input from every mapper). Shuffling can start before the map phase has finished, to save time. That's why you can see a reduce status greater than 0% (but less than 33%) when the map status is not yet 100%.&lt;/P&gt;&lt;P&gt;&lt;STRONG&gt;&lt;EM&gt;Sorting &lt;/EM&gt;&lt;/STRONG&gt;saves time for the reducer by making it easy to tell when a new reduce() call should start: a new call begins whenever the next key in the sorted input differs from the previous one. Each reduce task receives a list of key-value pairs, but it has to call the reduce() method, which takes a key and a list of values, so it has to group values by key. 
This is easy to do if the input data is pre-sorted (locally) in the map phase and then simply merge-sorted in the reduce phase (since each reducer gets data from many mappers).&lt;/P&gt;&lt;P&gt;A great source of information for these steps is this &lt;A href="http://developer.yahoo.com/hadoop/tutorial/module4.html#dataflow" rel="nofollow noopener noreferrer" target="_blank"&gt;Yahoo tutorial&lt;/A&gt;.&lt;BR /&gt;A nice graphical representation of this is the following:&lt;/P&gt;&lt;P&gt;&lt;span class="lia-inline-image-display-wrapper lia-image-align-inline" image-alt="93380-mr.png" style="width: 723px;"&gt;&lt;img src="https://community.cloudera.com/t5/image/serverpage/image-id/14346i9A770CB04D82984E/image-size/medium?v=v2&amp;amp;px=400" role="button" title="93380-mr.png" alt="93380-mr.png" /&gt;&lt;/span&gt;&lt;/P&gt;&lt;P&gt;Note that shuffling and sorting are not performed at all if you specify zero reducers (setNumReduceTasks(0)). In that case, the MapReduce job stops at the map phase, and the map phase does not include any kind of sorting (so even the map phase is faster). &lt;A href="https://stackoverflow.com/questions/22141631/what-is-the-purpose-of-shuffling-and-sorting-phase-in-the-reducer-in-map-reduce" target="_blank" rel="nofollow noopener noreferrer"&gt;Ref&lt;/A&gt;&lt;/P&gt;&lt;P&gt;Please accept the answer you found most useful.&lt;/P&gt;</description>
    <pubDate>Sat, 17 Aug 2019 23:28:05 GMT</pubDate>
    <dc:creator>jagadeesan</dc:creator>
    <dc:date>2019-08-17T23:28:05Z</dc:date>
    <item>
      <title>Map reduce Flow clarification</title>
      <link>https://community.cloudera.com/t5/Archives-of-Support-Questions/Map-reduce-Flow-clarification/m-p/236715#M85154</link>
      <description>&lt;P&gt;Which occurs first in the MapReduce flow: shuffling or sorting?&lt;/P&gt;&lt;P&gt;To my knowledge, shuffling occurs first and then sorting. Correct me if I am wrong.&lt;/P&gt;&lt;P&gt;Can anybody explain these two things?&lt;/P&gt;&lt;P&gt;Below is a statement from the Definitive Guide:&lt;/P&gt;&lt;P&gt;MapReduce makes the guarantee that the input to every reducer is sorted by key. The process by which the system performs the sort—and transfers the map outputs to the reducers as inputs—is known as the shuffle.&lt;/P&gt;</description>
      <pubDate>Sat, 24 Nov 2018 19:37:21 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Archives-of-Support-Questions/Map-reduce-Flow-clarification/m-p/236715#M85154</guid>
      <dc:creator>vamsi123</dc:creator>
      <dc:date>2018-11-24T19:37:21Z</dc:date>
    </item>
    <item>
      <title>Re: Map reduce Flow clarification</title>
      <link>https://community.cloudera.com/t5/Archives-of-Support-Questions/Map-reduce-Flow-clarification/m-p/236716#M85155</link>
      <description>&lt;P&gt;Hi experts,&lt;/P&gt;&lt;P&gt;Does anybody have any input on my question?&lt;/P&gt;</description>
      <pubDate>Mon, 26 Nov 2018 20:29:59 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Archives-of-Support-Questions/Map-reduce-Flow-clarification/m-p/236716#M85155</guid>
      <dc:creator>vamsi123</dc:creator>
      <dc:date>2018-11-26T20:29:59Z</dc:date>
    </item>
    <item>
      <title>Re: Map reduce Flow clarification</title>
      <link>https://community.cloudera.com/t5/Archives-of-Support-Questions/Map-reduce-Flow-clarification/m-p/236717#M85156</link>
      <description>&lt;P&gt;&lt;B&gt;&lt;EM&gt;@&lt;/EM&gt;&lt;/B&gt;&lt;A href="https://community.hortonworks.com/users/9789/vamsivalivetiedu.html" rel="nofollow noopener noreferrer" target="_blank"&gt;vamsi valiveti&lt;/A&gt;&lt;/P&gt;&lt;P&gt;&lt;B&gt;&lt;EM&gt;Shuffling &lt;/EM&gt;&lt;/B&gt;is the process of transferring data from the mappers to the reducers, so it is clearly necessary for the reducers: without it, they would have no input (or would not receive input from every mapper). Shuffling can start before the map phase has finished, to save time. That's why you can see a reduce status greater than 0% (but less than 33%) when the map status is not yet 100%.&lt;/P&gt;&lt;P&gt;&lt;STRONG&gt;&lt;EM&gt;Sorting &lt;/EM&gt;&lt;/STRONG&gt;saves time for the reducer by making it easy to tell when a new reduce() call should start: a new call begins whenever the next key in the sorted input differs from the previous one. Each reduce task receives a list of key-value pairs, but it has to call the reduce() method, which takes a key and a list of values, so it has to group values by key. 
This is easy to do if the input data is pre-sorted (locally) in the map phase and then simply merge-sorted in the reduce phase (since each reducer gets data from many mappers).&lt;/P&gt;&lt;P&gt;A great source of information for these steps is this &lt;A href="http://developer.yahoo.com/hadoop/tutorial/module4.html#dataflow" rel="nofollow noopener noreferrer" target="_blank"&gt;Yahoo tutorial&lt;/A&gt;.&lt;BR /&gt;A nice graphical representation of this is the following:&lt;/P&gt;&lt;P&gt;&lt;span class="lia-inline-image-display-wrapper lia-image-align-inline" image-alt="93380-mr.png" style="width: 723px;"&gt;&lt;img src="https://community.cloudera.com/t5/image/serverpage/image-id/14346i9A770CB04D82984E/image-size/medium?v=v2&amp;amp;px=400" role="button" title="93380-mr.png" alt="93380-mr.png" /&gt;&lt;/span&gt;&lt;/P&gt;&lt;P&gt;Note that shuffling and sorting are not performed at all if you specify zero reducers (setNumReduceTasks(0)). In that case, the MapReduce job stops at the map phase, and the map phase does not include any kind of sorting (so even the map phase is faster). &lt;A href="https://stackoverflow.com/questions/22141631/what-is-the-purpose-of-shuffling-and-sorting-phase-in-the-reducer-in-map-reduce" target="_blank" rel="nofollow noopener noreferrer"&gt;Ref&lt;/A&gt;&lt;/P&gt;&lt;P&gt;Please accept the answer you found most useful.&lt;/P&gt;</description>
      <pubDate>Sat, 17 Aug 2019 23:28:05 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Archives-of-Support-Questions/Map-reduce-Flow-clarification/m-p/236717#M85156</guid>
      <dc:creator>jagadeesan</dc:creator>
      <dc:date>2019-08-17T23:28:05Z</dc:date>
    </item>
  </channel>
</rss>

