<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>question Re: Why Map job is launched when I run SELECT * FROM tablename; in Archives of Support Questions (Read Only)</title>
    <link>https://community.cloudera.com/t5/Archives-of-Support-Questions/Why-Map-job-is-launched-when-I-run-SELECT-FROM-tablename/m-p/158653#M41070</link>
    <description>&lt;A rel="user" href="https://community.cloudera.com/users/12513/jknulst.html" nodeid="12513"&gt;@Jasper&lt;/A&gt;,&lt;P&gt;Split size is not the same as block size. Split size is configurable, and it is advisable to make the split size larger than the block size; splitting is always done with the aim of reducing the number of mapper tasks.&lt;/P&gt;</description>
    <pubDate>Mon, 19 Sep 2016 22:53:47 GMT</pubDate>
    <dc:creator>gsrao_cse</dc:creator>
    <dc:date>2016-09-19T22:53:47Z</dc:date>
    <item>
      <title>Why Map job is launched when I run SELECT * FROM tablename;</title>
      <link>https://community.cloudera.com/t5/Archives-of-Support-Questions/Why-Map-job-is-launched-when-I-run-SELECT-FROM-tablename/m-p/158650#M41067</link>
      <description>&lt;P&gt;I loaded a 1 GB file into HDFS and then created a Hive table on top of it.&lt;/P&gt;&lt;P&gt;Details:&lt;/P&gt;&lt;P&gt;Block size = 2 MB (we configured the block size as 2 MB to test this kind of scenario)&lt;/P&gt;&lt;P&gt;Split size = 128 MB&lt;/P&gt;&lt;P&gt;When I run SELECT * FROM tablename, I see that 9 mappers are launched.&lt;/P&gt;&lt;P&gt;I have read in many places that no map jobs are launched for SELECT * FROM table.&lt;/P&gt;&lt;P&gt;Could someone explain why map jobs are launched in this case?&lt;/P&gt;&lt;P&gt;&lt;span class="lia-inline-image-display-wrapper lia-image-align-inline" image-alt="7755-hive-select.png" style="width: 977px;"&gt;&lt;img src="https://community.cloudera.com/t5/image/serverpage/image-id/20814i363C1EB2884F8625/image-size/medium?v=v2&amp;amp;px=400" role="button" title="7755-hive-select.png" alt="7755-hive-select.png" /&gt;&lt;/span&gt;&lt;/P&gt;</description>
      <pubDate>Sun, 18 Aug 2019 12:10:31 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Archives-of-Support-Questions/Why-Map-job-is-launched-when-I-run-SELECT-FROM-tablename/m-p/158650#M41067</guid>
      <dc:creator>gsrao_cse</dc:creator>
      <dc:date>2019-08-18T12:10:31Z</dc:date>
    </item>
    <item>
      <title>Re: Why Map job is launched when I run SELECT * FROM tablename;</title>
      <link>https://community.cloudera.com/t5/Archives-of-Support-Questions/Why-Map-job-is-launched-when-I-run-SELECT-FROM-tablename/m-p/158651#M41068</link>
      <description>&lt;P&gt;&lt;A rel="user" href="https://community.cloudera.com/users/11566/gsraocse.html" nodeid="11566"&gt;@srinivasa  rao&lt;/A&gt; &lt;/P&gt;&lt;P&gt;This behaviour is controlled by some of the Hive performance tuning settings in the &lt;STRONG&gt;&lt;EM&gt;hive.fetch.*&lt;/EM&gt;&lt;/STRONG&gt; family. They decide whether a shortcut that reads the table file directly from HDFS, without any MR/Tez job, is wanted and/or feasible.&lt;/P&gt;&lt;P&gt;There are a few of them:&lt;/P&gt;&lt;PRE&gt;hive.fetch.task.conversion

hive.fetch.task.conversion.threshold

hive.fetch.task.aggr&lt;/PRE&gt;&lt;P&gt;The default is &lt;STRONG&gt;&lt;EM&gt;hive.fetch.task.conversion=more &lt;/EM&gt;&lt;/STRONG&gt;which means that reading the data directly (without spinning up mappers) is the default. It works even if you query only 1 column out of many. &lt;/P&gt;&lt;P&gt;If it is set to &lt;STRONG&gt;&lt;EM&gt;none&lt;/EM&gt;&lt;/STRONG&gt; or &lt;STRONG&gt;&lt;EM&gt;minimal&lt;/EM&gt;&lt;/STRONG&gt; then you probably need to add a &lt;STRONG&gt;&lt;EM&gt;limit x&lt;/EM&gt;&lt;/STRONG&gt; clause to get the same bypass of any map functions. I think your environment either does not have it set to &lt;STRONG&gt;&lt;EM&gt;more&lt;/EM&gt;&lt;/STRONG&gt; or has a threshold value that is too low.&lt;/P&gt;&lt;P&gt;There is some more info about these settings &lt;A href="https://vcfvct.wordpress.com/2016/02/18/make-hive-query-faster-with-fetch-task/"&gt;here&lt;/A&gt;.&lt;/P&gt;</description>
      <pubDate>Mon, 19 Sep 2016 21:35:39 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Archives-of-Support-Questions/Why-Map-job-is-launched-when-I-run-SELECT-FROM-tablename/m-p/158651#M41068</guid>
      <dc:creator>jknulst</dc:creator>
      <dc:date>2016-09-19T21:35:39Z</dc:date>
    </item>
    <item>
      <title>Re: Why Map job is launched when I run SELECT * FROM tablename;</title>
      <link>https://community.cloudera.com/t5/Archives-of-Support-Questions/Why-Map-job-is-launched-when-I-run-SELECT-FROM-tablename/m-p/158652#M41069</link>
      <description>&lt;P&gt;&lt;A rel="user" href="https://community.cloudera.com/users/11566/gsraocse.html" nodeid="11566"&gt;@srinivasa  rao&lt;/A&gt; &lt;/P&gt;&lt;P&gt;If you have the HDFS block size set to 2 MB, then the split size will also be 2 MB. These two settings are connected.&lt;/P&gt;</description>
      <pubDate>Mon, 19 Sep 2016 21:39:33 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Archives-of-Support-Questions/Why-Map-job-is-launched-when-I-run-SELECT-FROM-tablename/m-p/158652#M41069</guid>
      <dc:creator>jknulst</dc:creator>
      <dc:date>2016-09-19T21:39:33Z</dc:date>
    </item>
    <item>
      <title>Re: Why Map job is launched when I run SELECT * FROM tablename;</title>
      <link>https://community.cloudera.com/t5/Archives-of-Support-Questions/Why-Map-job-is-launched-when-I-run-SELECT-FROM-tablename/m-p/158653#M41070</link>
      <description>&lt;A rel="user" href="https://community.cloudera.com/users/12513/jknulst.html" nodeid="12513"&gt;@Jasper&lt;/A&gt;,&lt;P&gt;Split size is not the same as block size. Split size is configurable, and it is advisable to make the split size larger than the block size; splitting is always done with the aim of reducing the number of mapper tasks.&lt;/P&gt;</description>
      <pubDate>Mon, 19 Sep 2016 22:53:47 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Archives-of-Support-Questions/Why-Map-job-is-launched-when-I-run-SELECT-FROM-tablename/m-p/158653#M41070</guid>
      <dc:creator>gsrao_cse</dc:creator>
      <dc:date>2016-09-19T22:53:47Z</dc:date>
    </item>
    <item>
      <title>Re: Why Map job is launched when I run SELECT * FROM tablename;</title>
      <link>https://community.cloudera.com/t5/Archives-of-Support-Questions/Why-Map-job-is-launched-when-I-run-SELECT-FROM-tablename/m-p/158654#M41071</link>
      <description>&lt;P&gt;&lt;A rel="user" href="https://community.cloudera.com/users/12513/jknulst.html" nodeid="12513" target="_blank"&gt;@Jasper&lt;/A&gt; &lt;/P&gt;&lt;P&gt;Below are my configurations at the cluster level.&lt;/P&gt;&lt;P&gt;&lt;span class="lia-inline-image-display-wrapper lia-image-align-inline" image-alt="7783-fetch.png" style="width: 875px;"&gt;&lt;img src="https://community.cloudera.com/t5/image/serverpage/image-id/20813iA5009F411D528653/image-size/medium?v=v2&amp;amp;px=400" role="button" title="7783-fetch.png" alt="7783-fetch.png" /&gt;&lt;/span&gt;&lt;/P&gt;&lt;P&gt;A map job is still launched when I run SELECT * FROM tablename;&lt;/P&gt;</description>
      <pubDate>Sun, 18 Aug 2019 12:10:23 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Archives-of-Support-Questions/Why-Map-job-is-launched-when-I-run-SELECT-FROM-tablename/m-p/158654#M41071</guid>
      <dc:creator>gsrao_cse</dc:creator>
      <dc:date>2019-08-18T12:10:23Z</dc:date>
    </item>
    <item>
      <title>Re: Why Map job is launched when I run SELECT * FROM tablename;</title>
      <link>https://community.cloudera.com/t5/Archives-of-Support-Questions/Why-Map-job-is-launched-when-I-run-SELECT-FROM-tablename/m-p/158655#M41072</link>
      <description>&lt;P&gt;&lt;A rel="user" href="https://community.cloudera.com/users/11566/gsraocse.html" nodeid="11566"&gt;@srinivasa rao&lt;/A&gt; you are seeing 9 mappers because of tezsplitgrouper, which groups the original splits for better parallelism. This article explains how initial task parallelism works: &lt;A href="https://cwiki.apache.org/confluence/display/TEZ/How+initial+task+parallelism+works" target="_blank"&gt;https://cwiki.apache.org/confluence/display/TEZ/How+initial+task+parallelism+works&lt;/A&gt;&lt;/P&gt;</description>
      <pubDate>Mon, 19 Sep 2016 23:22:24 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Archives-of-Support-Questions/Why-Map-job-is-launched-when-I-run-SELECT-FROM-tablename/m-p/158655#M41072</guid>
      <dc:creator>rajkumar_singh</dc:creator>
      <dc:date>2016-09-19T23:22:24Z</dc:date>
    </item>
    <item>
      <title>Re: Why Map job is launched when I run SELECT * FROM tablename;</title>
      <link>https://community.cloudera.com/t5/Archives-of-Support-Questions/Why-Map-job-is-launched-when-I-run-SELECT-FROM-tablename/m-p/158656#M41073</link>
      <description>&lt;P&gt;&lt;A rel="user" href="https://community.cloudera.com/users/11566/gsraocse.html" nodeid="11566"&gt;@srinivasa  rao&lt;/A&gt; &lt;/P&gt;&lt;P&gt;Experiment with the threshold value (hive.fetch.task.conversion.threshold). Set it to a higher value (2 GB).&lt;/P&gt;</description>
      <pubDate>Mon, 19 Sep 2016 23:48:11 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Archives-of-Support-Questions/Why-Map-job-is-launched-when-I-run-SELECT-FROM-tablename/m-p/158656#M41073</guid>
      <dc:creator>jknulst</dc:creator>
      <dc:date>2016-09-19T23:48:11Z</dc:date>
    </item>
    <item>
      <title>Re: Why Map job is launched when I run SELECT * FROM tablename;</title>
      <link>https://community.cloudera.com/t5/Archives-of-Support-Questions/Why-Map-job-is-launched-when-I-run-SELECT-FROM-tablename/m-p/158657#M41074</link>
      <description>&lt;P&gt;@&lt;A href="https://community.hortonworks.com/users/11566/gsraocse.html"&gt;srinivasa rao&lt;/A&gt;&lt;/P&gt;&lt;P&gt;I guess you read that when you run "select * from &amp;lt;tablename&amp;gt;", Hive fetches all the data from the file with a FetchTask rather than a MapReduce task. The FetchTask just dumps the data as it is without doing anything to it, similar to "hadoop dfs -text &amp;lt;filename&amp;gt;".&lt;/P&gt;&lt;P&gt;However, that approach does not take advantage of true parallelism. In your case, with 1 GB, it will not make a difference, but imagine a 100 TB table read by a single-threaded task in a cluster with 1000 nodes. FetchTask is not a good use of parallelism. Tez provides some options to split the data set to allow true parallelism.&lt;/P&gt;&lt;P&gt;tez.grouping.max-size and tez.grouping.min-size are the split parameters.&lt;/P&gt;&lt;P&gt;Ref: &lt;A href="https://docs.hortonworks.com/HDPDocuments/HDP2/HDP-2.3.4/bk_installing_manually_book/content/ref-ffec9e6b-41f4-47de-b5cd-1403b4c4a7c8.1.html" target="_blank"&gt;https://docs.hortonworks.com/HDPDocuments/HDP2/HDP-2.3.4/bk_installing_manually_book/content/ref-ffec9e6b-41f4-47de-b5cd-1403b4c4a7c8.1.html&lt;/A&gt;&lt;/P&gt;&lt;P&gt;If any of the responses was helpful, please don't forget to vote/accept the answer.&lt;/P&gt;</description>
      <pubDate>Tue, 20 Sep 2016 00:20:31 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Archives-of-Support-Questions/Why-Map-job-is-launched-when-I-run-SELECT-FROM-tablename/m-p/158657#M41074</guid>
      <dc:creator>cstanca</dc:creator>
      <dc:date>2016-09-20T00:20:31Z</dc:date>
    </item>
  </channel>
</rss>

