Why Map job is launched when I run SELECT * FROM tablename;

gsrao_cse — Sun, 18 Aug 2019 12:10:31 GMT

I loaded a 1 GB file into HDFS and then created a Hive table on top of it.

Details:

Block size = 2 MB (we configured a 2 MB block size specifically to test scenarios like this)

Split size = 128 MB

When I run SELECT * FROM tablename, I see 9 map tasks launched.

I have read in many places that no map tasks are launched for SELECT * FROM table.

Could someone explain why map tasks are launched in this case?

Re: Why Map job is launched when I run SELECT * FROM tablename;

jknulst — Mon, 19 Sep 2016 21:35:39 GMT

@srinivasa rao

This behaviour is controlled by Hive's performance tuning settings in the hive.fetch.* family. They determine whether Hive takes a shortcut and reads the table's files in HDFS directly, without any MR/Tez job, when that is wanted and feasible.

There are a few of them:

hive.fetch.task.conversion

hive.fetch.task.conversion.threshold

hive.fetch.task.aggr

The default (since Hive 0.14) is hive.fetch.task.conversion=more, which means that going straight at the data (without spinning up mappers) is the default behaviour. It works even if you query only 1 column out of many.

If it is set to none, the shortcut is disabled. If it is set to minimal, it applies only to simple queries (SELECT *, filters on partition columns, LIMIT), so other queries will not bypass the map phase. I think your environment either does not have it set to more or the threshold value is too low.
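To see which of these settings is in play, something like the following could be run in the Hive session. This is only a sketch: tablename is the table from the question, and the values printed will depend on your environment.

```sql
-- Print the current values (SET with a key and no value shows the setting)
SET hive.fetch.task.conversion;
SET hive.fetch.task.conversion.threshold;

-- Override for this session only
SET hive.fetch.task.conversion=more;

-- Inspect the plan: a plan with only a Fetch Operator means no map tasks,
-- while a Map stage in the plan means mappers will be launched
EXPLAIN SELECT * FROM tablename;
```

Comparing the EXPLAIN output before and after changing a setting is a quick way to tell whether the fetch shortcut is being applied.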

There is some more info about these settings here

Re: Why Map job is launched when I run SELECT * FROM tablename;

jknulst — Mon, 19 Sep 2016 21:39:33 GMT

@srinivasa rao

If you have the HDFS block size set to 2 MB, then the split size will also be 2 MB. The two are connected.

Re: Why Map job is launched when I run SELECT * FROM tablename;

gsrao_cse — Mon, 19 Sep 2016 22:53:47 GMT

@Jasper,

Split size is not the same as block size. Split size is configurable, and it is advisable to make it larger than the block size; larger splits reduce the number of mapper tasks.
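As a rough illustration of how the two can differ: Hadoop's FileInputFormat picks the split size as max(minSize, min(maxSize, blockSize)). With the default min/max settings the split equals the block size, and raising the minimum above the block size gives larger splits. The sketch below assumes the input format in use honours these properties (with Hive on Tez, split grouping may change the final task count further); the 128 MB value is the one from the question.

```sql
-- splitSize = max(minSize, min(maxSize, blockSize))
-- With a 2 MB block size, a 128 MB minimum forces 128 MB splits
SET mapreduce.input.fileinputformat.split.minsize=134217728;  -- 128 MB in bytes
SET mapreduce.input.fileinputformat.split.maxsize=134217728;  -- 128 MB in bytes
```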

Re: Why Map job is launched when I run SELECT * FROM tablename;

gsrao_cse — Sun, 18 Aug 2019 12:10:23 GMT

@Jasper

Below are my configurations at cluster level.

It is still launching a map job when I run SELECT * FROM tablename;

Re: Why Map job is launched when I run SELECT * FROM tablename;

rajkumar_singh — Mon, 19 Sep 2016 23:22:24 GMT

@srinivasa rao You are seeing 9 mappers because of the Tez split grouper (tezsplitgrouper), which groups the original splits to get better parallelism. This is a nice article explaining how initial task parallelism works: https://cwiki.apache.org/confluence/display/TEZ/How+initial+task+parallelism+works

Re: Why Map job is launched when I run SELECT * FROM tablename;

jknulst — Mon, 19 Sep 2016 23:48:11 GMT

@srinivasa rao

Play with the threshold value (hive.fetch.task.conversion.threshold). Set it to a higher value, for example 2 GB.
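A minimal sketch of that suggestion, assuming the threshold is set per session and given in bytes (tablename is the table from the question):

```sql
-- Raise the fetch-conversion input threshold to 2 GB (value is in bytes)
SET hive.fetch.task.conversion.threshold=2147483648;

-- Re-run the query and check whether mappers are still launched
SELECT * FROM tablename;
```

If the table's input size is at or just above the previous threshold, this may be enough for Hive to use a fetch task instead of map tasks.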

Re: Why Map job is launched when I run SELECT * FROM tablename;

cstanca — Tue, 20 Sep 2016 00:20:31 GMT

@srinivasa rao

I guess you read that when you run "select * from <tablename>", Hive fetches the data straight from the files with a FetchTask rather than a MapReduce task, simply dumping the data as-is without doing any processing on it, similar to "hadoop dfs -text <filename>".

However, that approach does not take advantage of true parallelism. In your case, with 1 GB, it will not make a difference, but imagine a 100 TB table read by a single-threaded task on a cluster with 1000 nodes. FetchTask is not a good use of parallelism. Tez provides some options for splitting the data set to allow true parallelism.

tez.grouping.max-size and tez.grouping.min-size are split parameters.
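For example, these could be adjusted in the Hive session before running the query. The values below are purely illustrative, not recommendations, and the right numbers depend on the cluster and data size.

```sql
-- Lower and upper bounds (in bytes) on the size of a grouped split in Tez
SET tez.grouping.min-size=67108864;   -- 64 MB, illustrative value
SET tez.grouping.max-size=268435456;  -- 256 MB, illustrative value

SELECT * FROM tablename;
```

Smaller grouped splits generally mean more map tasks, and larger ones mean fewer.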

Ref: https://docs.hortonworks.com/HDPDocuments/HDP2/HDP-2.3.4/bk_installing_manually_book/content/ref-ffec9e6b-41f4-47de-b5cd-1403b4c4a7c8.1.html

If any of the responses was helpful, please don't forget to vote/accept the answer.