Support Questions

gsrao_cse · ‎09-19-2016

I have loaded 1 GB file to HDFS and then created hive table on top of this.

Details:

Block size =2MB (Here we have configured block size as 2Mb for the sake of checking these kind of scenarios)

Split size=128 Mb

When I fire a SELECT * FROM tablename, I see 9 mapper jobs are launched.

I have read many places like there will not be any map jobs for select * from table.

Could some one explain why map jobs are launched in this case

jknulst · ‎09-19-2016

@srinivasa rao

This behaviour is directed by some of the hive performance tuning settings of the hive.fetch.* family. They decide on whether a shortcut to just go at the (table)file in HDFS without any MR/Tez is wanted and/or feasible.

There are a few of them:

hive.fetch.task.conversion

hive.fetch.task.conversion.threshold

hive.fetch.task.aggr

The default is hive.fetch.task.conversion=more and it means that going straight at the data (without spinning up mappers) is default. It works even if you query for only 1 col out of many.

If it is set to none or minimal then you probably need to put in the limit x clause to have the same bypass of any map functions. I think your env does not have it set to more or the threshold value is too low.

There is some more info about these settings here

View solution in original post

jknulst · ‎09-19-2016

@srinivasa rao

This behaviour is directed by some of the hive performance tuning settings of the hive.fetch.* family. They decide on whether a shortcut to just go at the (table)file in HDFS without any MR/Tez is wanted and/or feasible.

There are a few of them:

hive.fetch.task.conversion

hive.fetch.task.conversion.threshold

hive.fetch.task.aggr

The default is hive.fetch.task.conversion=more and it means that going straight at the data (without spinning up mappers) is default. It works even if you query for only 1 col out of many.

If it is set to none or minimal then you probably need to put in the limit x clause to have the same bypass of any map functions. I think your env does not have it set to more or the threshold value is too low.

There is some more info about these settings here

gsrao_cse · ‎09-19-2016

@Jasper

Below are my configurations at cluster level.

it is still launching map job when I run SELECT * FROM tablename;

jknulst · ‎09-19-2016

@srinivasa rao

Play with the threshold value. Set it to a higher value (2GB)

jknulst · ‎09-19-2016

@srinivasa rao

If you have HDFS block size set to 2MB, then split size will also be 2MB. These 2 entities are connected.

gsrao_cse · ‎09-19-2016

@Jasper,

Split size is not equivalent to block size. Split size is configurable and its advisable that split size should be greater than block size and splits will always be done for reducing the no.of mapper tasks.

rajkumar_singh · ‎09-19-2016

@srinivasa rao you are seeing 9 mapper due to tezsplitgrouper which actually groups the no of original splits for better parallelism,this is a nice article explaining how initial task parallelism works https://cwiki.apache.org/confluence/display/TEZ/How+initial+task+parallelism+works

cstanca · ‎09-19-2016

@srinivasa rao

I guess you read about when you perform a "select * from <tablename>", Hive fetches the whole data from file as a FetchTask rather than a mapreduce task which just dumps the data as it is without doing anything on it, similar to "hadoop dfs -text <filename>"

However, the above does not take advantage of the true parallelism. In your case, for 1 GB will not make the difference, but image a 100 TB table and you do use a single threaded task in a cluster with 1000 nodes. FetchTask is not a good use of parallelism. Tez provides some options to split the data set to allow true parallelism.

tez.grouping.max-size and tez.grouping.min-size are split parameters.

Ref: https://docs.hortonworks.com/HDPDocuments/HDP2/HDP-2.3.4/bk_installing_manually_book/content/ref-ffe...

If any of the responses was helpful, please don't forget to vote/accept the answer.

Cloudera Community

Support Questions

Why Map job is launched when I run SELECT * FROM tablename;