Created on 09-19-2016 11:05 AM - edited 08-18-2019 05:10 AM
I have loaded 1 GB file to HDFS and then created hive table on top of this.
Details:
Block size =2MB (Here we have configured block size as 2Mb for the sake of checking these kind of scenarios)
Split size=128 Mb
When I fire a SELECT * FROM tablename, I see 9 mapper jobs are launched.
I have read many places like there will not be any map jobs for select * from table.
Could some one explain why map jobs are launched in this case
Created 09-19-2016 02:35 PM
This behaviour is directed by some of the hive performance tuning settings of the hive.fetch.* family. They decide on whether a shortcut to just go at the (table)file in HDFS without any MR/Tez is wanted and/or feasible.
There are a few of them:
hive.fetch.task.conversion hive.fetch.task.conversion.threshold hive.fetch.task.aggr
The default is hive.fetch.task.conversion=more and it means that going straight at the data (without spinning up mappers) is default. It works even if you query for only 1 col out of many.
If it is set to none or minimal then you probably need to put in the limit x clause to have the same bypass of any map functions. I think your env does not have it set to more or the threshold value is too low.
There is some more info about these settings here
Created 09-19-2016 02:35 PM
This behaviour is directed by some of the hive performance tuning settings of the hive.fetch.* family. They decide on whether a shortcut to just go at the (table)file in HDFS without any MR/Tez is wanted and/or feasible.
There are a few of them:
hive.fetch.task.conversion hive.fetch.task.conversion.threshold hive.fetch.task.aggr
The default is hive.fetch.task.conversion=more and it means that going straight at the data (without spinning up mappers) is default. It works even if you query for only 1 col out of many.
If it is set to none or minimal then you probably need to put in the limit x clause to have the same bypass of any map functions. I think your env does not have it set to more or the threshold value is too low.
There is some more info about these settings here
Created on 09-19-2016 03:56 PM - edited 08-18-2019 05:10 AM
Below are my configurations at cluster level.
it is still launching map job when I run SELECT * FROM tablename;
Created 09-19-2016 04:48 PM
Play with the threshold value. Set it to a higher value (2GB)
Created 09-19-2016 02:39 PM
If you have HDFS block size set to 2MB, then split size will also be 2MB. These 2 entities are connected.
Created 09-19-2016 03:53 PM
Split size is not equivalent to block size. Split size is configurable and its advisable that split size should be greater than block size and splits will always be done for reducing the no.of mapper tasks.
Created 09-19-2016 04:22 PM
@srinivasa rao you are seeing 9 mapper due to tezsplitgrouper which actually groups the no of original splits for better parallelism,this is a nice article explaining how initial task parallelism works https://cwiki.apache.org/confluence/display/TEZ/How+initial+task+parallelism+works
Created 09-19-2016 05:20 PM
I guess you read about when you perform a "select * from <tablename>", Hive fetches the whole data from file as a FetchTask rather than a mapreduce task which just dumps the data as it is without doing anything on it, similar to "hadoop dfs -text <filename>"
However, the above does not take advantage of the true parallelism. In your case, for 1 GB will not make the difference, but image a 100 TB table and you do use a single threaded task in a cluster with 1000 nodes. FetchTask is not a good use of parallelism. Tez provides some options to split the data set to allow true parallelism.
tez.grouping.max-size and tez.grouping.min-size are split parameters.
If any of the responses was helpful, please don't forget to vote/accept the answer.