Created on 03-27-2017 08:00 PM
PROBLEM :
When the property skip.header.line.count is enabled on tables with large text files (~100G), we can see that the
tez.grouping* parameters are ineffective, spinning one container per file.
EXAMPLE: We have a table with 10 files, 117.0 G A.txt 29.0 G B.txt 30.0 G C.txt 1.9 G D.txt 11.0 G E.txt 18.9 G F.txt 5.7 G
G.txt 159.4 H.txt 2.4 G I.txt 214.4 M J.txt
Created table on top of this data, using
CREATE TABLE `tab1`( `opr_glbl_sku_id` string, `opr_loc_nbr` string, `wk_start_dt` string, `base_fcst` float) ... TBLPROPERTIES ( 'skip.header.line.count'='1')
Set these params (below), after which expectation is at least 200G/0.5G= 400 containers should be launched.
SET tez.grouping.min-size=53421772; SET tez.grouping.max-size=534217728;
With tez as the execution engine an attempt to perform select count(*) on the table spins only 10 containers. Once we
remove the skip.header.line.count', it launched about 540 containers, which is expected behavior.
ROOT CAUSE : hive.tez.input format was set to default org.apache.hadoop.hive.ql.io.HiveInputFormat;
HiveInputFormat is a parameterized InputFormat which looks at the path name and determine the correct InputFormat for
that path name from mapredPlan.pathToPartitionInfo(). It can be used to read files with different input format in the same
map-reduce job.
RESOLUTION : CombineInputFormat combine all small files to generate a split. hive.tez.input
format=org.apache.hadoop.hive.ql.io.CombineHiveInputFormat;