When the property skip.header.line.count is set on a table backed by large text files (~100G), the
tez.grouping* parameters have no effect and Tez launches one container per file.
We have a table with 10 files:
117.0 G A.txt
29.0 G B.txt
30.0 G C.txt
1.9 G D.txt
11.0 G E.txt
18.9 G F.txt
2.4 G I.txt
214.4 M J.txt
We created a table on top of this data using:
CREATE TABLE `tab1`(
We set these params (below), after which we expected at least 200G/0.5G = 400 containers to be launched.
With Tez as the execution engine, a select count(*) on the table launched only 10 containers.
After removing skip.header.line.count, the same query launched about 540 containers, which is the expected behavior.
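The expected container count can be sanity-checked from the file sizes listed above. The following is only a rough sketch: it assumes every split is grouped to the 0.5G target used in the estimate above, and it uses only the sizes shown in the listing (eight entries appear there). The names FILE_SIZES_G, grouped_splits and per_file_splits are hypothetical helpers for illustration, not part of Hive or Tez.

```python
import math

# Sizes as shown in the listing above, in G (J.txt is 214.4 M, converted to G).
FILE_SIZES_G = {
    "A.txt": 117.0,
    "B.txt": 29.0,
    "C.txt": 30.0,
    "D.txt": 1.9,
    "E.txt": 11.0,
    "F.txt": 18.9,
    "I.txt": 2.4,
    "J.txt": 214.4 / 1024,
}


def grouped_splits(sizes_g, target_split_g):
    """Rough split count if the whole input is grouped at target_split_g per split."""
    return math.ceil(sum(sizes_g) / target_split_g)


def per_file_splits(sizes_g):
    """Split count when grouping has no effect and each file becomes one split."""
    return len(list(sizes_g))


if __name__ == "__main__":
    sizes = FILE_SIZES_G.values()
    print("total size (G):", round(sum(sizes), 1))
    print("splits with 0.5G grouping:", grouped_splits(sizes, 0.5))
    print("splits with one per file:", per_file_splits(sizes))
```

The gap between the two numbers is the symptom described above: with the header-skip property set, the container count tracks the number of files rather than the data volume.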
ROOT CAUSE:
hive.tez.input.format was set to its default, org.apache.hadoop.hive.ql.io.HiveInputFormat.
HiveInputFormat is a parameterized InputFormat which looks at the path name and determines the correct InputFormat for
that path name from mapredPlan.pathToPartitionInfo(). It can be used to read files with different input formats in the same
map-reduce job.
CombineInputFormat combines all small files to generate a split.