Community Articles

Find and share helpful community-sourced technical articles.
Announcements
Celebrating as our community reaches 100,000 members! Thank you!
Labels (1)
avatar
Contributor

PROBLEM :

When the property skip.header.line.count is enabled on tables with large text files (~100G), we can see that the

tez.grouping* parameters are ineffective, spinning one container per file.

EXAMPLE: We have a table with 10 files, 117.0 G A.txt 29.0 G B.txt 30.0 G C.txt 1.9 G D.txt 11.0 G E.txt 18.9 G F.txt 5.7 G

G.txt 159.4 H.txt 2.4 G I.txt 214.4 M J.txt

Created table on top of this data, using

CREATE TABLE `tab1`( 
`opr_glbl_sku_id` string, 
`opr_loc_nbr` string, 
`wk_start_dt` string, 
`base_fcst` float) 
...
  
TBLPROPERTIES ( 
'skip.header.line.count'='1') 

Set these params (below), after which expectation is at least 200G/0.5G= 400 containers should be launched.

SET tez.grouping.min-size=53421772; SET tez.grouping.max-size=534217728;

With tez as the execution engine an attempt to perform select count(*) on the table spins only 10 containers. Once we

remove the skip.header.line.count', it launched about 540 containers, which is expected behavior.

ROOT CAUSE : hive.tez.input format was set to default org.apache.hadoop.hive.ql.io.HiveInputFormat;

HiveInputFormat is a parameterized InputFormat which looks at the path name and determine the correct InputFormat for

that path name from mapredPlan.pathToPartitionInfo(). It can be used to read files with different input format in the same

map-reduce job.

RESOLUTION : CombineInputFormat combine all small files to generate a split. hive.tez.input

format=org.apache.hadoop.hive.ql.io.CombineHiveInputFormat;

1,108 Views