Community Articles
Find and share helpful community-sourced technical articles
Announcements
Alert: Welcome to the Unified Cloudera Community. Former HCC members be sure to read and learn how to activate your account here.
Labels (1)
Cloudera Employee

PROBLEM :

When the property skip.header.line.count is enabled on tables with large text files (~100G), we can see that the

tez.grouping* parameters are ineffective, spinning one container per file.

EXAMPLE: We have a table with 10 files, 117.0 G A.txt 29.0 G B.txt 30.0 G C.txt 1.9 G D.txt 11.0 G E.txt 18.9 G F.txt 5.7 G

G.txt 159.4 H.txt 2.4 G I.txt 214.4 M J.txt

Created table on top of this data, using

CREATE TABLE `tab1`( 
`opr_glbl_sku_id` string, 
`opr_loc_nbr` string, 
`wk_start_dt` string, 
`base_fcst` float) 
...
  
TBLPROPERTIES ( 
'skip.header.line.count'='1') 

Set these params (below), after which expectation is at least 200G/0.5G= 400 containers should be launched.

SET tez.grouping.min-size=53421772; SET tez.grouping.max-size=534217728;

With tez as the execution engine an attempt to perform select count(*) on the table spins only 10 containers. Once we

remove the skip.header.line.count', it launched about 540 containers, which is expected behavior.

ROOT CAUSE : hive.tez.input format was set to default org.apache.hadoop.hive.ql.io.HiveInputFormat;

HiveInputFormat is a parameterized InputFormat which looks at the path name and determine the correct InputFormat for

that path name from mapredPlan.pathToPartitionInfo(). It can be used to read files with different input format in the same

map-reduce job.

RESOLUTION : CombineInputFormat combine all small files to generate a split. hive.tez.input

format=org.apache.hadoop.hive.ql.io.CombineHiveInputFormat;

327 Views
Don't have an account?
Coming from Hortonworks? Activate your account here
Version history
Revision #:
1 of 1
Last update:
‎03-27-2017 08:00 PM
Updated by:
 
Contributors
Top Kudoed Authors