Hive:Partitions:Small Files:Concatenate

New Contributor

Hi,

I am trying to concatenate the small files in a Hive partition, but I see strange behavior when I do so.

I have many files under the yyyy=2018, mm=7, dd=11 partition. When I ran the query below:

alter table dbname.tblName partition (yyyy=2018, mm=7, dd=11) concatenate;

all the small files were concatenated into 2 big files. I wanted to see whether I could concatenate further to get a single file.

Strangely, the first run did not merge the 2 files into 1. Only after running the same query 4 times did I get 1 single big file. I don't understand this behavior.

alter table dbname.tblName partition (yyyy=2018, mm=7, dd=11) concatenate;
INFO  : Session is already open
INFO  : Dag name: hive_
INFO  : Status: Running (Executing on YARN cluster with App id AppID)
INFO  : Loading data to table dbname.tblName partition (yyyy=2018, mm=7, dd=11) from /apps/hive/warehouse/dbname.db/tblName/yyyy=2018/mm=7/dd=11/.hive-staging_hive_2018-08-16_21-27-52_556_544797697765237034-149145/-ext-10000
INFO  : Partition dbname.tblName{yyyy=2018, mm=7, dd=11} stats: [numFiles=2, numRows=74319, totalSize=1629690, rawDataSize=80710514]
No rows affected (5.008 seconds)

alter table dbname.tblName partition (yyyy=2018, mm=7, dd=11) concatenate;
INFO  : Session is already open
INFO  : Dag name: hive_
INFO  : Status: Running (Executing on YARN cluster with App id AppID)
INFO  : Loading data to table dbname.tblName partition (yyyy=2018, mm=7, dd=11) from /apps/hive/warehouse/dbname.db/tblName/yyyy=2018/mm=7/dd=11/.hive-staging_hive_2018-08-16_21-27-58_733_1348505315688040528-149145/-ext-10000
INFO  : Partition dbname.tblName{yyyy=2018, mm=7, dd=11} stats: [numFiles=2, numRows=74319, totalSize=1629690, rawDataSize=80710514]
No rows affected (1.289 seconds)

alter table dbname.tblName partition (yyyy=2018, mm=7, dd=11) concatenate;
INFO  : Session is already open
INFO  : Dag name: hive_
INFO  : Status: Running (Executing on YARN cluster with App id AppID)
INFO  : Loading data to table dbname.tblName partition (yyyy=2018, mm=7, dd=11) from /apps/hive/warehouse/dbname.db/tblName/yyyy=2018/mm=7/dd=11/.hive-staging_hive_2018-08-16_21-28-01_294_168641035365555493-149145/-ext-10000
INFO  : Partition dbname.tblName{yyyy=2018, mm=7, dd=11} stats: [numFiles=2, numRows=74319, totalSize=1629690, rawDataSize=80710514]
No rows affected (2.368 seconds)

alter table dbname.tblName partition (yyyy=2018, mm=7, dd=11) concatenate;
INFO  : Session is already open
INFO  : Dag name: hive_
INFO  : Status: Running (Executing on YARN cluster with App id AppID)
INFO  : Loading data to table dbname.tblName partition (yyyy=2018, mm=7, dd=11) from /apps/hive/warehouse/dbname.db/tblName/yyyy=2018/mm=7/dd=11/.hive-staging_hive_2018-08-16_21-28-04_876_2200942119932282933-149145/-ext-10000
INFO  : Partition dbname.tblName{yyyy=2018, mm=7, dd=11} stats: [numFiles=1, numRows=74319, totalSize=1628545, rawDataSize=80710514]
No rows affected (0.877 seconds)


Can anyone shed some light on this?

4 REPLIES

Re: Hive:Partitions:Small Files:Concatenate

Expert Contributor

The result of concatenation depends on which files are picked first. With CombineHiveInputFormat the ordering of the files is not deterministic, because the grouping happens at the Hadoop layer.

Concatenation splits or combines files depending on whether the ORC file size is greater than or less than maxSplitSize.

For example, say you have 5 files of 64 MB, 64 MB, 64 MB, 64 MB and 512 MB, and mapreduce.input.fileinputformat.split.minsize is set to 256 MB.

This can result in 2 files (256 MB and 512 MB), or it may result in 3 files (256 MB, 256 MB and 256 MB).

I have raised a JIRA for this.

An easy fix would be to add a path filter that skips files larger than maxSplitSize.
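
For reference, the split-size setting from the example above can be applied per session before running the concatenation. This is only a minimal sketch: it assumes the property is given in bytes (256 MB = 268435456), and pairing it with mapreduce.input.fileinputformat.split.maxsize is an assumption of this sketch, not something stated in the thread. Whether CONCATENATE honours these values, and whether they lead to a single output file, may depend on the Hive version and execution engine.

```sql
-- Session-level split sizes, in bytes (256 * 1024 * 1024 = 268435456).
-- minsize is the property named in the example above; maxsize is added
-- here as an assumption to cap the combined split at the same size.
SET mapreduce.input.fileinputformat.split.minsize=268435456;
SET mapreduce.input.fileinputformat.split.maxsize=268435456;

-- Same statement as in the original question.
ALTER TABLE dbname.tblName PARTITION (yyyy=2018, mm=7, dd=11) CONCATENATE;

-- Check numFiles and totalSize in the partition parameters afterwards.
DESCRIBE FORMATTED dbname.tblName PARTITION (yyyy=2018, mm=7, dd=11);
```

Because the file ordering is not deterministic, the file count reported after one run may differ from the next, as the logs in the question show.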

Re: Hive:Partitions:Small Files:Concatenate

New Contributor

@Naresh P R: Thanks, Naresh. Can you show me how to add a path filter to skip files larger than maxSplitSize?

Re: Hive:Partitions:Small Files:Concatenate

Expert Contributor

I am working on a solution for the JIRA. This needs to be implemented in code; there is no configuration option for it at the moment.

Re: Hive:Partitions:Small Files:Concatenate

Contributor

Can maxSplitSize be set globally for the cluster to allow for a size large enough to combine those two files?