
Hive:Partitions:Small Files:Concatenate


Hi,

I am trying to concatenate the small files in my Hive partitions, but I noticed some strange behavior while doing so.

I have many files under the yyyy=2018, mm=7, dd=11 partition. When I ran the query below:

alter table dbname.tblName partition (yyyy=2018, mm=7, dd=11) concatenate;

all the small files were concatenated into 2 larger files. I wanted to see whether I could concatenate further and end up with a single file.

Strangely, it didn't merge the 2 files into 1 on the first run. Only after running the same query 4 times did the partition end up with a single big file. I don't understand this behavior.

alter table dbname.tblName partition (yyyy=2018, mm=7, dd=11) concatenate;
INFO  : Session is already open
INFO  : Dag name: hive_
INFO  : Status: Running (Executing on YARN cluster with App id AppID)
INFO  : Loading data to table dbname.tblName partition (yyyy=2018, mm=7, dd=11) from /apps/hive/warehouse/dbname.db/tblName/yyyy=2018/mm=7/dd=11/.hive-staging_hive_2018-08-16_21-27-52_556_544797697765237034-149145/-ext-10000
INFO  : Partition dbname.tblName{yyyy=2018, mm=7, dd=11} stats: [numFiles=2, numRows=74319, totalSize=1629690, rawDataSize=80710514]
No rows affected (5.008 seconds)




alter table dbname.tblName partition (yyyy=2018, mm=7, dd=11) concatenate;
INFO  : Session is already open
INFO  : Dag name: hive_
INFO  : Status: Running (Executing on YARN cluster with App id AppID)
INFO  : Loading data to table dbname.tblName partition (yyyy=2018, mm=7, dd=11) from /apps/hive/warehouse/dbname.db/tblName/yyyy=2018/mm=7/dd=11/.hive-staging_hive_2018-08-16_21-27-58_733_1348505315688040528-149145/-ext-10000
INFO  : Partition dbname.tblName{yyyy=2018, mm=7, dd=11} stats: [numFiles=2, numRows=74319, totalSize=1629690, rawDataSize=80710514]
No rows affected (1.289 seconds)
 
 
 
alter table dbname.tblName partition (yyyy=2018, mm=7, dd=11) concatenate;
INFO  : Session is already open
INFO  : Dag name: hive_
INFO  : Status: Running (Executing on YARN cluster with App id AppID)
INFO  : Loading data to table dbname.tblName partition (yyyy=2018, mm=7, dd=11) from /apps/hive/warehouse/dbname.db/tblName/yyyy=2018/mm=7/dd=11/.hive-staging_hive_2018-08-16_21-28-01_294_168641035365555493-149145/-ext-10000
INFO  : Partition dbname.tblName{yyyy=2018, mm=7, dd=11} stats: [numFiles=2, numRows=74319, totalSize=1629690, rawDataSize=80710514]
No rows affected (2.368 seconds)
 
 


 alter table dbname.tblName partition (yyyy=2018, mm=7, dd=11) concatenate;
INFO  : Session is already open
INFO  : Dag name: hive_
INFO  : Status: Running (Executing on YARN cluster with App id AppID)
INFO  : Loading data to table dbname.tblName partition (yyyy=2018, mm=7, dd=11) from /apps/hive/warehouse/dbname.db/tblName/yyyy=2018/mm=7/dd=11/.hive-staging_hive_2018-08-16_21-28-04_876_2200942119932282933-149145/-ext-10000
INFO  : Partition dbname.tblName{yyyy=2018, mm=7, dd=11} stats: [numFiles=1, numRows=74319, totalSize=1628545, rawDataSize=80710514]
No rows affected (0.877 seconds)
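For reference, the numFiles values above come from the partition stats that each run prints; the counts can also be checked directly by listing the partition directory from the Hive shell. A sketch (the path is the partition directory from the logs above):

-- sketch: list the files currently in the partition directory
dfs -ls /apps/hive/warehouse/dbname.db/tblName/yyyy=2018/mm=7/dd=11/;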


Can anyone throw some light into this?

4 REPLIES

Expert Contributor

Concatenation depends on which files are chosen first. The ordering of the files is not deterministic with CombineHiveInputFormat, since the grouping happens at the Hadoop layer.

Concatenation will split or combine files depending on whether the ORC file size is greater or smaller than maxSplitSize.

For example, say you have 5 files (64 MB, 64 MB, 64 MB, 64 MB, 512 MB) and mapreduce.input.fileinputformat.split.minsize=256mb.

This can result in 2 files (256 MB and 512 MB), or it may result in 3 files (256 MB, 256 MB, 256 MB).
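If the grouping really is driven by those split settings (an assumption on my part, not something I have verified for concatenate), one thing to try is raising the split sizes for the session before re-running the statement, so that both remaining files can fall into a single group. A sketch, using an arbitrary illustrative value of 1 GB:

-- Sketch only: 1073741824 (1 GB) is an illustrative value, not a recommendation.
set mapreduce.input.fileinputformat.split.minsize=1073741824;
set mapreduce.input.fileinputformat.split.maxsize=1073741824;
alter table dbname.tblName partition (yyyy=2018, mm=7, dd=11) concatenate;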

I raised a JIRA for this.

An easy solution for this would be to add a path filter that skips files larger than maxSplitSize.


@Naresh P R: Thanks, Naresh. Can you show me how to add a path filter to skip files larger than maxSplitSize?

Expert Contributor

I am still thinking about a solution for the JIRA. It needs to be implemented in code; there is no config that does this for now.

Rising Star

Can maxSplitSize be set globally for the cluster to allow for a size large enough to combine those two files?