Hive:Partitions:Small Files:Concatenate

Hi,

I am trying to concatenate the small files in my Hive partitions, but I ran into some strange behavior while doing so.

I have many files under the yyyy=2018, mm=7, dd=11 partition. When I ran the query below:

alter table dbname.tblName partition (yyyy=2018, mm=7, dd=11) concatenate;

all the small files were concatenated into 2 big files. I wanted to see if I could concatenate further down to a single file.

Strangely, it didn't merge the 2 files into 1 on the next run. Only after running the same query 4 times did everything end up in 1 single big file. I don't understand this behavior.

alter table dbname.tblName partition (yyyy=2018, mm=7, dd=11) concatenate;
INFO  : Session is already open
INFO  : Dag name: hive_
INFO  : Status: Running (Executing on YARN cluster with App id AppID)
INFO  : Loading data to table dbname.tblName partition (yyyy=2018, mm=7, dd=11) from /apps/hive/warehouse/dbname.db/tblName/yyyy=2018/mm=7/dd=11/.hive-staging_hive_2018-08-16_21-27-52_556_544797697765237034-149145/-ext-10000
INFO  : Partition dbname.tblName{yyyy=2018, mm=7, dd=11} stats: [numFiles=2, numRows=74319, totalSize=1629690, rawDataSize=80710514]
No rows affected (5.008 seconds)

alter table dbname.tblName partition (yyyy=2018, mm=7, dd=11) concatenate;
INFO  : Session is already open
INFO  : Dag name: hive_
INFO  : Status: Running (Executing on YARN cluster with App id AppID)
INFO  : Loading data to table dbname.tblName partition (yyyy=2018, mm=7, dd=11) from /apps/hive/warehouse/dbname.db/tblName/yyyy=2018/mm=7/dd=11/.hive-staging_hive_2018-08-16_21-27-58_733_1348505315688040528-149145/-ext-10000
INFO  : Partition dbname.tblName{yyyy=2018, mm=7, dd=11} stats: [numFiles=2, numRows=74319, totalSize=1629690, rawDataSize=80710514]
No rows affected (1.289 seconds)

alter table dbname.tblName partition (yyyy=2018, mm=7, dd=11) concatenate;
INFO  : Session is already open
INFO  : Dag name: hive_
INFO  : Status: Running (Executing on YARN cluster with App id AppID)
INFO  : Loading data to table dbname.tblName partition (yyyy=2018, mm=7, dd=11) from /apps/hive/warehouse/dbname.db/tblName/yyyy=2018/mm=7/dd=11/.hive-staging_hive_2018-08-16_21-28-01_294_168641035365555493-149145/-ext-10000
INFO  : Partition dbname.tblName{yyyy=2018, mm=7, dd=11} stats: [numFiles=2, numRows=74319, totalSize=1629690, rawDataSize=80710514]
No rows affected (2.368 seconds)

alter table dbname.tblName partition (yyyy=2018, mm=7, dd=11) concatenate;
INFO  : Session is already open
INFO  : Dag name: hive_
INFO  : Status: Running (Executing on YARN cluster with App id AppID)
INFO  : Loading data to table dbname.tblName partition (yyyy=2018, mm=7, dd=11) from /apps/hive/warehouse/dbname.db/tblName/yyyy=2018/mm=7/dd=11/.hive-staging_hive_2018-08-16_21-28-04_876_2200942119932282933-149145/-ext-10000
INFO  : Partition dbname.tblName{yyyy=2018, mm=7, dd=11} stats: [numFiles=1, numRows=74319, totalSize=1628545, rawDataSize=80710514]
No rows affected (0.877 seconds)

Can anyone shed some light on this?

4 REPLIES

Expert Contributor

Concatenation depends on which files are chosen first. The ordering of the files is not deterministic with CombineHiveInputFormat, since the grouping happens at the Hadoop layer.

Concatenation will split or combine files depending on whether the ORC file size is greater or less than maxSplitSize.

For example, say you have 5 files (64MB, 64MB, 64MB, 64MB, 512MB) and mapreduce.input.fileinputformat.split.minsize=256mb.

This can result in 2 files (256MB, 512MB), or it may result in 3 files (256MB, 256MB, 256MB).
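The two outcomes above can be sketched with a small toy model (illustrative only; Hive's CombineHiveInputFormat grouping is more involved, the threshold names are assumptions here, and the actual file ordering it sees is not deterministic):

```python
# Toy model of the two possible outcomes for 4x64MB + 1x512MB files.
# NOT Hive's actual algorithm -- just an illustration of how combining
# small files and (optionally) re-splitting large ones changes the count.

MIN_SPLIT_MB = 256   # mapreduce.input.fileinputformat.split.minsize
MAX_SPLIT_MB = 256   # assumed maxSplitSize for this sketch

def combine_small(sizes):
    """Greedily combine files until each group reaches MIN_SPLIT_MB."""
    groups, current = [], 0
    for s in sizes:
        current += s
        if current >= MIN_SPLIT_MB:
            groups.append(current)
            current = 0
    if current:
        groups.append(current)  # leftover group below the threshold
    return groups

def split_large(sizes):
    """Re-split any file larger than MAX_SPLIT_MB into max-sized chunks."""
    out = []
    for s in sizes:
        while s > MAX_SPLIT_MB:
            out.append(MAX_SPLIT_MB)
            s -= MAX_SPLIT_MB
        if s:
            out.append(s)
    return out

files = [64, 64, 64, 64, 512]
small = [s for s in files if s < MIN_SPLIT_MB]

# Outcome 1: the 512MB file is left alone, small files are combined.
print(combine_small(small) + [512])              # -> [256, 512] (2 files)

# Outcome 2: the 512MB file is also re-split at MAX_SPLIT_MB.
print(split_large(combine_small(small) + [512])) # -> [256, 256, 256] (3 files)
```

Which outcome you get depends on how the Hadoop layer happens to group the files on that run, which is why repeated `concatenate` runs can keep changing the file count.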

I raised a JIRA for this.

An easy solution for this would be to add a path filter that skips files larger than maxSplitSize.

@Naresh P R: Thanks, Naresh. Can you show me how to add a path filter to skip files larger than maxSplitSize?

Expert Contributor

I am still thinking about a solution for the JIRA. This needs to be implemented in code; there is no config option to do this for now.
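To make the proposed fix concrete: the selection logic would be roughly as below (a sketch only; in Hive itself this would be a Java PathFilter applied before grouping, and the helper name here is hypothetical):

```python
# Sketch of the proposed path-filter idea: before grouping files for
# concatenation, drop any file already larger than maxSplitSize so the
# merge leaves it untouched. Hypothetical helper, not an existing Hive API.

MAX_SPLIT_MB = 256  # assumed maxSplitSize threshold

def files_to_concatenate(files):
    """Return only the files small enough to be worth concatenating.

    `files` is a list of (path, size_mb) tuples, as a stand-in for a
    directory listing of the partition."""
    return [(path, size) for path, size in files if size <= MAX_SPLIT_MB]

listing = [
    ("dd=11/000000_0", 64),
    ("dd=11/000001_0", 64),
    ("dd=11/000002_0", 512),  # already large: would be skipped
]
print(files_to_concatenate(listing))
# -> [('dd=11/000000_0', 64), ('dd=11/000001_0', 64)]
```

With a filter like this, already-large files would no longer be pulled into (or re-split by) repeated `concatenate` runs.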

Contributor

Can maxSplitSize be set globally for the cluster to allow for a size large enough to combine those two files?
