Created 08-17-2018 03:42 AM
Hi,
I am trying to concatenate small files in Hive partitions, but I found some strange behavior while doing so.
I have many files under the yyyy=2018, mm=7, dd=11 partition. I ran the query below:
alter table dbname.tblName partition (yyyy=2018, mm=7, dd=11) concatenate;
All the small files were concatenated into 2 big files. I wanted to see whether I could concatenate further and end up with a single file.
Strangely, it didn't merge the 2 files into 1 on the first run. Only after running the same query 4 times did it produce a single big file. I don't understand this behavior.
alter table dbname.tblName partition (yyyy=2018, mm=7, dd=11) concatenate;
INFO : Session is already open
INFO : Dag name: hive_
INFO : Status: Running (Executing on YARN cluster with App id AppID)
INFO : Loading data to table dbname.tblName partition (yyyy=2018, mm=7, dd=11) from /apps/hive/warehouse/dbname.db/tblName/yyyy=2018/mm=7/dd=11/.hive-staging_hive_2018-08-16_21-27-52_556_544797697765237034-149145/-ext-10000
INFO : Partition dbname.tblName{yyyy=2018, mm=7, dd=11} stats: [numFiles=2, numRows=74319, totalSize=1629690, rawDataSize=80710514]
No rows affected (5.008 seconds)

alter table dbname.tblName partition (yyyy=2018, mm=7, dd=11) concatenate;
INFO : Session is already open
INFO : Dag name: hive_
INFO : Status: Running (Executing on YARN cluster with App id AppID)
INFO : Loading data to table dbname.tblName partition (yyyy=2018, mm=7, dd=11) from /apps/hive/warehouse/dbname.db/tblName/yyyy=2018/mm=7/dd=11/.hive-staging_hive_2018-08-16_21-27-58_733_1348505315688040528-149145/-ext-10000
INFO : Partition dbname.tblName{yyyy=2018, mm=7, dd=11} stats: [numFiles=2, numRows=74319, totalSize=1629690, rawDataSize=80710514]
No rows affected (1.289 seconds)

alter table dbname.tblName partition (yyyy=2018, mm=7, dd=11) concatenate;
INFO : Session is already open
INFO : Dag name: hive_
INFO : Status: Running (Executing on YARN cluster with App id AppID)
INFO : Loading data to table dbname.tblName partition (yyyy=2018, mm=7, dd=11) from /apps/hive/warehouse/dbname.db/tblName/yyyy=2018/mm=7/dd=11/.hive-staging_hive_2018-08-16_21-28-01_294_168641035365555493-149145/-ext-10000
INFO : Partition dbname.tblName{yyyy=2018, mm=7, dd=11} stats: [numFiles=2, numRows=74319, totalSize=1629690, rawDataSize=80710514]
No rows affected (2.368 seconds)

alter table dbname.tblName partition (yyyy=2018, mm=7, dd=11) concatenate;
INFO : Session is already open
INFO : Dag name: hive_
INFO : Status: Running (Executing on YARN cluster with App id AppID)
INFO : Loading data to table dbname.tblName partition (yyyy=2018, mm=7, dd=11) from /apps/hive/warehouse/dbname.db/tblName/yyyy=2018/mm=7/dd=11/.hive-staging_hive_2018-08-16_21-28-04_876_2200942119932282933-149145/-ext-10000
INFO : Partition dbname.tblName{yyyy=2018, mm=7, dd=11} stats: [numFiles=1, numRows=74319, totalSize=1628545, rawDataSize=80710514]
No rows affected (0.877 seconds)
Can anyone throw some light into this?
Created 08-17-2018 07:09 AM
Concatenation depends on which files are chosen first. The ordering of the files is not deterministic with CombineHiveInputFormat, since the grouping happens at the Hadoop layer.
Concatenation will split or combine files depending on whether the ORC file size is greater or smaller than maxSplitSize.
For example, say you have 5 files (64MB, 64MB, 64MB, 64MB, 512MB) and mapreduce.input.fileinputformat.split.minsize=256mb.
This can result in 2 files (256MB, 512MB), or it may result in 3 files (256MB, 256MB, 256MB).
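The order-dependence above can be illustrated with a small simulation (hypothetical Python, not Hive's actual grouping code): files are taken in whatever order the Hadoop layer presents them and greedily combined until a group reaches the minimum split size, so the same five files can yield different results.

```python
def group_files(sizes_mb, min_split_mb):
    """Greedily combine files (in the given order) until a group
    reaches min_split_mb. Toy model of combine-split grouping."""
    groups, current = [], 0
    for size in sizes_mb:
        current += size
        if current >= min_split_mb:
            groups.append(current)
            current = 0
    if current:  # leftover files form a final, smaller group
        groups.append(current)
    return groups

# Same five files, different arrival order, different grouping:
print(group_files([64, 64, 64, 64, 512], 256))   # [256, 512]
print(group_files([64, 512, 64, 64, 64], 256))   # [576, 192]
```

This is only a sketch of why the outcome is non-deterministic; the real grouping in CombineHiveInputFormat also accounts for locality and other factors.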
I have raised a JIRA for this.
An easy solution would be to add a path filter that skips files larger than maxSplitSize.
Created 08-17-2018 02:31 PM
@Naresh P R: Thanks, Naresh. Can you show me how to add a path filter to skip files larger than maxSplitSize?
Created 08-17-2018 03:01 PM
I am still thinking about a solution for the JIRA. This needs to be implemented in code; there is no config option for it at the moment.
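The idea behind such a filter can be sketched as follows (hypothetical Python pseudocode, not the actual Hadoop PathFilter API, which would be implemented in Java): drop any file already at or above maxSplitSize from the candidate set before grouping, so only the genuinely small files get re-concatenated.

```python
def filter_small_files(files, max_split_mb):
    """Keep only files below max_split_mb as concatenation candidates;
    files already at or above the threshold are left untouched."""
    return {path: size for path, size in files.items() if size < max_split_mb}

# Hypothetical partition listing (file name -> size in MB):
partition = {
    "000000_0": 512,  # already large: excluded from concatenation
    "000001_0": 64,
    "000002_0": 64,
}
print(filter_small_files(partition, 256))  # {'000001_0': 64, '000002_0': 64}
```

With already-large files filtered out, repeated concatenation runs would no longer shuffle them into new groups, which is what causes the unstable file counts seen above.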
Created 08-17-2018 05:20 PM
Can maxSplitSize be set globally for the cluster, with a value large enough that those two files would be combined?