Created 08-17-2018 03:42 AM
Hi,
I am trying to concatenate small files in Hive partitions, but I found some strange behavior while doing so.
I have many files under the yyyy=2018, mm=7, dd=11 partition. I ran the query below:
alter table dbname.tblName partition (yyyy=2018, mm=7, dd=11) concatenate;
All the small files were concatenated into 2 big files. I wanted to see whether I could concatenate further and end up with a single file.
Strangely, it didn't merge the 2 files into 1 on the first run. Only after running the same query 4 times did it produce a single big file. I don't understand this behavior.
alter table dbname.tblName partition (yyyy=2018, mm=7, dd=11) concatenate;
INFO : Session is already open
INFO : Dag name: hive_
INFO : Status: Running (Executing on YARN cluster with App id AppID)
INFO : Loading data to table dbname.tblName partition (yyyy=2018, mm=7, dd=11) from /apps/hive/warehouse/dbname.db/tblName/yyyy=2018/mm=7/dd=11/.hive-staging_hive_2018-08-16_21-27-52_556_544797697765237034-149145/-ext-10000
INFO : Partition dbname.tblName{yyyy=2018, mm=7, dd=11} stats: [numFiles=2, numRows=74319, totalSize=1629690, rawDataSize=80710514]
No rows affected (5.008 seconds)

alter table dbname.tblName partition (yyyy=2018, mm=7, dd=11) concatenate;
INFO : Session is already open
INFO : Dag name: hive_
INFO : Status: Running (Executing on YARN cluster with App id AppID)
INFO : Loading data to table dbname.tblName partition (yyyy=2018, mm=7, dd=11) from /apps/hive/warehouse/dbname.db/tblName/yyyy=2018/mm=7/dd=11/.hive-staging_hive_2018-08-16_21-27-58_733_1348505315688040528-149145/-ext-10000
INFO : Partition dbname.tblName{yyyy=2018, mm=7, dd=11} stats: [numFiles=2, numRows=74319, totalSize=1629690, rawDataSize=80710514]
No rows affected (1.289 seconds)

alter table dbname.tblName partition (yyyy=2018, mm=7, dd=11) concatenate;
INFO : Session is already open
INFO : Dag name: hive_
INFO : Status: Running (Executing on YARN cluster with App id AppID)
INFO : Loading data to table dbname.tblName partition (yyyy=2018, mm=7, dd=11) from /apps/hive/warehouse/dbname.db/tblName/yyyy=2018/mm=7/dd=11/.hive-staging_hive_2018-08-16_21-28-01_294_168641035365555493-149145/-ext-10000
INFO : Partition dbname.tblName{yyyy=2018, mm=7, dd=11} stats: [numFiles=2, numRows=74319, totalSize=1629690, rawDataSize=80710514]
No rows affected (2.368 seconds)

alter table dbname.tblName partition (yyyy=2018, mm=7, dd=11) concatenate;
INFO : Session is already open
INFO : Dag name: hive_
INFO : Status: Running (Executing on YARN cluster with App id AppID)
INFO : Loading data to table dbname.tblName partition (yyyy=2018, mm=7, dd=11) from /apps/hive/warehouse/dbname.db/tblName/yyyy=2018/mm=7/dd=11/.hive-staging_hive_2018-08-16_21-28-04_876_2200942119932282933-149145/-ext-10000
INFO : Partition dbname.tblName{yyyy=2018, mm=7, dd=11} stats: [numFiles=1, numRows=74319, totalSize=1628545, rawDataSize=80710514]
No rows affected (0.877 seconds)
Can anyone throw some light into this?
Created 08-17-2018 07:09 AM
Concatenation depends on which files are chosen first. The ordering of the files is not deterministic with CombineHiveInputFormat, since the grouping happens at the Hadoop layer.
Concatenation will split or combine files depending on whether the ORC file size is greater or smaller than maxSplitSize.
For example, say you have 5 files (64MB, 64MB, 64MB, 64MB, 512MB) and mapreduce.input.fileinputformat.split.minsize=256mb.
This can result in 2 files (256MB, 512MB), or it may result in 3 files (256MB, 256MB, 256MB).
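The order-dependence above can be illustrated with a small simulation (hypothetical Python, not Hive's actual grouping code): files are taken in whatever order the Hadoop layer presents them and greedily combined until a group reaches the minimum split size, so the same five files can yield different results.

```python
def group_files(sizes_mb, min_split_mb):
    """Greedily combine files (in the given order) until a group
    reaches min_split_mb. Toy model of combine-split grouping."""
    groups, current = [], 0
    for size in sizes_mb:
        current += size
        if current >= min_split_mb:
            groups.append(current)
            current = 0
    if current:  # leftover files form a final, smaller group
        groups.append(current)
    return groups

# Same five files, different arrival order, different grouping:
print(group_files([64, 64, 64, 64, 512], 256))   # [256, 512]
print(group_files([64, 512, 64, 64, 64], 256))   # [576, 192]
```

This is only a sketch of why the outcome is non-deterministic; the real grouping in CombineHiveInputFormat also accounts for locality and other factors.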
I have raised a JIRA for this.
An easy solution would be to add a path filter that skips files larger than maxSplitSize.
Created 08-17-2018 02:31 PM
@Naresh P R: Thanks, Naresh. Can you show me how to add a path filter to skip files larger than maxSplitSize?
Created 08-17-2018 03:01 PM
I am still thinking about a solution for the JIRA. This needs to be implemented in code; there is no config option for it at the moment.
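The idea behind such a filter can be sketched as follows (hypothetical Python pseudocode, not the actual Hadoop PathFilter API, which would be implemented in Java): drop any file already at or above maxSplitSize from the candidate set before grouping, so only the genuinely small files get re-concatenated.

```python
def filter_small_files(files, max_split_mb):
    """Keep only files below max_split_mb as concatenation candidates;
    files already at or above the threshold are left untouched."""
    return {path: size for path, size in files.items() if size < max_split_mb}

# Hypothetical partition listing (file name -> size in MB):
partition = {
    "000000_0": 512,  # already large: excluded from concatenation
    "000001_0": 64,
    "000002_0": 64,
}
print(filter_small_files(partition, 256))  # {'000001_0': 64, '000002_0': 64}
```

With already-large files filtered out, repeated concatenation runs would no longer shuffle them into new groups, which is what causes the unstable file counts seen above.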
Created 08-17-2018 05:20 PM
Can maxSplitSize be set globally for the cluster, with a value large enough that those two files would be combined?