Hive Multiple Small Files


Hi,

I have large CSV files (about 10 GB each) that arrive in Hadoop daily, one file per day. I have a Hive external table, Table1, that points to these files (no partitions, not ORC). I have another table, Table2 (external table, ORC with ZLIB), partitioned by date (yyyy-mm-dd) and loaded from Table1 using insert into Table2 partition(columnname) select * from Table1 with hive.exec.dynamic.partition = true. Once compressed via ORC, each daily file comes to <10 MB (the compression ratio was a surprise to me). I have read about the small-files problem in Hadoop on the HW community.
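
For reference, a minimal sketch of the setup described above. The column names (id, payload, event_date), their types, and the HDFS locations are assumptions for illustration only; the real schema is not given in the post. The nonstrict setting is needed only on Hive versions where dynamic partition mode defaults to strict.

```sql
-- Hypothetical schema and paths; only the table names come from the post.
-- Table1: external table over the raw daily CSV files (no partitions, text format).
CREATE EXTERNAL TABLE Table1 (
  id BIGINT,
  payload STRING,
  event_date STRING
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
STORED AS TEXTFILE
LOCATION '/data/landing/table1';

-- Table2: external ORC table with ZLIB compression, partitioned by date (yyyy-mm-dd).
CREATE EXTERNAL TABLE Table2 (
  id BIGINT,
  payload STRING
)
PARTITIONED BY (event_date STRING)
STORED AS ORC
LOCATION '/data/warehouse/table2'
TBLPROPERTIES ('orc.compress' = 'ZLIB');

SET hive.exec.dynamic.partition = true;
-- Required in strict mode when every partition column is dynamic.
SET hive.exec.dynamic.partition.mode = nonstrict;

-- The dynamic partition column must be the last column in the SELECT list.
INSERT INTO TABLE Table2 PARTITION (event_date)
SELECT id, payload, event_date FROM Table1;
```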

Are there any additional Hive settings or considerations we should have in place so that we don't run into performance issues caused by many small files?




Super Collaborator

@nikkie_thomas You can set the following if you are using Tez:


set hive.merge.mapfiles=true;
set hive.merge.mapredfiles=true;
set hive.merge.smallfiles.avgsize=<some value>;
set hive.merge.size.per.task=<some value>;
set hive.merge.tezfiles=true;
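
As a rough sketch of how the placeholders might be filled in: the byte values below are illustrative assumptions, not recommendations from this answer, and should be tuned to your HDFS block size and workload.

```sql
-- Illustrative values only; the answer above leaves them unspecified.
-- Merge small output files at the end of a Tez DAG.
SET hive.merge.tezfiles = true;
-- Run the extra merge step when the average output file size is below about 128 MB.
SET hive.merge.smallfiles.avgsize = 134217728;
-- Target size of the merged files, about 256 MB.
SET hive.merge.size.per.task = 268435456;
```

hive.merge.mapfiles and hive.merge.mapredfiles cover map-only and map-reduce jobs respectively, while hive.merge.tezfiles covers Tez. As far as I understand, the merge works on the files within each output partition, so a daily partition that already holds a single <10 MB file would stay that size.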