I have large CSV files (~10 GB each) arriving in Hadoop on a daily basis, one file per day.
I have a Hive external table, Table1, pointing at these files (no partitions, no ORC). I have another external table, Table2 (ORC with ZLIB compression), partitioned by date (yyyy-mm-dd) and loaded from Table1 using
insert into Table2 partition(columnname) select * from Table1
with hive.exec.dynamic.partition = true enabled.
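For reference, the load above looks roughly like this (a sketch of my setup; the table and partition column names are placeholders, and nonstrict mode is needed because the partition value comes from the data rather than a static literal):

```sql
-- Allow the partition value to be taken from the selected data
SET hive.exec.dynamic.partition = true;
SET hive.exec.dynamic.partition.mode = nonstrict;

-- Load from the raw CSV-backed table into the ORC/ZLIB partitioned table;
-- the partition column must be the last column produced by the SELECT
INSERT INTO TABLE Table2 PARTITION (columnname)
SELECT * FROM Table1;
```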
Once compressed to ORC, each daily file comes to under 10 MB (the compression ratio was a surprise to me).
I have read about the small-files problem in Hadoop on the HW community. Are there any additional Hive settings or other considerations I should put in place so that we don't run into performance issues caused by many small files?
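For context, these are the merge-related settings I have seen mentioned (a sketch only; I am not sure which of them apply to my execution engine, and the size thresholds below are just the commonly quoted defaults):

```sql
-- Merge small output files at the end of map-only and map-reduce jobs
SET hive.merge.mapfiles = true;
SET hive.merge.mapredfiles = true;
-- The equivalent switch when running on Tez
SET hive.merge.tezfiles = true;
-- If the average output file size falls below this, run an extra merge pass...
SET hive.merge.smallfiles.avgsize = 16000000;
-- ...combining files up to roughly this size per merge task
SET hive.merge.size.per.task = 256000000;
```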