I created an ORC table in Hive (saved in HDFS path /apps/hive/warehouse/mydb.db/mytable). As I need to add some rows manually sometimes, I call the INSERT statement. This creates many small files in the table directory in HDFS, which is the expected behavior.
Now I run the command
ALTER TABLE mydb.mytable CONCATENATE;
to merge these small files into bigger files. What I observe is that sometimes all small files are merged into one big file (~80 MB), and sometimes I end up with the big file plus some leftover small files (a few KB each) that do not seem to have been merged.
Is this normal behavior of the CONCATENATE command? Is there a way to influence this behavior (to avoid having these small files sometimes after the Concatenate command)?
When merging Hive ORC files, the files are merged according to the ORC stripe size rather than being matched against the block size.
The property that controls this is hive.merge.orcfile.stripe.level. When it is set to true, the merge happens at stripe level; when it is set to false, the files are merged at file level. Parameters that affect the file-level merge are:
hive.merge.tezfiles=true
hive.merge.mapfiles=true
hive.merge.size.per.task=256000000
hive.merge.smallfiles.avgsize=16000000
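As a minimal sketch, these settings could be applied for the current session before running the merge. The values are the ones listed above and the table name is the one from the question. Setting hive.merge.orcfile.stripe.level to false here is an assumption (it selects the file-level merge), and whether each property is honored by CONCATENATE may depend on the Hive version, so treat this as something to try rather than a guaranteed fix.

```sql
-- Assumption: switch to file-level merge so the size thresholds below apply
SET hive.merge.orcfile.stripe.level=false;

-- Values copied from the list above
SET hive.merge.tezfiles=true;
SET hive.merge.mapfiles=true;
SET hive.merge.size.per.task=256000000;
SET hive.merge.smallfiles.avgsize=16000000;

-- Table name taken from the question
ALTER TABLE mydb.mytable CONCATENATE;
```

SET only changes the property for the current session, so the cluster-wide configuration stays untouched while experimenting.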
For more details, refer to the link.
Also, there are some known limitations related to concatenation. Observe the behavior and the file count when CONCATENATE is run repeatedly, say for 5 iterations.
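One way to watch the file count between iterations is sketched below. It assumes a client that supports the dfs command (the Hive CLI does) and uses the table path from the question. Repeat the pair of statements for each iteration and compare the counts.

```sql
-- One iteration: merge, then count what is left in the table directory
ALTER TABLE mydb.mytable CONCATENATE;

-- Prints directory count, file count, total bytes and the path
dfs -count /apps/hive/warehouse/mydb.db/mytable;
```

If the file count stops decreasing after a few runs while small files remain, that points to one of the limitations rather than to a run that simply needs repeating.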