
Hive CONCATENATE not always merging all small files


I created an ORC table in Hive (stored in HDFS under /apps/hive/warehouse/mydb.db/mytable). Because I sometimes need to add rows manually, I run INSERT statements. These create many small files in the table directory in HDFS, which is the expected behavior.

Now I run the command

ALTER TABLE mydb.mytable CONCATENATE;

to merge these small files into bigger files. What I observe is that sometimes all small files are merged into one big file (~80 MB), and sometimes I end up with the big file plus some leftover small files (a few KB each) that apparently were not merged.

Is this normal behavior for the CONCATENATE command? Is there a way to influence it, so that these small files are not left over after running CONCATENATE?

Thank you!


Re: Hive CONCATENATE not always merging all small files

@Daniel Müller

When Hive merges ORC files, it does not match the files against the block size; instead, the files are merged according to the ORC stripe size.

The property that controls this is hive.merge.orcfile.stripe.level. When it is set to true, the merge happens at stripe level; when it is set to false, the files are merged at file level. The parameters that affect the file-level merge are:

hive.merge.tezfiles=true
hive.merge.mapfiles=true
hive.merge.size.per.task=256000000
hive.merge.smallfiles.avgsize=16000000
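
As a rough sketch, the properties above could be set for the current session before running the merge. The values are simply the ones listed above, not tuned recommendations, and setting hive.merge.orcfile.stripe.level to false here is an assumption made to illustrate the file-level path. Whether each setting actually influences CONCATENATE may depend on the Hive version and execution engine.

```sql
-- Session-level settings; they do not change the cluster defaults.
-- Assumption for illustration: switch from stripe-level to file-level merge.
SET hive.merge.orcfile.stripe.level=false;

-- File-level merge parameters, with the values quoted in this reply.
SET hive.merge.tezfiles=true;
SET hive.merge.mapfiles=true;
SET hive.merge.size.per.task=256000000;
SET hive.merge.smallfiles.avgsize=16000000;

-- Table name taken from the question.
ALTER TABLE mydb.mytable CONCATENATE;
```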

For more details, refer to the link.

Also, there are some known limitations related to concatenation. Observe the behavior and the file count when CONCATENATE is run repeatedly, for example over 5 iterations.
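
One way to watch the file count between iterations is sketched below. It assumes the commands are run from the Hive CLI (or another client that accepts the dfs command) and uses the table path given in the question. The second column of the dfs -count output is the number of files under the directory.

```sql
-- Iteration 1: merge, then check how many files remain.
ALTER TABLE mydb.mytable CONCATENATE;
dfs -count /apps/hive/warehouse/mydb.db/mytable;

-- Iteration 2: repeat and compare the file count with the previous run.
ALTER TABLE mydb.mytable CONCATENATE;
dfs -count /apps/hive/warehouse/mydb.db/mytable;
```

Repeating the pair a few more times shows whether the count keeps dropping or settles at a value above one.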