How to merge small ORC files under a Hive external table after the insert is complete?

Hi All,

We have an application that inserts data into Hive every 5 minutes.

The data is inserted into Hive external tables located on S3.

The data is in ORC format.

Each 5-minute insert is small (< 1 MB).

All jobs use Tez sessions.

Because the inserts are this frequent and this small, a large number of small files accumulates in each partition.

Questions:

1. Is there a way to merge these small files in place (without moving the data to another table)?

2. How can these small files be avoided for future inserts?
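
For reference on question 1, the in-place option that Hive documents for ORC tables is ALTER TABLE ... CONCATENATE, which merges a partition's small files at the stripe level without copying the data to another table. A minimal sketch, assuming a hypothetical table `events` partitioned by `dt` (neither name comes from the setup described here). Whether this is permitted on external tables depends on the Hive version, and it has not been verified against S3 in this context:

```sql
-- Hypothetical table and partition names, for illustration only.
-- Merges the small ORC files of a single partition in place.
ALTER TABLE events PARTITION (dt='2018-01-01') CONCATENATE;
```

The statement would need to be run per partition, and may need to be repeated if one pass still leaves several files behind.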

Configuration:

hive.merge.mapfiles=true

hive.merge.mapredfiles=true

hive.merge.tezfiles=true
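
The three flags above only enable the merge step; a few related properties decide when it actually runs. A sketch of the equivalent session-level settings, where the threshold values are what I understand to be the Hive defaults and are an assumption, not values confirmed on this cluster:

```sql
-- Enable the post-job merge for map-only, map-reduce, and Tez jobs (as listed above).
SET hive.merge.mapfiles=true;
SET hive.merge.mapredfiles=true;
SET hive.merge.tezfiles=true;

-- Assumed default: a merge job is triggered when the average output
-- file size of the job is below this many bytes (about 16 MB).
SET hive.merge.smallfiles.avgsize=16000000;
-- Assumed default: merged files target roughly this size (about 256 MB).
SET hive.merge.size.per.task=256000000;
-- Merge ORC files at the stripe level instead of rewriting rows.
SET hive.merge.orcfile.stripe.level=true;
```

These settings appear to apply only to the files written by the current job, so files left behind by earlier inserts would not be merged by them.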

Versions:

HDP-2.6.5.1175

Hive 1.2.1000