We have an application that inserts data into Hive every 5 minutes.
The data is inserted into Hive external tables that are located on S3.
The data is in ORC format.
Each 5-minute insert is small (< 1 MB).
All jobs use Tez sessions.
Because of the insert frequency and the small size of each insert, a lot of small files are created in the partitions.
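To give a rough sense of how quickly the files pile up, here is a back-of-the-envelope estimate. It assumes each insert writes exactly one file, which is only an assumption (a Tez job may write more than one file per insert), so the real count could be higher.

```python
# Rough estimate of how many small files accumulate per day.
INSERT_INTERVAL_MINUTES = 5   # insert frequency described above
FILES_PER_INSERT = 1          # assumption: each insert writes a single file
MAX_FILE_SIZE_MB = 1          # upper bound described above (< 1 MB per insert)


def files_per_day(interval_minutes: int, files_per_insert: int) -> int:
    """Number of files written in 24 hours at the given insert interval."""
    inserts_per_day = (24 * 60) // interval_minutes
    return inserts_per_day * files_per_insert


daily_files = files_per_day(INSERT_INTERVAL_MINUTES, FILES_PER_INSERT)
print(f"{daily_files} files per day, under {daily_files * MAX_FILE_SIZE_MB} MB in total")
```

Under that assumption this comes to 288 files a day holding less than 288 MB combined, so the same data would fit in a handful of reasonably sized ORC files.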
1. Is there a way to merge these small files in place (without moving the data to another table)?
2. How can these small files be avoided for future inserts?