Assume updates, inserts and deletes are replicated into Hadoop file system in a flat text files via ORACLE GoldenGate and there is a partition folder for each day. We have an external Hive table based on the files ingested, that can be used to run queries on it. As far as getting the upto date data is concerned, this method is fine. However, we have an issue with this:
1) The files keeps growing and this will increase the storage utilization, potentially cause the capacity problems
2) As more files to scan, the Hive query performance will decrease eventually. Also utilize additional resources for the map reduce jobs.
The above asks for a Compaction, unfortunately this is something not addressed by GoldenGate. I see that the Hive internal compaction (minor/major) supports only on ORC format and that external tables cannot be made ACID tables since the changes on external tables are beyond the control of the compactor. It appears to me that custom compaction process is our best option.
I do not want to use GoldenGate - Kafka integration (with Log compaction feature) as this requires to retrieving the data from stream each time to build the latest image of the table and making it accessible via Hive.
Can someone please advise on the custom compaction and if there are any better alternatives using the above GoldenGate replication?