Created 03-06-2017 06:57 AM
Assume updates, inserts and deletes are replicated into Hadoop file system in a flat text files via ORACLE GoldenGate and there is a partition folder for each day. We have an external Hive table based on the files ingested, that can be used to run queries on it. As far as getting the upto date data is concerned, this method is fine. However, we have an issue with this:
1) The files keeps growing and this will increase the storage utilization, potentially cause the capacity problems
2) As more files to scan, the Hive query performance will decrease eventually. Also utilize additional resources for the map reduce jobs.
The above asks for a Compaction, unfortunately this is something not addressed by GoldenGate. I see that the Hive internal compaction (minor/major) supports only on ORC format and that external tables cannot be made ACID tables since the changes on external tables are beyond the control of the compactor. It appears to me that custom compaction process is our best option.
I do not want to use GoldenGate - Kafka integration (with Log compaction feature) as this requires to retrieving the data from stream each time to build the latest image of the table and making it accessible via Hive.
Can someone please advise on the custom compaction and if there are any better alternatives using the above GoldenGate replication?
Created 06-15-2017 10:52 AM
Hello Kumar,
Configuring the HDFS Handler to write to many HDFS files due to many source replication tables or extensive use of partitioning can result in degraded performance. Oracle GoldenGate does not support DDL replication for all database implementations. You should consult the Oracle GoldenGate documentation for their database implementation to understand if DDL replication is supported. This may not be an issue if you have a steady stream of Replication data and do not require low levels of latency for analytic data from HDFS.
Thanks.
Created 06-15-2017 10:52 AM
Hello Kumar,
Configuring the HDFS Handler to write to many HDFS files due to many source replication tables or extensive use of partitioning can result in degraded performance. Oracle GoldenGate does not support DDL replication for all database implementations. You should consult the Oracle GoldenGate documentation for their database implementation to understand if DDL replication is supported. This may not be an issue if you have a steady stream of Replication data and do not require low levels of latency for analytic data from HDFS.
Thanks.