I am looking for a storage solution for a CDC Data River. The data will be periodically dumped from a DB into the storage (e.g. HDFS) and passed through the storage from one application to another. There might be multiple applications that process the data in sequence.
Users might want to query the data at each step. They will mostly be interested in the most recent state of the data and will run simple queries, but they might also want to review the state of the data as of some date in the past.
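To make the "as of some date" requirement concrete, here is a minimal in-memory sketch. The `snapshots` history, its dates and rows, and the `state_as_of` helper are all made up for illustration; this is not the API of any particular storage engine.

```python
from bisect import bisect_right
from datetime import date

# Hypothetical snapshot history: (snapshot_date, {primary_key: row}),
# kept sorted by snapshot_date. Dates and rows are invented for illustration.
snapshots = [
    (date(2019, 1, 1), {1: "alice"}),
    (date(2019, 1, 2), {1: "alice", 2: "bob"}),
    (date(2019, 1, 5), {1: "alicia", 2: "bob"}),
]

def state_as_of(history, as_of):
    """Return the latest snapshot taken on or before `as_of`, or None."""
    dates = [snapshot_date for snapshot_date, _ in history]
    idx = bisect_right(dates, as_of)  # number of snapshots dated <= as_of
    return history[idx - 1][1] if idx else None

latest = snapshots[-1][1]                        # the common "recent state" query
past = state_as_of(snapshots, date(2019, 1, 3))  # the occasional "as of" query
```

Whatever storage is chosen, it would need to answer both kinds of lookup: the cheap "latest" one and the less frequent historical one.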
To reduce the load on the applications, I am considering processing only the `diff` of the data from the previous run (since I have multiple snapshots of the same table, I expect they will not differ much).
I am also considering a Near Real Time flow, in which I can get the data with a latency of minutes.
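What I mean by the `diff` can be sketched as follows, assuming each snapshot fits in memory as a dict keyed by primary key. The `diff_snapshots` name and the sample rows are hypothetical; a real pipeline would compute this in the storage or processing layer.

```python
def diff_snapshots(previous, current):
    """Compare two snapshots of the same table, keyed by primary key.

    Returns (inserted, updated, deleted) dicts of rows.
    """
    inserted = {k: v for k, v in current.items() if k not in previous}
    deleted = {k: v for k, v in previous.items() if k not in current}
    updated = {
        k: v
        for k, v in current.items()
        if k in previous and previous[k] != v
    }
    return inserted, updated, deleted

# Hypothetical sample rows, for illustration only.
previous = {1: {"name": "alice"}, 2: {"name": "bob"}, 3: {"name": "carol"}}
current = {1: {"name": "alice"}, 2: {"name": "bobby"}, 4: {"name": "dave"}}

inserted, updated, deleted = diff_snapshots(previous, current)
```

Downstream applications would then only need to process `inserted`, `updated` and `deleted` instead of the full snapshot.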
Looking at solutions available today, I found Apache Hudi and Databricks Delta matching the requirements closely.
I would like to know if the Hortonworks distribution contains a competing tool that I have not come across so far.
Here are my thoughts on competitors from Hortonworks.
Using Hive Transactional tables:
1. If you are getting a full dump every time, you can try Hive's MERGE functionality (available in the Hortonworks distribution), with which the data can be available in less than a minute (depending on how much data is scanned, cluster resources, etc.).
2. If you only care about the latest version of each record, you can use HBase to handle all the updates (but scans on anything other than the row key will perform poorly), and use Phoenix on top of HBase to get SQL on top of the NoSQL table.
Both approaches will serve for updating the existing data, but only the latest version of each record will be available.
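The "latest version only" behaviour of both approaches can be illustrated with a small in-memory sketch. This only emulates merge/upsert semantics in plain Python; it is not Hive, HBase or Phoenix code, and the `merge_upsert` name and sample rows are hypothetical.

```python
def merge_upsert(target, source):
    """Apply `source` rows onto `target` in place, keyed by primary key.

    Matched keys are overwritten (update), unmatched keys are added (insert).
    Returns (updated_count, inserted_count).
    """
    updated = inserted = 0
    for key, row in source.items():
        if key in target:
            updated += 1   # matched: the old version of the row is replaced
        else:
            inserted += 1  # not matched: a new row is added
        target[key] = row
    return updated, inserted

# Hypothetical table contents and incoming batch, for illustration only.
table = {1: "alice", 2: "bob"}
batch = {2: "bobby", 3: "carol"}

counts = merge_upsert(table, batch)
```

After the merge the old value for key 2 is gone, which is why neither approach by itself answers "as of some date in the past" queries.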
Refer to this and this link for more details about these approaches.