Hello,
I am looking for a storage solution for a CDC Data River. The data will be periodically dumped from a DB into the storage (e.g. HDFS) and passed through the storage from one application to another. Multiple applications might process the data in sequence.
Users might want to query the data at each step. They will mostly be interested in the most recent state of the data and will run simple queries, but they might also want to review the state of the data as of some date in the past.
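To make the "as of some date" requirement concrete, here is a minimal sketch of the lookup I have in mind. It assumes the snapshots are tracked as a list of (snapshot date, label) pairs sorted by date. The function name `snapshot_as_of` and this layout are only my illustration, not the API of any particular tool.

```python
from bisect import bisect_right
from datetime import date
from typing import List, Optional, Tuple


def snapshot_as_of(snapshots: List[Tuple[date, str]], as_of: date) -> Optional[str]:
    """Return the label of the latest snapshot taken on or before `as_of`.

    `snapshots` must be sorted by date in ascending order (assumption).
    Returns None when no snapshot exists on or before the requested date.
    """
    dates = [taken_on for taken_on, _ in snapshots]
    # bisect_right gives the number of snapshots with date <= as_of.
    idx = bisect_right(dates, as_of)
    return snapshots[idx - 1][1] if idx > 0 else None
```

A query for the recent state would be the same lookup with today's date.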
To reduce the load on the applications, I am considering processing only the `diff` of the data relative to the previous run (since I have multiple snapshots of the same table, I expect that consecutive snapshots will not differ much).
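As a rough illustration of what I mean by the `diff`, here is a minimal sketch. It assumes each snapshot fits in memory as a dict mapping a primary key to a row. The names `diff_snapshots` and `SnapshotDiff` are hypothetical, and a real implementation would run on the storage layer rather than on in-memory dicts.

```python
from typing import Any, Dict, Hashable, NamedTuple

Row = Dict[str, Any]
Snapshot = Dict[Hashable, Row]  # primary key -> row (assumed layout)


class SnapshotDiff(NamedTuple):
    inserted: Snapshot  # keys present only in the current snapshot
    updated: Snapshot   # keys present in both, with changed row content
    deleted: Snapshot   # keys present only in the previous snapshot


def diff_snapshots(previous: Snapshot, current: Snapshot) -> SnapshotDiff:
    """Compare two full snapshots of the same table, keyed by primary key."""
    inserted = {k: row for k, row in current.items() if k not in previous}
    updated = {
        k: row
        for k, row in current.items()
        if k in previous and previous[k] != row
    }
    deleted = {k: row for k, row in previous.items() if k not in current}
    return SnapshotDiff(inserted, updated, deleted)
```

Downstream applications would then receive only the three change sets instead of the full snapshot.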
I am also considering a Near Real Time flow, in which the data arrives with a latency of minutes.
Among the solutions available today, Apache Hudi and Databricks Delta seem to match these requirements most closely.
I would like to know whether the Hortonworks distribution includes a competing tool that I have missed so far.
Thank you!