Delta alternative
- Labels: Apache Hive
Created 04-25-2019 05:37 PM
Hello,
I am looking for a storage solution for a CDC data river. The data will be periodically dumped from a DB into the storage (e.g. HDFS) and passed through the storage from one application to another. There might be multiple applications that process the data in sequence.
The users might want to query the data at each step. They will mostly be interested in the recent state of the data, triggering simple queries, but they might also want to review the state of the data as of some date in the past.
To reduce the load on the applications, I am considering processing only the `diff` of the data from the previous run (since I have multiple snapshots of the same table, I expect they will not differ too much).
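Roughly the kind of diff I have in mind, as a Hive SQL sketch (table and column names here are just placeholders):

```sql
-- Placeholder tables: snapshot_prev / snapshot_curr are two full dumps of
-- the same source table, keyed by id, with a single payload column.
-- Rows only in the new snapshot are inserts, rows only in the old one are
-- deletes, and rows whose payload changed are updates.
SELECT
  COALESCE(curr.id, prev.id) AS id,
  CASE
    WHEN prev.id IS NULL THEN 'INSERT'
    WHEN curr.id IS NULL THEN 'DELETE'
    ELSE 'UPDATE'
  END AS change_type,
  curr.payload
FROM snapshot_curr curr
FULL OUTER JOIN snapshot_prev prev
  ON curr.id = prev.id
WHERE prev.id IS NULL
   OR curr.id IS NULL
   OR curr.payload <> prev.payload;
```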
I am also considering a Near Real Time flow, in which I can get the data with a latency of minutes.
Looking at the solutions available today, I found Apache Hudi and Databricks Delta to match the requirements closely.
I would like to know if the Hortonworks distribution contains a comparable tool I haven't come across so far.
Thank you!
Created 04-26-2019 01:34 AM
Here are my thoughts on comparable options from Hortonworks.
Using Hive transactional tables:
If you are getting a full dump every time, you can try the Hive MERGE functionality (Hortonworks only), which can make the data available in less than a minute (depending on how much data is scanned, cluster resources, etc.).
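A minimal sketch of such a MERGE, assuming a transactional (ACID, ORC) target table and a staging table loaded from the latest dump; all table and column names are placeholders:

```sql
-- Placeholder tables: customer is the ACID target, customer_staging holds
-- the latest dump plus an op flag ('D' marks rows deleted at the source).
MERGE INTO customer AS t
USING customer_staging AS s
ON t.id = s.id
WHEN MATCHED AND s.op = 'D' THEN DELETE
WHEN MATCHED THEN UPDATE SET name = s.name, updated_at = s.updated_at
WHEN NOT MATCHED THEN INSERT VALUES (s.id, s.name, s.updated_at);
```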
Using HBase:
If you only care about the latest version of each record, HBase can handle all the updates (but scanning on a non-row-key column will not give you good performance); use Phoenix on top of HBase to get SQL on top of the NoSQL table.
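As a rough Phoenix sketch (names are placeholders): the primary key becomes the HBase row key, and UPSERT overwrites the previous version of a row, so only the latest state is kept:

```sql
-- The PRIMARY KEY maps to the HBase row key, so lookups by id are fast;
-- filtering on name instead would trigger a full scan.
CREATE TABLE IF NOT EXISTS customer (
  id   BIGINT NOT NULL PRIMARY KEY,
  name VARCHAR
);

-- UPSERT replaces any existing version of the row with the new values.
UPSERT INTO customer (id, name) VALUES (42, 'Alice');

-- Point lookup on the row key.
SELECT name FROM customer WHERE id = 42;
```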
Both approaches serve to update the existing data and make only the latest version of each record available.
Refer to this and this for more details about these approaches.
Using Druid:
Refer to this link for Druid.
It would be great if you could comment on which approach performed better, or which one you chose for this case 🙂
