Created on 07-28-2020 11:33 AM - edited 09-16-2022 01:45 AM
The Cloudera Data Platform (CDP) comes with many places to store your data, and it can be challenging to know which one to use. Though there is no formal decision tree, I hereby share the key considerations from my personal perspective. They can be visualized like this:
The exact kind of storage to be used will mostly be defined by your environment. In a classical cluster, HDFS is available. In the public cloud, each provider's object store will be leveraged, and on-premises, Ozone will serve as the object store.
If you want to work with a table and need to store it as such, store your data as a table, even if this forces you to think about how to implement the ingest in a sensible way. Kudu is great for fast insights, whereas Hive tables (which in turn can be of different formats) can offer unlimited scale. Note that Hive tables (registered in the Hive Metastore) can be accessed via different means, including the Hive engine and the Impala engine.
Druid is able to aggregate data upon ingestion.
Kafka and HBase are both great places to put 'many tiny things', for instance, individual transactions. Kafka offers great throughput and latency, but despite commonly used marketing messages, it is not a database and does not scale well for historical data. If you want to serve data granularly for a longer period of time, HBase is a great fit.
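The considerations above can be sketched as a small function. This is only an illustrative sketch, not a formal decision tree: the `Workload` fields, the `suggest_store` name, and the order in which the questions are asked are my own assumptions, and real choices will depend on more than these few flags.

```python
from dataclasses import dataclass


@dataclass
class Workload:
    """Hypothetical description of what you want to store (illustrative only)."""
    tabular: bool = False              # the data is naturally a table
    needs_fast_insights: bool = False  # fast insights matter more than scale
    aggregate_on_ingest: bool = False  # data should be aggregated upon ingestion
    many_tiny_records: bool = False    # e.g. individual transactions
    long_term_serving: bool = False    # granular serving over a longer period


def suggest_store(w: Workload) -> str:
    """Return a store suggestion based on the considerations in this post.

    The ordering of the checks is an assumption made for this sketch.
    """
    if w.aggregate_on_ingest:
        return "Druid"
    if w.many_tiny_records:
        # Kafka for throughput and latency, HBase for longer-lived granular data.
        return "HBase" if w.long_term_serving else "Kafka"
    if w.tabular:
        # Kudu for fast insights, Hive tables for scale.
        return "Kudu" if w.needs_fast_insights else "Hive table"
    # Otherwise fall back to whatever file or object storage the environment offers.
    return "HDFS, cloud object store, or Ozone"


if __name__ == "__main__":
    print(suggest_store(Workload(many_tiny_records=True, long_term_serving=True)))
```

Treat the output as a starting point for discussion rather than a verdict; several stores are often combined in one pipeline.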
Also, see my related article: Find the right tool to move your data
Full Disclosure & Disclaimer:
I am an employee of Cloudera, but this is not part of the formal documentation of the Cloudera Data Platform. It is purely based on my own experience of advising people in their choice of tooling.
Created on 07-29-2020 08:01 AM
This is a great decision chart.
I would add Flink SQL for querying events in stream and for querying Kafka topics.
If you have time series data, definitely use Druid. If your data is not time series or timestamp driven, do not use Druid; use Kudu instead.
https://druid.apache.org/docs/latest/comparisons/druid-vs-kudu.html
https://druid.apache.org/docs/latest/comparisons/druid-vs-key-value.html
Created on 07-29-2020 02:27 PM - edited 07-29-2020 02:28 PM
Thanks, I will think about refining the distinction between Kudu and Druid.
Currently I would not want to include the fact that Flink has state as 'storage', but regarding Flink SQL, I may make another post later about the ways to interact with and access different kinds of data. (As someone also noticed, Impala is not here either, because it is not a store in itself but works with stored data.)