Created on 07-28-202011:33 AM - edited 09-16-202201:45 AM
The Cloudera Data Platform (CDP) comes with many places to store your data, and it can be challenging to know which one to use. Though there is no formal decision tree, I hereby share the key considerations from my personal perspective. They can be visualized like this:
Explanation of each path
a. Have large bulky files, that do not need to be queried > File and object storage
The exact kind of storage to be used will mostly be defined by your environment, in a classical cluster HDFS is available. In the public cloud, each provider object store will be leveraged, and on-premises Ozone will serve as the object-store.
b. Have a table, either from large bulky files, or a set of messages > Hive for scale or Kudu for interaction
If you want to work with a table, and need to store it as such, it is clear you want to store your data as a table. Even if this may force you to think about how to implement the ingest in a sensible way. Kudu is great for fast insights, where hive tables (which in turn can be of different formats) can offer an unlimited scale. Note that Hive tables (registered in the Hive Metastore) can be accessed via different means, including the Hive engine and the Impala engine.
c. Does your table records stream in, but you only need pre-aggregates > Druid
Druid is able to aggregate data upon ingestion.
d. Are you working with messages or small files > Kafka for latency or HBase for retention
Kafka and Hbase are both great places to put 'many tiny things', for instance, individual transactions. Kafka offers great throughput and latency, but despite commonly used marketing messages, it is not a database and does not scale well for historical data. If you want to serve data granularly for a longer period of time, Hbase is a great fit for this.
When working in the cloud, it is often desirable to work with object stores where possible to keep costs down. The good news is that CDP Public Cloud comes with cloud native capabilities. As such several storage solutions, such as Hive, actually store the data in cloud object stores.
It is possible that more than one road applies to your data. For instance, a message may require very low latency in the first few days, but also needs to be retained for several years. In such cases, it often makes sense to store a subset of the data in two places.
I did not include other solutions that could store data, such as solr, or in-application state. The reason is that the primary function of these is not storage, but search and processing respectively. I also did not include Impala as it is an engine, Hive is only on this chart to represent its storage capabilities.
This is a basic decision tree, it should cover most situations but do not hesitate to deviate if your situations ask for this.