03-10-2023 12:07 AM
For on-prem Cloudera setups where data is stored on local disks, we tag a disk as archive storage and assign HDFS storage policies accordingly. On Cloud, however, HDFS is advised to be used only as a temporary location and data is mostly stored on cloud object storage; for both Hive managed and external tables the location is cloud object storage. So is archiving done simply by applying archive policies on the object storage itself? Cloudera's docs only mention the on-prem way, which relies on additional disks. Archive link: https://docs.cloudera.com/runtime/7.2.10/scaling-namespaces/topics/hdfs-configure-archival-storage.html
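For reference, a minimal sketch of the on-prem flow referenced above; the mount points, paths, and policy choice are assumptions for illustration and depend on your cluster layout:

# 1) In hdfs-site.xml, tag the archive disk in dfs.datanode.data.dir, e.g.:
#      <property>
#        <name>dfs.datanode.data.dir</name>
#        <value>[DISK]/data/1/dn,[ARCHIVE]/archive/1/dn</value>
#      </property>

# 2) Assign a storage policy to the directory that should live on archive disks
#    (path and policy name here are hypothetical examples).
hdfs storagepolicies -setStoragePolicy -path /warehouse/tablespace/archive -policy COLD

# 3) Confirm the policy, then move existing blocks to satisfy it.
hdfs storagepolicies -getStoragePolicy -path /warehouse/tablespace/archive
hdfs mover -p /warehouse/tablespace/archive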
Labels:
- Cloudera Data Platform (CDP)
- HDFS
03-02-2023 06:44 AM
1 Kudo
@fahed What you see with the CDP Public Cloud Data Hubs using GCS (or any object store) is a modernization of the platform around object storage. This removes differences across AWS, Azure, and on-prem (when Ozone is used). It is a change driven by customer demand, so that workloads can be built and deployed with minimal changes from on-prem to cloud or cloud to cloud. Unfortunately that creates the difference you describe above, but those are trade-offs we accept in favor of a modern data architecture. If you are looking for performance, take a look at some of the newer database options: Impala and Kudu (the latter uses local disk). We also have Iceberg coming into this space.
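If the goal is archiving data that already sits in object storage, one illustrative direction (not a prescribed Cloudera procedure, just a sketch) is to let the object store's own lifecycle rules handle tiering instead of HDFS storage policies. The bucket name, prefix, and age below are assumptions; GCS is shown since the thread mentions it, and the other clouds have equivalent lifecycle rules:

# Archive objects under a given prefix after 90 days via a GCS lifecycle rule.
cat > lifecycle.json <<'EOF'
{
  "rule": [
    {
      "action": {"type": "SetStorageClass", "storageClass": "ARCHIVE"},
      "condition": {"age": 90, "matchesPrefix": ["warehouse/archive/"]}
    }
  ]
}
EOF

# Apply the rule to the bucket backing the Data Hub / Data Lake, then verify.
gsutil lifecycle set lifecycle.json gs://my-datalake-bucket
gsutil lifecycle get gs://my-datalake-bucket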
02-23-2023 05:09 AM
1 Kudo
@fahed That size is there so the cluster can grow and serve in a production manner; at first the disk usage may be low. For Data Hubs, my recommendation is to start small and grow as needed. Most of your workload data should be in the object store(s) for the Data Hubs, so don't think of that "hdfs" disk as being size-constrained by the initial creation of the hub.