Member since 02-21-2023 · 4 Posts · 0 Kudos Received · 0 Solutions
03-10-2023
12:07 AM
For on-prem Cloudera setups where data is stored on local disks, we tag a disk as ARCHIVE storage and assign HDFS storage policies to move cold data onto it (a sketch of that on-prem setup is below). But on Public Cloud, HDFS is advised to be used only as a temporary place, and data is mostly stored on cloud object storage; for both Hive managed and external tables, the table location is on cloud object storage. So is the way to archive simply to apply archive/lifecycle policies on the object storage itself? Cloudera's docs only describe the on-prem approach, which relies on an additional disk. Archive link: https://docs.cloudera.com/runtime/7.2.10/scaling-namespaces/topics/hdfs-configure-archival-storage.html
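For reference, this is roughly what the on-prem approach looks like (a minimal sketch based on the linked doc; the /archive01 mount point and /data/cold path are placeholders, not from a real cluster):

# hdfs-site.xml on the DataNode: tag one data dir as ARCHIVE storage
<property>
  <name>dfs.datanode.data.dir</name>
  <value>[DISK]/hadoop/dfs/data,[ARCHIVE]/archive01/dfs/data</value>
</property>

# then pin a directory to archive storage and move its existing blocks
hdfs storagepolicies -setStoragePolicy -path /data/cold -policy COLD
hdfs mover -p /data/cold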
Labels:
- Cloudera Data Platform (CDP)
- HDFS
03-02-2023
03:13 AM
On Cloudera Public Cloud the storage layer is GCS, so Hive tables and any inserted data are stored on GCS rather than on local disks (as they would be on-prem). This adds overhead: when a job runs, the NodeManager has to pull the data from GCS and cache it locally until the job finishes, which eventually hurts performance on the environment. Is there any way to move the data location from GCS to local disks on Public Cloud clusters (a rough sketch of what I mean is below)? As written in the docs, the Data Hub HDFS/disk space is a temporary place, but I would accept that risk in favor of performance. @steven-matison I would really appreciate your help on this question. Thanks in advance.
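To be concrete, what I have in mind is something like this (a hypothetical example; the table name, HDFS path, and JDBC URL are made up, and I understand the data would then live only on the ephemeral Data Hub disks):

# create a test table whose data lives on the Data Hub's local HDFS instead of GCS
beeline -u "<hive-jdbc-url>" -e "
  CREATE EXTERNAL TABLE perf_test_local (id INT, payload STRING)
  STORED AS ORC
  LOCATION 'hdfs:///warehouse/perf_test_local';
"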
02-23-2023
02:07 AM
@steven-matison Thank you for your explanation. I came across some resources that mention the HDFS service inside the Data Lake is used mainly for backups. But apart from that, it looks like mostly idle disk capacity, since it uses 1.4 TB SSD disks that are largely empty. I see vertical scaling for Data Lake resources is available, but it is not supported for GCP, and going with the medium scale is pretty much a must since that is what Cloudera recommends for production environments. Apart from the above, are there any recommendations for the HDFS disk sizes inside the Data Hub? How large should they be?
02-21-2023
02:21 AM
To get started with CDP Public Cloud we have to create an environment, which comes with a Data Lake, and if we want to run any jobs/workloads on top of it we have to create Data Hubs based on the services we need. Inside the Data Hub cluster there is an HDFS service that serves as an intermediate place to store data while jobs run, before it is eventually stored/moved to cloud object storage. But there is another HDFS service inside the Data Lake: what is it used for? Its total capacity is 1.4 TB of SSD (see the usage check below), so how can we utilize it?
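For context, this is how I am looking at that capacity and usage (standard HDFS CLI run against the Data Lake's HDFS; nothing environment-specific assumed):

# report configured capacity, used and remaining space per DataNode
hdfs dfsadmin -report
# or a quick summary of the whole filesystem
hdfs dfs -df -h /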
Labels:
- Cloudera Data Platform (CDP)
- HDFS