
CDP Public Cloud Datalake HDFS usage


To get started with CDP Public Cloud, we have to create an environment that includes a Data Lake, and if we want to run any jobs/workloads on top of it, we have to create Data Hubs based on the services we need.

 

Inside the Data Hub cluster there is an HDFS service that serves as an intermediate place to store data while it is being worked on, before it is eventually stored/moved to cloud object storage. But there is another HDFS service inside the Data Lake. What is it used for? It has a total capacity of 1.4 TB of SSD; how can we utilize it?

1 ACCEPTED SOLUTION


@fahed The HDFS service inside of the Data Lake supports the environment and its services, for example Atlas, Ranger, Solr, and HBase. Its size is based on the environment's scale.

 

You are correct in the assumption that your end-user HDFS service is part of the Data Hubs deployed around the environment. You should not try to use the environment's HDFS service for applications and workloads that belong in the Data Hubs.
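If you just want to see what the Data Lake's HDFS capacity is actually being used for, you can inspect it directly with the standard HDFS CLI. Here is a minimal sketch, assuming you can reach a Data Lake node with the `hdfs` client on the PATH and a valid Kerberos ticket; the `run_hdfs` helper is only for illustration, not part of CDP:

```python
import subprocess

def run_hdfs(args):
    # Run an HDFS CLI command and return its output.
    result = subprocess.run(["hdfs"] + args, capture_output=True, text=True, check=True)
    return result.stdout

# Overall capacity and usage of the Data Lake's HDFS (DataNode report).
print(run_hdfs(["dfsadmin", "-report"]))

# Per-directory usage at the root, which typically shows the service data
# (e.g. HBase/Atlas, Ranger audit logs, Solr) that the Data Lake HDFS holds.
print(run_hdfs(["dfs", "-du", "-h", "/"]))
```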


3 REPLIES



@steven-matison Thank you for your explanation. I came across some resources that mention the HDFS service inside the Data Lake is used for backups, but apart from that it is mostly idle disk capacity, since it uses 1.4 TB of SSD disks that are largely empty. I see vertical scaling for Data Lake resources here, but it's not supported for GCP, and going with the medium scale is pretty much a must since it's what Cloudera recommends for production environments.
Apart from the above, are there any recommendations for the HDFS disk sizes inside the Data Hub? How large should they be?



@fahed That size is there so the Data Lake can grow and serve in a production manner. At first, that disk usage could be low.

 

For Data Hubs, my recommendation is to start small and grow as needed. Most of your workload data should be in object store(s) for the Data Hubs, so don't think of that HDFS disk as being size-constrained by the initial creation of the hub.
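In practice, keeping workload data in object storage just means pointing your jobs at a cloud storage path rather than an `hdfs://` path. A minimal PySpark sketch, assuming the Data Hub is already configured for your cloud provider's connector and using a placeholder GCS bucket name (`my-bucket`) and paths that are not from this thread:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("object-store-example").getOrCreate()

# Read from and write to the cloud object store directly (gs:// on GCP,
# s3a:// on AWS, abfs:// on Azure); the Data Hub's HDFS is then only needed
# for transient/intermediate data.
df = spark.read.parquet("gs://my-bucket/raw/events/")
df.filter(df.event_type == "click") \
  .write.mode("overwrite") \
  .parquet("gs://my-bucket/curated/clicks/")

spark.stop()
```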