Changing data location from GCS to local disks on CDP public cloud

On Cloudera Public Cloud, the storage layer is GCS, so Hive tables and any inserted data are stored on GCS rather than on local disks (as they would be on-prem). This adds overhead: when a job runs, the NodeManager has to fetch the data from GCS and cache it locally until the job finishes, which hurts performance in the environment. Is there any way to move the data location from GCS to local disks on Public Cloud clusters?
According to the docs, Data Hub HDFS/disk space is temporary storage, but I would accept that risk in favor of performance.

@steven-matison would really appreciate your help on this question. Thanks in advance.

1 ACCEPTED SOLUTION

@fahed What you see with CDP Public Cloud Data Hubs using GCS (or another object store) is a modernization of the platform around object storage. This removes differences across AWS, Azure, and on-prem (when Ozone is used). The change was driven by customer demand, so that workloads can be built and deployed with minimal changes from on-prem to cloud or from cloud to cloud. Unfortunately, it also creates the difference you describe above, but those are risks we are willing to take in favor of a modern data architecture.

If you are looking for performance, take a look at some of the newer database options: Impala and Kudu (Kudu uses local disk). Iceberg is also coming into this space.
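
To make the Kudu suggestion a bit more concrete, here is a minimal sketch of what a Kudu-backed table definition could look like when created through Impala. It assumes a cluster where both the Impala and Kudu services are running. The helper `kudu_table_ddl`, the table name `sales_kudu`, and the columns are hypothetical and only for illustration; the partition count is an arbitrary placeholder, not a tuning recommendation.

```python
def kudu_table_ddl(table, columns, primary_key, hash_partitions=4):
    """Build an Impala CREATE TABLE statement for a Kudu-backed table.

    table           -- table name (hypothetical in the example below)
    columns         -- list of (column_name, impala_type) pairs
    primary_key     -- list of column names forming the Kudu primary key
    hash_partitions -- number of hash partitions (placeholder value)
    """
    types = dict(columns)
    missing = [k for k in primary_key if k not in types]
    if missing:
        raise ValueError(f"primary key columns not defined: {missing}")

    # Kudu expects the primary key columns to come first, in key order.
    ordered = list(primary_key) + [n for n, _ in columns if n not in primary_key]
    col_lines = [f"  {name} {types[name]}," for name in ordered]
    key_list = ", ".join(primary_key)

    return "\n".join(
        [
            f"CREATE TABLE {table} (",
            *col_lines,
            f"  PRIMARY KEY ({key_list})",
            ")",
            f"PARTITION BY HASH ({key_list}) PARTITIONS {hash_partitions}",
            "STORED AS KUDU;",
        ]
    )


if __name__ == "__main__":
    # Hypothetical table; paste the printed statement into impala-shell or Hue.
    print(
        kudu_table_ddl(
            "sales_kudu",
            [("amount", "DOUBLE"), ("id", "BIGINT"), ("region", "STRING")],
            ["id"],
        )
    )
```

Data for such a table is held on the Kudu tablet servers' local disks rather than in GCS, so it is subject to the same caveat about Data Hub disks being temporary that you mention above.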
