
Cloudbreak Data Lake without Cloud Storage

Hi all,

Cloudbreak has a nice option to deploy a so-called "Data Lake" and attach ephemeral workload clusters to it.
However, this option requires Cloud Storage (on AWS, Azure or Google). Instead, we want to deploy a "Production Data Lake" on our on-premise OpenStack cloud that provides the storage for HDFS and also executes our production workloads. We would then like to attach "Test Clusters" to this production cluster, where we can run test workloads that access the data in the production cluster (aka Data Lake), so that we do not have to copy data from one cluster to the other.

Is there a way to set this up with Cloudbreak? To be clear, we do not want to use Cloud Storage from AWS, Azure or Google.

Any ideas or hints?

Thanks a lot!


Expert Contributor

Hi @Alexander Schätzle,

You can try to create one with the CLI by removing the storage part from the JSON. However, we haven't tested this configuration, and we use the cloud storage to share data between the datalake and the workload clusters. Maybe you can set up the clusters to use swift3 over S3, but this is also not tested.
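A minimal sketch of what "removing the storage part from the JSON" could look like, assuming the cluster template is a plain JSON file. The key name `cloudStorage` and the sample template below are assumptions for illustration only, not the confirmed Cloudbreak schema, so check your own template for the actual name of the storage section.

```python
import json


def drop_key(node, key):
    """Return a copy of a parsed JSON value with every occurrence of `key` removed."""
    if isinstance(node, dict):
        return {k: drop_key(v, key) for k, v in node.items() if k != key}
    if isinstance(node, list):
        return [drop_key(item, key) for item in node]
    return node


# Hypothetical template: a real Cloudbreak template has more fields,
# and its storage section may use a different key name.
template = json.loads("""
{
  "general": {"name": "datalake-test"},
  "cluster": {
    "cloudStorage": {"example": "placeholder"},
    "ambari": {"blueprintName": "example-blueprint"}
  }
}
""")

# "cloudStorage" is an assumed key name; replace it with the one in your template.
stripped = drop_key(template, "cloudStorage")
print(json.dumps(stripped, indent=2))
```

The stripped JSON could then be written back to a file and passed to the CLI. As noted above, this configuration is untested.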

Hi @mmolnar,

is the cloud storage also used to actually store the data, for example the data stored in Hive? My interpretation of a data lake is that it holds the actual data, not only metadata and shared security services. But that would mean that in the "data lake" setup option of Cloudbreak, data is actually not stored in HDFS and not on-prem but in the cloud. Is that correct?

Thx for your help!

Expert Contributor

Hi @Alexander Schätzle,

CB sets up these properties using cloud storage for both the datalake and the workload cluster:

HDFS is not shared between the Datalake and Workload clusters, so you can only access resources that are stored on cloud storage.

Hi @mmolnar,

would this actually mean that workload clusters on our own on-prem OpenStack cloud would have to access or transfer the data from the cloud storage for processing? This doesn't sound feasible.

We want to run the data lake and workload clusters in our on-prem OpenStack cloud. It does not make sense in this case to store all the data in a public cloud and transfer it on-prem for processing. It sounds to me as if the data lake deployment option of CB is intended more for AWS, Azure and Google and is not really suitable for on-prem OpenStack clouds. Would you agree with that?

Expert Contributor

Hi @Alexander Schätzle,

as we currently do not support the OpenStack cloud storage solution (Swift), it doesn't make sense to use a Datalake on-prem. If you check the documentation, it defines cloud storage as a prerequisite:

Hi mmolnar,

OK, thanks for the clarification. It would then make sense to note in the documentation that the Data Lake deployment option is currently only suitable for AWS, Azure and Google, but not for OpenStack.

Expert Contributor

@Dominika Bialek

could you add this issue to the documentation?

@mmolnar Thanks for tagging me. I am adding a note to the docs that the data lake deployment option is currently only suitable for AWS, Azure and Google, but not for OpenStack.