Created on 06-29-2018 10:12 PM
In this tutorial, I go through the steps of creating a data lake instance and creating a cluster attached to the data lake. I show you how to use the Cloudbreak web UI and CLI to perform these steps.
Assumptions:
This post has been updated to reflect the latest functionality available in Cloudbreak 2.9.0.
A data lake provides a way for you to centrally apply and enforce authentication, authorization, and audit policies across multiple ephemeral workload clusters. "Attaching" your workload cluster to the data lake instance allows the attached cluster workloads to access data and run in the security context provided by the data lake.
In general, setting up a data lake involves the following steps:
When creating a data lake, you can choose one of the two available blueprints.
The following data lake blueprints are provided by default in Cloudbreak:
Depending on your use case, select one of these blueprints.
Meet the following prerequisites outside of Cloudbreak:
In the steps that follow, you will be required to provide the information related to these external resources.
Prior to creating a data lake, you must register the following resources in the Cloudbreak web UI:
As an outcome of this step, you should have two or three external databases and one authentication source registered in the Cloudbreak web UI:
Create a data lake by using the create cluster wizard. Among other settings, make sure to provide the information listed in the steps below.
1. In the Cloudbreak web UI, navigate to Clusters and click Create Cluster.
2. On the General Configuration page:
3. (Only if using the HA blueprint) On the Hardware and Storage page, you can select the following for each host group:
4. (Only if using the HA blueprint) On the Network and Availability page, enter:
5. On the Cloud Storage page:
6. On the External Sources page, select the previously registered Ranger database, Hive database, and LDAP.
If using the HA blueprint, also select the previously registered Ambari database.
7. On the Gateway Configuration page, the gateway is enabled by default with Ambari exposed through it. You should also enable Ranger by selecting the Ranger service and clicking Expose.
8. On the Network Security Groups page, you do not need to change anything. If you would like to restrict the open ports, refer to Default cluster security groups.
9. On the Security page:
10. Click Create Cluster to initiate data lake creation.
As an outcome of this step, you should have a running data lake.
Once your data lake is running, you can start creating clusters attached to the data lake. Follow these general steps to create a cluster attached to a data lake.
In general, once you've selected the data lake that the cluster should be using, the cluster wizard should provide you with the cluster settings that should be used for the attached cluster.
1. In the Cloudbreak web UI, click the cluster tile representing your data lake.
2. From the ACTIONS menu, select CREATE ATTACHED CLUSTER.
3. The cluster wizard pre-populates the settings that should be used for the attached cluster. Still, make sure to do the following:
4. Click CREATE CLUSTER to initiate cluster creation.
As an outcome of this step, you should have a running cluster attached to the data lake. Access your attached clusters and run your workloads as normal.
The Cloudbreak CLI makes it possible to perform the same tasks that you can perform from the web UI.
The CLI is available for download from the web UI under Download CLI.
Download it and then access it from that location by using ./cb (or add it to your PATH).
In order to start using it, you must configure it to point to your Cloudbreak instance. For example:
cb configure --server https://my-cb-instanceIP.com --username myuser@example.com --password MySecurePassword1234!
You can use the following CLI commands to register the databases and LDAP:
cb database create
cb ldap create
There are multiple ways to obtain the syntax, but the easiest is to enter the parameters in the Cloudbreak web UI under External Sources > Database Configurations and Authentication Configurations, and then copy the correct commands from the SHOW CLI COMMAND option.
Similar to the previous step, the easiest way to obtain the correct JSON for the data lake is by using the UI. So I recommend that you start by doing the following:
1. Provide the parameters for your data lake in the web UI create cluster wizard, as described in the instructions above.
2. On the last page of the create wizard, click SHOW CLI COMMAND to generate a proper JSON template. Once you've done this once, you can reuse your JSON file.
3. Paste the JSON content into a text editor and save it as a JSON file.
4. Review the JSON. The generated JSON never stores the cluster user password, so you need to add a strong password yourself.
5. Use the following command to create the data lake:
cb cluster create --cli-input-json /path/to/json/mydatalake.json
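Because SHOW CLI COMMAND leaves the password blank, it is easy to forget step 4. The following is a minimal sketch of filling in and checking the password from the shell before submitting. The JSON excerpt is a hypothetical fragment of the generated template (the real file contains many more fields, and the "ambari" section name is an assumption about its layout):

```shell
# Hypothetical, minimal excerpt of the generated template -- the real file has
# many more sections; the "ambari" key here is an assumption for illustration.
cat > mydatalake.json <<'EOF'
{
  "general": { "name": "my-datalake" },
  "ambari": { "userName": "admin", "password": "" }
}
EOF

# SHOW CLI COMMAND never stores the password, so fill it in before submitting.
sed -i 's/"password": ""/"password": "MySecurePassword1234!"/' mydatalake.json

# Sanity check: warn if any password field is still empty.
if grep -q '"password": ""' mydatalake.json; then
  echo "WARNING: empty password remains"
else
  echo "password set"
fi

# Then submit as described above:
# cb cluster create --cli-input-json mydatalake.json
```

This avoids hand-editing the file each time you reuse the template; any scripted substitution (sed, jq, a templating tool) works equally well.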
Once you have a running data lake, the easiest way to obtain the correct JSON for attached clusters is, as in the previous steps, to use the UI:
1. Provide the parameters for your attached cluster in the web UI create cluster wizard, as described in the instructions above.
2. On the last page of the create wizard, click SHOW CLI COMMAND to generate a proper JSON template.
3. Paste the JSON content into a text editor and save it as a JSON file.
4. Review the JSON. The generated JSON never stores the cluster user password, so you need to add a strong password yourself.
5. Use the following command to create the attached cluster:
cb cluster create --cli-input-json /path/to/json/mycluster.json
Once you've done this once, you can reuse your JSON file.
The following parameters are used to reference the data lake to which the cluster is attached:
"sharedService": {"sharedCluster":"test-datalake"},
Update (02-07-2019): Updated for Cloudbreak 2.9. A new HDP 3.1 data lake blueprint is available.