In this tutorial, I go through the steps of creating a data lake instance and creating a cluster attached to the data lake. I cover how to use both the Cloudbreak web UI and the CLI to perform these steps.

Assumptions:

  • These steps assume that you already have a running instance of Cloudbreak and that you have created a Cloudbreak credential that allows you to connect to your cloud provider.
  • All the resources that I am creating in the tutorial are running on AWS. Whatever cloud provider you are using, I assume that you have an SSH key set up (if needed) and some network and subnet that you would like to use. Note that the data lake functionality cannot be used with OpenStack.
  • The steps require that you create some resources (databases, LDAP, cloud storage) outside of Cloudbreak. See the "Meet the Prerequisites" section.
  • This functionality is in technical preview in Cloudbreak 2.9.x and is not suitable for production.

Update for Cloudbreak 2.9.0

This post has been updated to reflect the latest functionality available in Cloudbreak 2.9.0.

What is a data lake

A data lake provides a way for you to centrally apply and enforce authentication, authorization, and audit policies across multiple ephemeral workload clusters. "Attaching" your workload cluster to the data lake instance allows the attached cluster workloads to access data and run in the security context provided by the data lake.

Overview of steps

In general, setting up a data lake involves the following steps:

  1. Decide which HDP version and which data lake blueprint to use
  2. Meet the prerequisites:
    • You must have an existing LDAP
    • You must have an external database for Hive (such as an RDS on AWS)
    • You must have an external database for Ranger (such as an RDS on AWS)
    • If planning to use the HA blueprint, you must have an external database for Ambari (such as an RDS on AWS)
    • You must have an existing cloud storage location (such as an S3 bucket)
  3. Register the two databases and LDAP
  4. Create a data lake
  5. Create attached workload clusters

Data lake blueprints

When creating a data lake, you can choose from one of the three available blueprints.

The following data lake blueprints are provided by default in Cloudbreak:

  • HDP 2.6 Data Lake: Apache Ranger, Apache Atlas, Apache Hive Metastore: Includes Apache Ranger, Apache Atlas, and Apache Hive Metastore.
  • HDP 2.6 Data Lake: Apache Ranger, Apache Hive Metastore HA: Includes Apache Ranger and Apache Hive Metastore in HA mode. Automatic and manual recovery options are available for this type of data lake. Includes two master host groups.
  • HDP 3.1 Data Lake: Apache Ranger, Apache Hive Metastore HA: Includes Apache Ranger. The Hive Metastore service itself has been removed from the HDP 3.x data lake blueprints, but registering an external Hive database still allows all clusters attached to the data lake to connect to the same Hive Metastore.

Depending on your use case, select one of these blueprints.

Meet the prerequisites

Meet the following prerequisites outside of Cloudbreak:

  1. Set up two external database instances, one for the HIVE component, and one for the RANGER component.
    • For supported databases, see Supported databases.
    • I set up two micro RDS instances on AWS and created a database on each. I created these in the same region and VPC where I am planning to launch my data lake.
    • It is possible to use one database instance, but you must create two separate databases (see the example psql commands after this list).
  2. If planning to use the HA blueprint, also set up an external database for Ambari (such as an RDS on AWS).
  3. Create an LDAP instance and set up your users inside the LDAP.
  4. Prepare a cloud storage location for default Hive warehouse directory and Ranger audit logs. I created an S3 bucket on AWS. On AWS, you must also have an IAM instance profile that allows access to that bucket.

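For the database prerequisite in step 1, here is a minimal sketch of creating the two databases on a single PostgreSQL RDS instance. The hostname and master user below are illustrative placeholders, not values from this tutorial, and you will be prompted for the master password:

# Create separate Hive and Ranger databases on one PostgreSQL instance
psql -h my-rds-instance.abcdefgh1234.us-east-1.rds.amazonaws.com -U masteruser -d postgres -c "CREATE DATABASE hive;"
psql -h my-rds-instance.abcdefgh1234.us-east-1.rds.amazonaws.com -U masteruser -d postgres -c "CREATE DATABASE ranger;"
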
In the steps that follow, you will be required to provide the information related to these external resources.

Register the databases and LDAP in the Cloudbreak web UI

Prior to creating a data lake, you must register the following resources in the Cloudbreak web UI:

  1. Register each of your two databases created as part of the prerequisites in the Cloudbreak web UI, under External Sources > Database Configurations. For instructions, see Register an external database.
    • When registering the database for Hive, select Type > Hive.
    • When registering the database for Ranger, select Type > Ranger.
    • The endpoint needs to be in the following format: INSTANCE-URL:PORT/DB-NAME (see the example after this list).
  2. If using the HA blueprint, also register an external database for Ambari. When registering the database for Ambari, select Type > Ambari.
  3. Register your LDAP (created as part of the prerequisites) in the Cloudbreak web UI, under External Sources > Authentication Configurations. For instructions, see Register an authentication source.
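
For example, assuming the hypothetical PostgreSQL RDS instance and database names from the prerequisites sketch above (5432 is the default PostgreSQL port), the endpoints would look like this:

my-rds-instance.abcdefgh1234.us-east-1.rds.amazonaws.com:5432/hive
my-rds-instance.abcdefgh1234.us-east-1.rds.amazonaws.com:5432/ranger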

As an outcome of this step, you should have two or three external databases and one authentication source registered in the Cloudbreak web UI.

Create a data lake

Create a data lake by using the create cluster wizard. Among other settings, make sure to provide the information listed in the steps below.

1. In the Cloudbreak web UI, navigate to Clusters and click Create Cluster.

2. On the General Configuration page:

  • Under Cluster Name, provide a name for your data lake.
  • Under Cluster Type, choose one of the available data lake blueprints described in the "Data lake blueprints" section above.

3. (Only if using the HA blueprint) On the Hardware and Storage page, you can select the following for each host group:

  • Under Instance Count, you can optionally specify how many nodes should be included in each host group. By default, Cloudbreak creates the minimum viable number of nodes. We recommend placing an odd number of nodes in each host group, for a total of 3 or 5 instances.
  • You can optionally select Enable Auto Recovery. Enabling this option allows Cloudbreak to automatically recover any failed nodes. If you do not check this option, you will have to trigger recovery of failed nodes manually.

4. (Only if using the HA blueprint) On the Network and Availability page, enter:

  • Custom Domain: Enter a domain name that Cloudbreak can use locally, for example “mydatalake.local”. This domain name is for local use and does not require DNS services.
  • Custom Hostname: Enter a naming convention to use for the host names, for example “prod”.
  • For example, if using “mydatalake.local” as a custom domain and “prod” as a host name, the actual host names will be prod0.cluster-name.mydatalake.local, prod1.cluster-name.mydatalake.local, and so on.

5. On the Cloud Storage page:

  • Under Cloud Storage, configure access to cloud storage via the method available for your cloud provider.
  • Under Storage Locations, provide an existing location within your cloud storage account that can be used for Ranger audit logs and, if using HDP 2.6, for the Hive warehouse directory (see the example after the note below). If using the Atlas blueprint, this location will also be used for the HBase Root Directory.

  Note: The storage location must exist prior to data lake provisioning. If the storage location does not exist, Ranger installs properly, but it may not work correctly.
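
For example, assuming a hypothetical S3 bucket named my-datalake-bucket, the storage locations could look like the following. The exact field labels and whether an s3a:// prefix is needed depend on your Cloudbreak version, so follow the hints shown in the UI:

my-datalake-bucket/apps/ranger/audit      (Ranger audit logs)
my-datalake-bucket/apps/hive/warehouse    (Hive warehouse directory, HDP 2.6 blueprints only)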

6. On the External Sources page, select the previously registered Ranger database, Hive database, and LDAP.
If using the HA blueprint, also select the previously registered Ambari database.

7. On the Gateway Configuration page, the gateway is enabled by default with Ambari exposed through it. You should also enable Ranger by selecting the Ranger service and clicking Expose.

8. On the Network Security Groups page, you do not need to change anything. If you would like to restrict the open ports, refer to Default cluster security groups.

9. On the Security page:

  • Under Password, provide a strong password for your cluster, for example “SomeRandomChars123!”. This password is used by the default Ranger admin as well as other cluster components such as Ambari, so make sure it is strong.
  • Select an SSH key.

10. Click Create Cluster to initiate data lake creation.

As an outcome of this step, you should have a running data lake. Once the data lake is running, you can create workload clusters attached to it.

Create attached HDP clusters

Once your data lake is running, you can start creating workload clusters attached to it. Follow these general steps to create an attached cluster.

1. In the Cloudbreak web UI, click on the cluster tile representing your data lake.

2. From the ACTIONS menu, select CREATE ATTACHED CLUSTER.

3. In general, the cluster wizard provides the settings that should be used for the attached cluster. Still, make sure to do the following:

  • Under Region and Availability Zone, select the same location where your data lake is running.
  • Select one of the three default blueprints.
  • On the Cloud Storage page, enter the same cloud storage location that your data lake is using.
  • On the External Sources page, the LDAP and the Ranger and Hive databases that you attached to the data lake should already be attached to your cluster.
  • On the Network page, select the same VPC and subnet where the data lake is running.

4. Click CREATE CLUSTER to initiate cluster creation.

As an outcome of this step, you should have a running cluster attached to the data lake. Access your attached clusters and run your workloads as normal.

Perform the Same Steps with the CLI

Cloudbreak CLI makes it possible to perform the exact same tasks as those that can be performed from the web UI.

Download and configure the CLI

The CLI is available for download from the web UI under Download CLI.

Download it and then access it from that location by using ./cb (or add it to your PATH).

In order to start using it, you must configure it to point to your Cloudbreak instance. For example:

cb configure --server https://my-cb-instanceIP.com --username myuser@example.com --password MySecurePassword1234!
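
Assuming the configuration succeeded, you can verify that the CLI can reach your Cloudbreak instance by listing the existing clusters (the list is empty if you have not created any clusters yet):

cb cluster list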

Register the databases and LDAP from the CLI

You can use the following CLI commands to register the databases and LDAP:

cb database create
cb ldap create

There are multiple ways to obtain the syntax, but the easiest way is to enter the parameters in the Cloudbreak web UI under External Sources > Database Configurations and Authentication Configurations and then obtain the correct commands from the SHOW CLI COMMAND option.
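
Once registered, you can check the registered resources from the CLI as well. Assuming the list subcommands available in recent Cloudbreak CLI versions (check cb database --help and cb ldap --help if your version differs):

cb database list
cb ldap list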

Create a data lake with the CLI

Similar to the previous step, the easiest way to obtain the correct JSON for the data lake is by using the UI. So I recommend that you start by doing the following:

1. Provide the parameters for your data lake in the web UI create cluster wizard, as described in the instructions above.

2. On the last page of the create wizard, click SHOW CLI COMMAND to generate the JSON template. Once you have done this once, you can reuse your JSON file.

3. Paste the JSON content into a text editor and save it as a JSON file.

4. Review the JSON. The generated JSON never stores the cluster user password, so you need to add a strong password to the file.

5. Use the following command to create a data lake:

cb cluster create --cli-input-json /path/to/json/mydatalake.json
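
To follow the progress of the data lake creation from the CLI, you can describe the cluster by name. This assumes the cluster name defined in your JSON is mydatalake; check cb cluster describe --help if the flags differ in your CLI version:

cb cluster describe --name mydatalake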

Create attached HDP clusters with the CLI

Once you have a running data lake, similar to the previous steps, the easiest way to obtain the correct JSON for attached clusters is by using the UI:

1. Provide the parameters for your attached cluster in the web UI create cluster wizard, as described in the instructions above.

2. On the last page of the create wizard, click SHOW CLI COMMAND to generate the JSON.

3. Paste the JSON content into a text editor and save it as a JSON file.

4. Review the JSON. The generated JSON never stores the cluster user password, so you need to add a strong password to the file.

5. Use the following command to create the attached cluster:

cb cluster create --cli-input-json /path/to/json/mycluster.json

Once you’ve done that once, you can reuse your JSON file.

The following parameters are used to reference the data lake to which the cluster is attached:

"sharedService": {"sharedCluster":"test-datalake"},