Over the last few years, Cloudera has evolved to become the Hybrid Data Company. Inarguably, the future of the data ecosystem is a hybrid one that enables common security and governance across on-premises and multi-cloud environments. In this article, you will learn how Cloudera enables key hybrid capabilities such as application portability and data replication so you can quickly move workloads and data to the cloud.
Let's explore how Cloudera Data Platform (CDP) excels at these hybrid use cases through two exercises: replicating a data pipeline to the public cloud and migrating it there.
The following diagram shows the data pipeline in both private and public cloud form factors:
Below are the steps to replicate a data pipeline from a CDP Private Cloud Base (PVC) cluster to a CDP Public Cloud (PC) environment:
Prior to creating replication policies, register the private cloud cluster in the public cloud environment. Go to Classic Clusters in the CDP PC Management Console and add the PVC cluster. In the Cluster Details dialog, select the CDP Private Cloud Base option and fill out all the details. If you have Knox enabled in your PVC cluster, check the "Register KNOX endpoint (Optional)" box and add the Knox IP address and port.
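Before registering, it can save time to confirm that the Knox endpoint is actually reachable from your network. Here is a minimal Python sketch; the hostname and port below are hypothetical placeholders, so substitute your own cluster's values:

```python
import socket

# Hypothetical Knox endpoint values; replace with your PVC cluster's details.
KNOX_HOST = "pvc-knox.example.com"
KNOX_PORT = 8443  # Knox's default gateway port

def knox_reachable(host: str, port: int, timeout: float = 5.0) -> bool:
    """Return True if a TCP connection to the Knox gateway succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

if __name__ == "__main__":
    if knox_reachable(KNOX_HOST, KNOX_PORT):
        print(f"Knox endpoint {KNOX_HOST}:{KNOX_PORT} is reachable.")
    else:
        print(f"Cannot reach {KNOX_HOST}:{KNOX_PORT}; check firewall rules and DNS.")
```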
To ensure the PVC cluster works with Replication Manager in the PC environment, please follow the instructions given in Adding CDP Private Cloud Base cluster for use in Replication Manager and Data Catalog.
For more details on adding and managing classic clusters, please visit Managing classic clusters.
Once the PVC cluster is added, proceed to create the following policies in Replication Manager (a quick verification sketch follows this list):
Create HDFS Policy
Create Hive Policy
Create HBase Policy
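After the HDFS policy has run, you can spot-check that files actually landed in object storage. Below is a minimal Python sketch, assuming an AWS target, a hypothetical bucket name and prefix, and boto3 credentials already configured in your environment:

```python
import boto3  # AWS SDK for Python; credentials come from your environment

# Hypothetical values; replace with your replication target's bucket and prefix.
BUCKET = "my-cdp-datalake"
PREFIX = "replicated/warehouse/"

s3 = boto3.client("s3")

# List a handful of replicated objects to confirm the policy copied data over.
response = s3.list_objects_v2(Bucket=BUCKET, Prefix=PREFIX, MaxKeys=10)
for obj in response.get("Contents", []):
    print(f"{obj['Key']}  ({obj['Size']} bytes)")

if response.get("KeyCount", 0) == 0:
    print("No objects found; check the replication policy's status and target path.")
```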
Great job if you have made it this far!
So what did we learn? We learned that CDP's hybrid capabilities let us replicate and migrate data pipelines from one environment to another, whether on-premises or in the public cloud, with minimal effort. In this exercise, all the code that ran on the private cloud was successfully replicated over to the AWS environment with minimal configuration and code changes (due to the change in filesystem, from HDFS to S3), as the sketch below illustrates.
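To make the filesystem change concrete, here is a minimal PySpark sketch of a pipeline step. The paths and bucket name are hypothetical; the point is that only the storage URIs change between form factors, not the transformation logic:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("pipeline-step").getOrCreate()

# On the private cloud cluster, the pipeline read and wrote HDFS paths:
# input_path = "hdfs:///warehouse/raw/events"
# output_path = "hdfs:///warehouse/curated/events"

# On AWS, the same job only needs the URIs switched to S3 (hypothetical bucket).
input_path = "s3a://my-cdp-datalake/warehouse/raw/events"
output_path = "s3a://my-cdp-datalake/warehouse/curated/events"

# The transformation logic itself is unchanged across form factors.
events = spark.read.parquet(input_path)
events.filter(events["status"] == "active").write.mode("overwrite").parquet(output_path)
```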
A similar set of steps can be followed to migrate data pipelines to Microsoft Azure and Google Cloud Platform (GCP) environments.
When it comes to migrating to the cloud, moving all workloads at the same time is usually extremely risky. As part of your migration strategy, you need to divide the workloads in a logical manner and migrate them in phases, and CDP gives you great flexibility to ensure the migration follows your strategy.
Please don't hesitate to add a comment if you have any questions.