Community Articles
Find and share helpful community-sourced technical articles
Cloudera Employee

As explained in this Parent article, here is the first step in our hybrid cloud implementation: replicating bank branches and employee data from an on-prem cluster to CDP Public Cloud.

Screen Shot 2020-11-04 at 10.03.07 AM.png

Prerequisites

  • A CDP Base Cluster (with admin rights and user part of HDFS supergroup)
  • A CDP Public Cloud environment (with admin rights)
  • CDW virtual warehouse attached to Public Cloud environment

Note: you can find the datasets used for this exercise here

Step 1: Register CDP Base as classic cluster

Start registration

  1. In your CDP Public Cloud management console, go to Classic Clusters > Add Cluster > CDH and enter your CDP Base cluster information:Screen Shot 2020-11-04 at 10.15.03 AM.png
  2. You will then see your cluster registration in progress:Screen Shot 2020-11-04 at 10.17.42 AM.png

Install a secure communication tunnel

  1. Click on the Files button in Install files and follow the instructions:Screen Shot 2020-11-04 at 10.22.38 AM.png
  2. The following are some example instructions to do this on your CM node.
    1. Download from your management console the ssh_tunnel_setup_files.zip archive:Screen Shot 2020-11-04 at 10.37.54 AM.png
    2. Copy it to your CM node

 

 

$ scp -i [your_key_location] ssh_tunnel_setup_files.zip [your_user_with_sudo_privileges]@[your_host]:/home/[your_user_with_sudo_privileges]

 

 

  • SSH to CM node and instal ccm autossh

 

 

$ ssh -i [your_key_location] [your_user_with_sudo_privileges]@[your_host]
$ sudo su
$ wget https://archive.cloudera.com/ccm/0.1/ccm-autossh-client-0.1-20191023211905gitd03880c.x86_64.rpm
$ yum -y --nogpgcheck localinstall ccm-autossh-client-0.1-20191023211905gitd03880c.x86_64.rpm

 

 

  • Install Tunnel

 

 

$ unzip ssh_tunnel_setup_files.zip
$ ./install.sh 

 

 

  • Post the installation you should see a message like this:

 

 

==========================================================================================
SSH tunnel for CM established successfully.
Run 'ccm-tunnel status' for status
Run 'journalctl -f -u ccm-tunnel@CM.service' or 'journalctl -xe' for logs.
==========================================================================================

 

 

 

 

Finish Registration

  1. In your Management console, click on Test Connection:Screen Shot 2020-11-04 at 10.41.26 AM.png
  2. Once the connection is successful, you can click on Register, add your CM user/pw and connect:Screen Shot 2020-11-04 at 10.41.57 AM.png
  3. Finally, enter the location of your base cluster (to display on dashboard map):Screen Shot 2020-11-04 at 10.45.39 AM.png
  4. You have now successfully established a secure tunnel between CDP Base and CDP Public Cloud:Screen Shot 2020-11-04 at 10.46.22 AM.png

Step 2: Create a Replication Policy

  1. Navigate to Replication Manager > Classic Clusters > 3 dots on your cluster > Add Policy:Screen Shot 2020-11-04 at 10.48.03 AM.png
  2. In our case, we are going to replicate 2 datasets from HDFS:
    • Employee data
    • Bank location data
      1. In Step 1, give a policy name and select HDFS:Screen Shot 2020-11-04 at 10.54.35 AM.png
      2. In Step 2, add the location of your dataset and name of your superuser:Screen Shot 2020-11-04 at 11.00.45 AM.png
      3. In Step 3, select s3 add your AWS credentials:
  3. After validation, enter the target bucket (your environment cloud storage) and validate:Screen Shot 2020-11-04 at 11.05.30 AM.png
  4. For the next two steps, use default settings:Screen Shot 2020-11-04 at 11.06.24 AM.pngScreen Shot 2020-11-04 at 11.06.29 AM.png
  5. After you click create, you will see the replication policy progressing.
  6. Wait for it to complete successfully then move on to the next step:Screen Shot 2020-11-04 at 11.08.45 AM.png

 

Step 3: Create external and managed tables in CDW

  1. Navigate to CDW > 3 dots of your virtual warehouse > Open Hue:Screen Shot 2020-11-04 at 11.18.35 AM.png
  2. In your query editor, run the following queries (adapting to your s3 path of course):

 

 

create database if not exists worldwidebank;


use worldwidebank;

CREATE EXTERNAL TABLE if not exists worldwidebank.employees_ext(
 number int,
 location int,
 gender string,
 title string,
 givenname string,
 middleinitial string,
 surname string,
 streetaddress string, 
 city string,
 state string,
 statefull string,
 zipcode string,
 country string,
 countryfull string,
 emailaddress string,
 username string,
 password string,
 telephonenumber string,
 telephonecountrycode string,
 mothersmaiden string,
 birthday string,
 age int,
 tropicalzodiac string,
 cctype string,
 ccnumber string,
 cvv2 string,
 ccexpires string,
 ssn string,
 insuranceid string,
 salary string,
 bloodtype string,
 weight double,
 height int)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
STORED AS TEXTFILE
LOCATION 's3a://pvi-e2e-cdp-bucket/vizbank/raw/employees/'
tblproperties("skip.header.line.count"="1");




CREATE EXTERNAL TABLE if not exists worldwidebank.locations_ext(
    LOCATION_ID int,
    ADDRESS string,
    BKCLASS string,
    CBSA string,
    CBSA_DIV string,
    CBSA_DIV_FLG string,
    CBSA_DIV_NO string,
    CBSA_METRO string,
    CBSA_METRO_FLG string,
    CBSA_METRO_NAME string,
    CBSA_MICRO_FLG string,
    CBSA_NO string,
    CERT string,
    CITY string,
    COUNTY string,
    CSA string,
    CSA_FLG string,
    CSA_NO string,
    ESTYMD string,
    FI_UNINUM string,
    MAINOFF string,
    NAME string,
    OFFNAME string,
    OFFNUM string,
    RUNDATE string,
    SERVTYPE string,
    STALP string,
    STCNTY string,
    STNAME string,
    UNINUM string,
    ZIP int)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
STORED AS TEXTFILE
LOCATION 's3a://pvi-e2e-cdp-bucket/vizbank/raw/locations/'
tblproperties("skip.header.line.count"="1");


create table worldwidebank.employees as select * from worldwidebank.employees_ext;
create table worldwidebank.locations as select * from worldwidebank.locations_ext;

CREATE MATERIALIZED VIEW worldwidebank.employees_per_state as select locations.stname, count(*) as num_employees from employees, locations where employees.location=locations.location_id GROUP BY locations.stname;

 

 

  • And that's it, you now have replicated data from your base cluster to CDP:Screen Shot 2020-11-04 at 11.23.49 AM.png

 

The next step will be to profile sensitive data to protect our employees' data.

327 Views
Don't have an account?
Version history
Revision #:
6 of 6
Last update:
‎12-02-2020 02:46 PM
Updated by:
 
Contributors
Top Kudoed Authors