A quick way to get started with CDP and Cloudera Data Warehouse (CDW) is to hydrate your environment with some existing relational tables you have. In this article, we'll walk through ingesting existing relational tables into CDP, creating table definitions on top of this data, and then briefly explore the question of "now what?"

A quick note: In this article, we'll use Apache Sqoop to do some full table copies as well as some simple selects/copies based on a where condition. If you plan on doing anything more complex or have thoughts of setting up something more permanent, you may want to consider another solution (like NiFi).

Special thanks to @pvidal for the initial dataset and RDS creation automation.

Determine the Source Data

I have created an Oracle database (RDS) in one of my AWS accounts. Keep in mind that, whatever your source is, you'll need to make sure the proper network ingress is enabled so that your CDP environment can reach it. I'll skip the specifics for my case, but suffice it to say I needed to make sure that my CDP environment could make a JDBC connection to my RDS instance.
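
A quick, low-tech way to check reachability before worrying about JDBC at all is a simple port test from any host inside your CDP environment (for example, the Data Hub cluster we'll spin up in the next section). This assumes the netcat utility is available on that host; substitute your own RDS endpoint and listener port:

# Test that the Oracle listener port is reachable from inside the CDP environment
$ nc -zv cmp-cdp-workshop-orcl-rds.cjofigxjk44c.us-east-1.rds.amazonaws.com 1521

If that reports the port as open, the Sqoop connection string later on has a fighting chance.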


Here's a quick snapshot of my data. I have three tables populated with some sample data:

cdc.location
cdc.vaccine
cdc.vaccination_rate

[Screenshot: sample rows from the three source tables]
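
If you'd rather poke at the source yourself (from SQL*Plus, SQL Developer, or any Oracle client), a simple catalog query will confirm what's there. This assumes the CDC schema owner used throughout this article:

-- List the tables owned by the CDC schema in the source database
SELECT table_name
FROM   all_tables
WHERE  owner = 'CDC'
ORDER  BY table_name;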

Sqoop Setup

Spin up a "Data Engineering" Data Hub cluster in your CDP environment (or reuse an existing one). You'll need your Environment Admin to do some quick prep for you - namely, dropping your JDBC driver jar into Sqoop's lib directory. In this case, it's the Oracle OJDBC jar:

$ sudo cp ojdbc7-12.1.0.2.jar /opt/cloudera/parcels/CDH/lib/sqoop/lib/ && sudo chmod 777 /opt/cloudera/parcels/CDH/lib/sqoop/lib/ojdbc7-12.1.0.2.jar

Now we need to prep a storage location for the data to be copied to. Technically, you have two options here: you can ingest the data into HDFS on your Data Hub cluster (which is what happens if you don't specify a target directory in your Sqoop command - it will land in your HDFS home directory), or you can write it to Cloud Storage. In order to utilize these tables across the entire CDP landscape, we'll copy them to Cloud Storage (our environment's S3 bucket).

SSH into your Data Hub cluster (as yourself) and create a staging location on your Cloud Storage. 

$ hdfs dfs -mkdir -p s3a://perro-cdp-bucket/perro-de/sqoop/cli/tables

This assumes you have the proper IDBroker mappings in place for your user. See the Cloudera Data Platform docs for more details.
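
If you want to confirm the mapping is working before kicking off any Sqoop jobs, a quick listing of the staging path we just created (substitute your own bucket and prefix) should come back without any credential errors:

# List the staging path; a permission or credential error here means the IDBroker mapping needs attention
$ hdfs dfs -ls s3a://perro-cdp-bucket/perro-de/sqoop/cli/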

Sqoop Execution

Now, it's time to actually do the sqoop-ing! In the examples below, I just do a full copy of the three Oracle tables. See the Sqoop docs for additional examples and configurations (like custom select statements and utilizing more YARN resources).

The main parameters (that you'll have to change according to your setup) are:

  1. the JDBC URL - e.g. jdbc:oracle:thin:@cmp-cdp-workshop-orcl-rds.cjofigxjk44c.us-east-1.rds.amazonaws.com:1521:ORCL, which is of the format jdbc:oracle:thin:@<your_rds>:<your_rds_port>:<db_sid>
  2. the DB username - e.g. "cdc"
  3. the DB user's password - e.g. "cdpw0rksh0p"
  4. the table name - e.g. "CDC.LOCATION"
  5. the target directory - e.g. "s3a://<your_cdp_bucket>"

$ sqoop import --connect jdbc:oracle:thin:@cmp-cdp-workshop-orcl-rds.cjofigxjk44c.us-east-1.rds.amazonaws.com:1521:ORCL --username cdc --password cdpw0rksh0p --table CDC.LOCATION --target-dir 's3a://perro-cdp-bucket/perro-de-2/sqoop/cli/tables/location'

$ sqoop import --connect jdbc:oracle:thin:@cmp-cdp-workshop-orcl-rds.cjofigxjk44c.us-east-1.rds.amazonaws.com:1521:ORCL --username cdc --password cdpw0rksh0p --table CDC.VACCINE --target-dir 's3a://perro-cdp-bucket/perro-de-2/sqoop/cli/tables/vaccine'

$ sqoop import --connect jdbc:oracle:thin:@cmp-cdp-workshop-orcl-rds.cjofigxjk44c.us-east-1.rds.amazonaws.com:1521:ORCL --username cdc --password cdpw0rksh0p --table CDC.VACCINATION_RATE --target-dir 's3a://perro-cdp-bucket/perro-de-2/sqoop/cli/tables/vaccination_rate'

Note: I did encounter some oddness with Oracle or Sqoop or the OJDBC jar - I *had* to capitalize the table names. 
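
As mentioned earlier, Sqoop can also do filtered and parallel copies rather than full table imports. Purely as a sketch (the cutoff year, split column, mapper count, and target path below are illustrative; adjust them to your own schema and bucket), a subset copy of the vaccination rate table might look like this:

# Copy only rows with YEAR >= 2015, split across 4 mappers on LOCATION_ID
$ sqoop import --connect jdbc:oracle:thin:@<your_rds>:<your_rds_port>:<db_sid> --username cdc -P --table CDC.VACCINATION_RATE --where "YEAR >= 2015" --split-by LOCATION_ID --num-mappers 4 --target-dir 's3a://<your_cdp_bucket>/sqoop/cli/tables/vaccination_rate_recent'

Using -P instead of --password makes Sqoop prompt for the password interactively, which keeps it out of your shell history.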

After your job completes, you can do a quick sanity check using the HDFS CLI:
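
For example, a recursive listing of the import root (adjust the bucket and prefix to whatever target directories you actually used) should show one or more part-m-* files under each table's directory:

# Each table directory should contain one part-m-* file per mapper
$ hdfs dfs -ls -R s3a://perro-cdp-bucket/perro-de-2/sqoop/cli/tables/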


Next Steps

Now that you have hydrated your CDP environment with some data, the world is your oyster! You could:

  • Process the data with Cloudera Data Engineering (or another Datahub)
  • Build Impala/Hive tables on top of the data and use Cloudera Data Warehouse
  • Leverage the data in Cloudera Machine Learning

BONUS - CDW and Viz Apps

Let's run with the second of those "Next Steps": building some Hive tables on top of the data and querying them through Cloudera Data Warehouse:

  1. Presuming you have a CDW environment, Database Catalog, and Virtual Warehouse at your disposal, let's create the external Hive tables:
    CREATE EXTERNAL TABLE LOCATION
    (
      location_id INT,
      location_name STRING
    )
    ROW FORMAT DELIMITED
    FIELDS TERMINATED BY ','
    STORED AS TEXTFILE
    LOCATION 's3a://perro-cdp-bucket/perro-data-eng/sqoop/cli/tables/location';
    
    CREATE EXTERNAL TABLE VACCINE
    (
      vaccine_id INT,
      vaccine_name STRING,
      vaccine_dose INT,
      vaccine_age STRING
    )
    ROW FORMAT DELIMITED
    FIELDS TERMINATED BY ','
    STORED AS TEXTFILE
    LOCATION 's3a://perro-cdp-bucket/perro-data-eng/sqoop/cli/tables/vaccine';
    
    CREATE EXTERNAL TABLE VACCINATION_RATE
    (
      location_id INT,
      vaccine_id INT,
      year INT,
      rate FLOAT,
      upper_limit FLOAT,
      lower_limit FLOAT,
      confidence_interval FLOAT,
      sample_size INT,
      target FLOAT
    )
    ROW FORMAT DELIMITED
    FIELDS TERMINATED BY ','
    STORED AS TEXTFILE
    LOCATION 's3a://perro-cdp-bucket/perro-data-eng/sqoop/cli/tables/vaccination_rate';
  2. Now you can query, data engineer, and machine learn across your CDP environment. I happened to create my Virtual Warehouse with "Data Visualization" enabled.
  3. So, I'll continue on and create a quick Dashboard using Cloudera Data Visualization. Click the options menu on your Virtual Warehouse and, instead of selecting Hue or DAS, click "Open Data Visualization".
  4. Log in and head to the "Data" tab.
  5. From there, click on the Default Hive VW connection and open the "Connection Explorer" to check out the tables available to you.
  6. In the Connection Explorer you can see the three tables we created in the default database. Let's create a quick Dataset so we can use it to build a Dashboard: click the "New Dataset" button, give your Dataset a name, and choose which database and table it should use.
  7. You should now see your new Dataset in the "Datasets" tab. To create a Dashboard from it, click the icon next to your Dataset.
  8. You should be presented with your data in a table, but that's boring, so let's create a quick Bar Chart to wrap things up. Click the third little icon in the visuals tray to turn the object into a Bar Chart. Then, drag the "vaccine_age" dimension to "X-Axis" and the "Record Count" measure to "Y-Axis". Finally, click the "Refresh Visual" button. (An equivalent SQL query is sketched below.)

You now have a bar chart that shows the number of vaccines by the age at which they are administered!
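
If you'd like to double-check the chart (or just prefer SQL), the same aggregation can be run from Hue, DAS, or any client connected to your Virtual Warehouse. This is a quick sketch against the VACCINE table created above; it simply counts records per administration age, which is exactly what the bar chart plots:

-- Record count per administration age, i.e. the data behind the bar chart
SELECT vaccine_age,
       COUNT(*) AS record_count
FROM   vaccine
GROUP  BY vaccine_age
ORDER  BY record_count DESC;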
