A quick way to get started with CDP and Cloudera Data Warehouse (CDW) is to hydrate your environment with some relational tables you already have. In this article, we'll walk through ingesting existing relational tables into CDP, creating table definitions on top of that data, and briefly exploring the question of "now what?"
A quick note: In this article, we'll use Apache Sqoop to do some full table copies as well as some simple selects/copies based on a where condition. If you plan on doing anything more complex or have thoughts of setting up something more permanent, you may want to consider another solution (like NiFi).
Special thanks to @pvidal for the initial dataset and RDS creation automation.
I have created an Oracle database in one of my AWS accounts. Keep in mind, wherever your source data lives, you'll need to make sure that proper ingress is enabled in your CDP environment. I'll skip the specifics for my case, but suffice it to say I needed to make sure that my CDP environment could make a JDBC connection to my RDS instance.
Here's a quick snapshot of my data. I have three tables populated with some sample data:
cdc.location
cdc.vaccine
cdc.vaccination_rate
Spin up a "Data Engineering" Data Hub cluster in your CDP environment (or reuse an existing one). You'll need your Environment Admin to do some quick prep for you - namely, drop your JDBC jar into Sqoop's lib directory. In this case, it's the OJDBC jar:
$ sudo cp ojdbc7-12.1.0.2.jar /opt/cloudera/parcels/CDH/lib/sqoop/lib/ && sudo chmod 777 /opt/cloudera/parcels/CDH/lib/sqoop/lib/ojdbc7-12.1.0.2.jar
Now we need to prep a storage location for our data to be copied to. Technically, you have two options here: you can ingest your data into HDFS on your Data Hub cluster (which is what happens if you don't specify a target directory in your Sqoop command - it lands in your HDFS home directory), or you can write to cloud storage. In order to utilize these tables across the entire CDP landscape, we'll copy them to cloud storage (our environment's S3 bucket).
SSH into your Data Hub cluster (as yourself) and create a staging location on your Cloud Storage.
$ hdfs dfs -mkdir -p s3a://perro-cdp-bucket/perro-de-2/sqoop/cli/tables
This implies you have the proper IDBroker mappings. See the Cloudera Data Platform Docs for more details.
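If you're unsure whether your mapping is in place, a simple listing against the bucket is a quick (if informal) check - if the mapping is missing, the command fails with an access-denied style error rather than returning the directory contents:
# Hypothetical sanity check: a successful listing means IDBroker is handing your user valid credentials for the bucket
$ hdfs dfs -ls s3a://perro-cdp-bucket/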
Now, it's time to actually do the sqoop-ing! In the examples below, I just do a full copy of the three Oracle tables. See the Sqoop docs for additional examples and configuration options, like custom select statements and utilizing more YARN resources; a quick sketch of a conditional import follows the full copies below.
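Before launching the full imports, it's also worth confirming that the JDBC driver and connection work end to end. A minimal sketch using sqoop list-tables with the same connection string and credentials as the imports below - if the driver and ingress rules are correct, it should list the tables visible to the cdc user:
# Hypothetical connectivity check: prints the source tables if the JDBC connection works
$ sqoop list-tables --connect jdbc:oracle:thin:@cmp-cdp-workshop-orcl-rds.cjofigxjk44c.us-east-1.rds.amazonaws.com:1521:ORCL --username cdc --password cdpw0rksh0p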
The main parameters you'll need to change according to your setup are the JDBC connection string (--connect), the database credentials (--username and --password), the source table (--table), and the cloud storage destination (--target-dir):
$ sqoop import --connect jdbc:oracle:thin:@cmp-cdp-workshop-orcl-rds.cjofigxjk44c.us-east-1.rds.amazonaws.com:1521:ORCL --username cdc --password cdpw0rksh0p --table CDC.LOCATION --target-dir 's3a://perro-cdp-bucket/perro-de-2/sqoop/cli/tables/location'
$ sqoop import --connect jdbc:oracle:thin:@cmp-cdp-workshop-orcl-rds.cjofigxjk44c.us-east-1.rds.amazonaws.com:1521:ORCL --username cdc --password cdpw0rksh0p --table CDC.VACCINE --target-dir 's3a://perro-cdp-bucket/perro-de-2/sqoop/cli/tables/vaccine'
$ sqoop import --connect jdbc:oracle:thin:@cmp-cdp-workshop-orcl-rds.cjofigxjk44c.us-east-1.rds.amazonaws.com:1521:ORCL --username cdc --password cdpw0rksh0p --table CDC.VACCINATION_RATE --target-dir 's3a://perro-cdp-bucket/perro-de-2/sqoop/cli/tables/vaccination_rate'
Note: I did encounter some oddness with Oracle, Sqoop, or the OJDBC jar - I *had* to capitalize the table names in the --table argument.
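If you only need a subset of rows (the "simple selects/copies based on a where condition" mentioned at the top), Sqoop's --where, --split-by, and --num-mappers options cover the basics. Here's a minimal sketch against the VACCINATION_RATE table - the year cutoff and split column are just illustrative, so adjust them for your data:
# Hypothetical conditional import: copy only a subset of rows and spread the work across 4 mappers
$ sqoop import --connect jdbc:oracle:thin:@cmp-cdp-workshop-orcl-rds.cjofigxjk44c.us-east-1.rds.amazonaws.com:1521:ORCL --username cdc --password cdpw0rksh0p --table CDC.VACCINATION_RATE --where "YEAR >= 2015" --split-by LOCATION_ID --num-mappers 4 --target-dir 's3a://perro-cdp-bucket/perro-de-2/sqoop/cli/tables/vaccination_rate_recent'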
After your job completes, you can do a quick sanity check using the HDFS CLI:
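For example, list one of the target directories and peek at the first few rows of a part file (the part file name below is the typical Sqoop output name, but yours may differ):
# List the files Sqoop wrote for the location table
$ hdfs dfs -ls s3a://perro-cdp-bucket/perro-de-2/sqoop/cli/tables/location
# Peek at the first few comma-delimited rows
$ hdfs dfs -cat s3a://perro-cdp-bucket/perro-de-2/sqoop/cli/tables/location/part-m-00000 | head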
Now that you have hydrated your CDP environment with some data, the world is your oyster! You could take this in any number of directions; for this article, let's build some Hive tables on top of the files and query them with Cloudera Data Warehouse:
CREATE EXTERNAL TABLE LOCATION
(
location_id INT,
location_name STRING
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
STORED AS TEXTFILE
LOCATION 's3a://perro-cdp-bucket/perro-de-2/sqoop/cli/tables/location';
CREATE EXTERNAL TABLE VACCINE
(
vaccine_id INT,
vaccine_name STRING,
vaccine_dose INT,
vaccine_age STRING
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
STORED AS TEXTFILE
LOCATION 's3a://perro-cdp-bucket/perro-de-2/sqoop/cli/tables/vaccine';
CREATE EXTERNAL TABLE VACCINATION_RATE
(
location_id INT,
vaccine_id INT,
year INT,
rate FLOAT,
upper_limit FLOAT,
lower_limit FLOAT,
confidence_interval FLOAT,
sample_size INT,
target FLOAT
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
STORED AS TEXTFILE
LOCATION 's3a://perro-cdp-bucket/perro-de-2/sqoop/cli/tables/vaccination_rate';
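With the tables defined, open your Virtual Warehouse in CDW (Hue works well for this) and run a quick aggregation against the new VACCINE table. A minimal sketch of the kind of query behind the chart described next:
-- Count how many vaccines are administered at each age
SELECT vaccine_age,
       COUNT(*) AS vaccine_count
FROM vaccine
GROUP BY vaccine_age
ORDER BY vaccine_count DESC;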
Flip the result set to a chart in Hue, and you now have a bar chart that shows the number of vaccines by the age at which they are administered!
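From here, joining the three tables back together is a natural next step. As a rough sketch, something like the following pulls the reported rates alongside their location and vaccine names (the ORDER BY and LIMIT are just to keep the result set small):
-- Join rates back to their location and vaccine names
SELECT l.location_name,
       v.vaccine_name,
       r.year,
       r.rate
FROM vaccination_rate r
JOIN location l ON r.location_id = l.location_id
JOIN vaccine v ON r.vaccine_id = v.vaccine_id
ORDER BY r.year DESC, r.rate DESC
LIMIT 20;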