Community Articles

pvidal · ‎12-02-2020

Welcome to Part 3 of my article series on how to harness the hybrid cloud series (see parent article here). In this tutorial you will learn to use NiFi to:

Pull information from public APIs
Push this raw data to secure S3 bucket using SDX integration
Create Hive tables on top of this data by connecting to CDW

This corresponds to step 3 of the series, as explained below:

Screen Shot 2020-12-02 at 6.49.09 PM.png

Note: The anonymized NiFi flow can be found here.

Prerequisites

Complete Part 2 of this article series
A NiFi Datahub in the environment you created for Part 1 and Part 2

Step 1: Prepare your NiFi Datahub for JDBC connection to Hive in CDW

Download driver jar

Navigate to your Management console > Data Warehouse > find your virtual warehouse and download the driver jar:

Screen Shot 2020-12-02 at 6.57.35 PM.png

Upload jar to NiFi nodes

Navigate to the Management console > your NiFi datahub > Hardware and note the public IPs of the NiFi nodes:

Using these public IPs (here hidden), upload the downloaded jar to all three nodes (see example code below, using your workload user and password)

$ scp hive-jdbc-3.1.0-SNAPSHOT-standalone.jar [YOUR_USER]@[NODE_1]:/tmp/
$ scp hive-jdbc-3.1.0-SNAPSHOT-standalone.jar [YOUR_USER]@[NODE_2]:/tmp/
$ scp hive-jdbc-3.1.0-SNAPSHOT-standalone.jar [YOUR_USER]@[NODE_3]:/tmp/

$ ssh [YOUR_USER]@[NODE_1] chmod a+rw /tmp/hive-jdbc-3.1.0-SNAPSHOT-standalone.jar
$ ssh [YOUR_USER]@[NODE_2] chmod a+rw /tmp/hive-jdbc-3.1.0-SNAPSHOT-standalone.jar
$ ssh [YOUR_USER]@[NODE_3] chmod a+rw /tmp/hive-jdbc-3.1.0-SNAPSHOT-standalone.jar

Copy the JDBC URL

Navigate to your Management console > Data Warehouse > find your virtual warehouse and copy the JDBC URL:

Screen Shot 2020-12-02 at 7.04.49 PM.png

You should get something along these lines:

jdbc:hive2://[YOUR_CDW_HOST]/default;transportMode=http;httpPath=cliservice;ssl=true;retries=3

Step 2: Configure the NiFi flow

This tutorial requires the creation of two NiFi flow (one to map zip codes to attitude and longitude, and one to get the latest covid cases numbers):

Screen Shot 2020-12-02 at 7.08.25 PM.png

In this tutorial, I will only detail the configuration of one of the two flows as they are almost identical except for file/table/field names. The full example code is in this gist. Here is the overview of the US Geocode flow:

Screen Shot 2020-12-02 at 7.10.34 PM.png

Get data from API to secure S3 using SDX

For this part, we first use a simple configuration of an invoke http:

Screen Shot 2020-12-02 at 7.12.22 PM.png

The remote URL called is:

https://data.healthcare.gov/api/views/52wv-g36k/rows.csv?accessType=DOWNLOAD&api_foundry=true

Then, we use a replace attribute to replace the filename and make sure we override data:

Screen Shot 2020-12-02 at 7.13.28 PM.png

Finally, we use a put HDFS with the following parameters:

Hadoop Configuration Resources: /etc/hadoop/conf.cloudera.core_settings/core-site.xml
Kerberos Principal: [your workload user]
Kerberos Password: [your workload password]
Directory: s3a://[your env bucket]/raw/geocodes_by_county/
Conflict Resolution Strategy: replace

Screen Shot 2020-12-02 at 7.13.37 PM.png

Drop and create tables

For both drop and create tables, we first use a Replace Text to send the query. For example:

Screen Shot 2020-12-02 at 7.17.54 PM.png

with replacement value of:

drop TABLE if exists worldwidebank.us_geocodes_by_county;
drop TABLE if exists worldwidebank.us_geocodes_by_county_ext;

Then we use a puthive3QL with default parameters:

Screen Shot 2020-12-02 at 7.19.12 PM.png

The only thing needed to configure to make this work is the Hive3ConnectionPool, configured as follows:

Database Connection URL: [your JDBC URL]
Database User: [your workload user]
Password: [your workload password]

Screen Shot 2020-12-02 at 7.20.33 PM.png

Step 3: Verify Data Creation

After executing both flow, navigate to Hue from CDW and look at the data, as such:

Screen Shot 2020-12-02 at 7.22.51 PM.png

Cloudera Community