Welcome to Part 3 of my article series on how to harness the hybrid cloud (see the parent article here). In this tutorial, you will learn how to use NiFi to download public datasets (US geocodes and the latest COVID-19 case numbers), store them in HDFS, and create the corresponding tables in Cloudera Data Warehouse (CDW).
This corresponds to step 3 of the series, as explained below:
Note: The anonymized NiFi flow can be found here.
Navigate to your Management Console > Data Warehouse, find your virtual warehouse, and download the Hive JDBC driver JAR:
Navigate to the Management console > your NiFi datahub > Hardware and note the public IPs of the NiFi nodes:
Using these public IPs (hidden here), upload the downloaded JAR to all three nodes (see the example commands below, using your workload user and password):
$ scp hive-jdbc-3.1.0-SNAPSHOT-standalone.jar [YOUR_USER]@[NODE_1]:/tmp/
$ scp hive-jdbc-3.1.0-SNAPSHOT-standalone.jar [YOUR_USER]@[NODE_2]:/tmp/
$ scp hive-jdbc-3.1.0-SNAPSHOT-standalone.jar [YOUR_USER]@[NODE_3]:/tmp/
$ ssh [YOUR_USER]@[NODE_1] chmod a+rw /tmp/hive-jdbc-3.1.0-SNAPSHOT-standalone.jar
$ ssh [YOUR_USER]@[NODE_2] chmod a+rw /tmp/hive-jdbc-3.1.0-SNAPSHOT-standalone.jar
$ ssh [YOUR_USER]@[NODE_3] chmod a+rw /tmp/hive-jdbc-3.1.0-SNAPSHOT-standalone.jar
Navigate to your Management Console > Data Warehouse, find your virtual warehouse, and copy the JDBC URL:
You should get something along these lines:
jdbc:hive2://[YOUR_CDW_HOST]/default;transportMode=http;httpPath=cliservice;ssl=true;retries=3
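Optionally, you can sanity-check this URL and your workload credentials before configuring NiFi by running a quick query with the beeline client (this assumes beeline is installed on a host that can reach CDW; [YOUR_PASSWORD] stands for your workload password):
$ beeline -u "jdbc:hive2://[YOUR_CDW_HOST]/default;transportMode=http;httpPath=cliservice;ssl=true;retries=3" \
    -n [YOUR_USER] -p [YOUR_PASSWORD] -e "SHOW DATABASES;"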
This tutorial requires the creation of two NiFi flows (one to map ZIP codes to latitude and longitude, and one to get the latest COVID-19 case numbers):
In this tutorial, I will only detail the configuration of one of the two flows as they are almost identical except for file/table/field names. The full example code is in this gist. Here is the overview of the US Geocode flow:
For this part, we first use a simple configuration of an InvokeHTTP processor:
The remote URL called is:
https://data.healthcare.gov/api/views/52wv-g36k/rows.csv?accessType=DOWNLOAD&api_foundry=true
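As a quick sanity check outside of NiFi, you can preview the first few lines of this dataset with curl from any machine with internet access:
$ curl -sL "https://data.healthcare.gov/api/views/52wv-g36k/rows.csv?accessType=DOWNLOAD&api_foundry=true" | head -n 3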
Then, we use an UpdateAttribute processor to set the filename and make sure we overwrite existing data:
Finally, we use a PutHDFS processor with the following parameters:
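The main PutHDFS properties to set are typically the Hadoop Configuration Resources, the target Directory, and the Conflict Resolution Strategy (replace, so that reruns overwrite the previous file). Once the flow has run, you can verify that the file landed using the HDFS CLI from a node where the client is configured (the directory below is only a placeholder for whatever you configured in PutHDFS):
$ hdfs dfs -ls /tmp/us_geocodes/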
For both the drop and create table statements, we first use a ReplaceText processor to set the query. For example:
with a Replacement Value of:
DROP TABLE IF EXISTS worldwidebank.us_geocodes_by_county;
DROP TABLE IF EXISTS worldwidebank.us_geocodes_by_county_ext;
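If you want to prototype the full DDL before pasting it into a ReplaceText value, you can run it from beeline (or Hue). The sketch below only illustrates the usual pattern of an external table over the uploaded CSV plus a managed copy; the column names and location are assumptions, so refer to the gist for the actual statements:
$ beeline -u "jdbc:hive2://[YOUR_CDW_HOST]/default;transportMode=http;httpPath=cliservice;ssl=true;retries=3" \
    -n [YOUR_USER] -p [YOUR_PASSWORD] <<'EOF'
CREATE DATABASE IF NOT EXISTS worldwidebank;
-- External table over the CSV uploaded by PutHDFS (columns and location assumed)
CREATE EXTERNAL TABLE IF NOT EXISTS worldwidebank.us_geocodes_by_county_ext (
  zip STRING, county STRING, latitude DOUBLE, longitude DOUBLE)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
STORED AS TEXTFILE
LOCATION '/tmp/us_geocodes/'
TBLPROPERTIES ('skip.header.line.count'='1');
-- Managed ORC copy used for querying
CREATE TABLE worldwidebank.us_geocodes_by_county STORED AS ORC
AS SELECT * FROM worldwidebank.us_geocodes_by_county_ext;
EOF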
Then, we use a PutHive3QL processor with default parameters:
The only thing that needs to be configured to make this work is the Hive3ConnectionPool controller service, configured as follows:
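In essence (property names assumed from the standard Hive3ConnectionPool service; the screenshot shows the exact values used here), the configuration boils down to:
Database Connection URL : the JDBC URL copied from CDW earlier
Database User           : [YOUR_USER] (workload user)
Password                : [YOUR_PASSWORD] (workload password)
Hive Configuration Resources : left empty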
After executing both flows, navigate to Hue from CDW and look at the data, as shown below:
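For example, running the query below in the Hue editor should return the freshly loaded geocode rows (the table name is the one created by this flow):
SELECT * FROM worldwidebank.us_geocodes_by_county LIMIT 10;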