
This article demonstrates how easy it is to create a simple data flow with NiFi, stream the data to Hive, and visualize it via Zeppelin.

Prerequisites

  • Apache NiFi 1.1.0.2.1.0.0-165, included with Hortonworks DataFlow 2.1.0
  • Apache Zeppelin 0.6.0.2.5.0.0-1245, included with Hortonworks Data Platform 2.5.0
  • My repo for Apache NiFi: the "CSVToHive.xml" template, the customer demographics data files (customer_demographics.header, customer_demographics.csv), the "Customer Demographics.json" Apache Zeppelin notebook, and customer_demographics_orc_table_ddl.hql with the database and table DDLs
  • Apache Hive 1.2.1, included with HDP 2.5.0
  • Hive configured to support ACID transactions, with the demo database and customer_demographics table created using customer_demographics_orc_table_ddl.hql (a sketch of this DDL follows this list)
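
For reference, here is a minimal sketch of the kind of DDL Hive Streaming needs. The actual column list and bucket count come from customer_demographics_orc_table_ddl.hql in the repo, so treat the columns below as illustrative only:

    -- Illustrative sketch; see customer_demographics_orc_table_ddl.hql for the real DDL.
    CREATE DATABASE IF NOT EXISTS demo;

    -- Hive Streaming requires an ORC-backed, bucketed, transactional table.
    CREATE TABLE demo.customer_demographics (
      cd_demo_sk           BIGINT,
      cd_gender            STRING,
      cd_marital_status    STRING,
      cd_education_status  STRING,
      cd_purchase_estimate INT,
      cd_credit_rating     STRING
    )
    CLUSTERED BY (cd_demo_sk) INTO 4 BUCKETS
    STORED AS ORC
    TBLPROPERTIES ('transactional' = 'true');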

Steps

Import NiFi Template

Assuming NiFi is started and the UI is available at <NiFiMasterHostName>:8086/nifi, import the CSVToHive.xml template:

[Screenshots: importing the CSVToHive.xml template in the NiFi UI]

Create Data Folder and Upload Data Files

Create /home/<username>/customer_demographics in your home directory and upload the data files listed above. Grant the NiFi user access to the folder so it can read and process the files via the GetFile processor. Change the directory path in the GetFile processor to match yours, and set its "Keep Source File" property to false so each file is processed once and then deleted (for testing purposes I kept it set to true). You will also have to adjust the Hive Metastore URI to match your environment's host name.
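
As a quick reference, these are the processor properties touched in this step; the values are examples, so adjust them to your environment:

    GetFile
      Input Directory     /home/<username>/customer_demographics
      Keep Source File    false   (true only while testing)

    PutHiveStreaming
      Hive Metastore URI  thrift://<HiveMetastoreHostName>:9083
      Database Name       demo
      Table Name          customer_demographics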

[Screenshot: GetFile processor configuration]

Import Zeppelin Notebook

In the Zeppelin UI, import the "Customer Demographics.json" notebook from the repo listed above:

[Screenshot: importing the notebook in the Zeppelin UI]

Execute NiFi Flow

Start all processors, or start one processor at a time and follow the flow. The outcome is that each record of your CSV file is posted to the Hive demo.customer_demographics table via the Hive Streaming API. As you noticed from the DDL, the Hive table is transactional; enabling Hive's global ACID support and creating the table as transactional and bucketed is a requirement for this to work. Also, the PutHiveStreaming processor requires Avro input, which is why the flow converts the CSV to Avro. At one of the intermediary steps we could either infer the Avro schema or define it from the CSV file header; the latter option was chosen for this demo.
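
If streaming fails with transaction-related errors, double-check the ACID configuration. On HDP these settings normally live in hive-site.xml (managed via Ambari); the HiveQL below is a hedged sketch of the usual minimum, shown as per-session statements only for reference:

    -- Typical minimum ACID configuration for Hive Streaming (normally set
    -- cluster-wide in hive-site.xml rather than per session).
    SET hive.support.concurrency = true;
    SET hive.enforce.bucketing = true;
    SET hive.txn.manager = org.apache.hadoop.hive.ql.lockmgr.DbTxnManager;
    SET hive.compactor.initiator.on = true;
    SET hive.compactor.worker.threads = 1;

    -- Verify records are arriving while the flow runs:
    SELECT COUNT(*) FROM demo.customer_demographics;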

Execute Zeppelin Notebook

During the demo you can switch between NiFi and Zeppelin, showing how the data is posted to Hive and how it is reflected in Zeppelin by re-executing the HiveQL blocks. The markdown (md) and shell (sh) blocks were included only for demonstration purposes, showing how a data engineer, data analyst, or data scientist can benefit from using Zeppelin.
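
For illustration, a HiveQL paragraph similar to those in the notebook might look like this (the exact queries in "Customer Demographics.json" may differ, and the interpreter binding depends on your Zeppelin configuration):

    %jdbc(hive)
    -- Re-run while the NiFi flow is streaming to watch new records arrive:
    SELECT cd_gender, COUNT(*) AS records
    FROM demo.customer_demographics
    GROUP BY cd_gender;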

[Screenshots: Zeppelin notebook with its markdown, shell, and HiveQL paragraphs]