
This article demonstrates how easy it is to create a simple data flow with NiFi, stream the data to Hive, and visualize it via Zeppelin.

Prerequisites

  • Apache NiFi 1.1.0.2.1.0.0-165, included with Hortonworks DataFlow 2.1.0
  • Apache Zeppelin 0.6.0.2.5.0.0-1245, included with Hortonworks Data Platform 2.5.0
  • My repo for Apache NiFi: the "CSVToHive.xml" template, the customer demographics data files (customer_demographics.header, customer_demographics.csv), the "Customer Demographics.json" Apache Zeppelin notebook, and customer_demographics_orc_table_ddl.hql with the database and table DDLs
  • Apache Hive 1.2.1, included with HDP 2.5.0
  • Hive configured to support ACID transactions, with the demo database and customer_demographics table created using customer_demographics_orc_table_ddl.hql (a sketch of this DDL follows this list)
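
For reference, here is a minimal sketch of the kind of DDL Hive Streaming needs. The actual column list and bucket count come from customer_demographics_orc_table_ddl.hql in the repo, so treat the columns below as illustrative only:

    -- Illustrative sketch; see customer_demographics_orc_table_ddl.hql for the real DDL.
    CREATE DATABASE IF NOT EXISTS demo;

    -- Hive Streaming requires an ORC-backed, bucketed, transactional table.
    CREATE TABLE demo.customer_demographics (
      cd_demo_sk           BIGINT,
      cd_gender            STRING,
      cd_marital_status    STRING,
      cd_education_status  STRING,
      cd_purchase_estimate INT,
      cd_credit_rating     STRING
    )
    CLUSTERED BY (cd_demo_sk) INTO 4 BUCKETS
    STORED AS ORC
    TBLPROPERTIES ('transactional' = 'true');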

Steps

Import NiFi Template

Assuming NiFi is started and the UI is available at <NiFiMasterHostName>:8086/nifi, import the CSVToHive.xml template:

[Screenshots: importing the CSVToHive.xml template in the NiFi UI]

Create Data Folder and Upload Data Files

Create /home/<username>/customer_demographics in your home directory and upload the data files listed above. Grant the NiFi user access to the folder so it can read and process the files via the GetFile processor. Change the directory path in the GetFile processor to match yours, and set its "Keep Source File" property to false so each file is processed once and then deleted (for testing purposes I kept it set to true). You will also have to adjust the Hive Metastore URI to match your environment's host name.
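
As a quick reference, these are the processor properties touched in this step; the values are examples, so adjust them to your environment:

    GetFile
      Input Directory     /home/<username>/customer_demographics
      Keep Source File    false   (true only while testing)

    PutHiveStreaming
      Hive Metastore URI  thrift://<HiveMetastoreHostName>:9083
      Database Name       demo
      Table Name          customer_demographics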

[Screenshot: GetFile processor configuration]

Import Zeppelin Notebook

In the Zeppelin UI, import the "Customer Demographics.json" notebook from the repo listed above:

[Screenshot: importing the notebook in the Zeppelin UI]

Execute NiFi Flow

Start all processors, or start one processor at a time and follow the flow. The outcome is that each record of your CSV file is posted to the Hive demo.customer_demographics table via the Hive Streaming API. As you noticed from the DDL, the Hive table is transactional; enabling Hive's global ACID support and creating the table as transactional and bucketed is a requirement for this to work. Also, the PutHiveStreaming processor requires Avro input, which is why the flow converts the CSV to Avro. At one of the intermediary steps we could either infer the Avro schema or define it from the CSV file header; the latter option was chosen for this demo.
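
If streaming fails with transaction-related errors, double-check the ACID configuration. On HDP these settings normally live in hive-site.xml (managed via Ambari); the HiveQL below is a hedged sketch of the usual minimum, shown as per-session statements only for reference:

    -- Typical minimum ACID configuration for Hive Streaming (normally set
    -- cluster-wide in hive-site.xml rather than per session).
    SET hive.support.concurrency = true;
    SET hive.enforce.bucketing = true;
    SET hive.txn.manager = org.apache.hadoop.hive.ql.lockmgr.DbTxnManager;
    SET hive.compactor.initiator.on = true;
    SET hive.compactor.worker.threads = 1;

    -- Verify records are arriving while the flow runs:
    SELECT COUNT(*) FROM demo.customer_demographics;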

Execute Zeppelin Notebook

During the demo you can switch between NiFi and Zeppelin, showing how the data is posted to Hive and how it is reflected in Zeppelin by re-executing the HiveQL blocks. The markdown (md) and shell (sh) blocks were included only for demonstration purposes, showing how a data engineer, data analyst, or data scientist can benefit from using Zeppelin.
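
For illustration, a HiveQL paragraph similar to those in the notebook might look like this (the exact queries in "Customer Demographics.json" may differ, and the interpreter binding depends on your Zeppelin configuration):

    %jdbc(hive)
    -- Re-run while the NiFi flow is streaming to watch new records arrive:
    SELECT cd_gender, COUNT(*) AS records
    FROM demo.customer_demographics
    GROUP BY cd_gender;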

[Screenshots: Zeppelin notebook with its markdown, shell, and HiveQL paragraphs]