Objective

This tutorial walks you through a NiFi flow that:

  • Uses the LookupRecord processor to parse NiFi provenance events in JSON format and add geolocation data
  • Uses the PartitionRecord processor to group like records by State
  • Publishes records originating from California to Kafka
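Before building the flow, it may help to see the record-level logic it performs sketched in plain Python. The field names (`ip`, `geo`, `state`) and the lookup table below are illustrative stand-ins for the real provenance fields and the MaxMind database backing the IPLookupService; they are not taken from the template.

```python
# Plain-Python sketch of what LookupRecord + PartitionRecord do in this flow.
# GEO_LOOKUP stands in for the MaxMind GeoLite2 database; the IPs are
# documentation-reserved addresses, used here purely for illustration.
import json
from collections import defaultdict

GEO_LOOKUP = {
    "203.0.113.10": {"city": "San Francisco", "state": "California"},
    "198.51.100.7": {"city": "Austin", "state": "Texas"},
}

def enrich(record):
    """LookupRecord step: add geolocation fields keyed on an IP field."""
    record["geo"] = GEO_LOOKUP.get(record.get("ip"), {})
    return record

def partition_by_state(records):
    """PartitionRecord step: group like records by the state they resolved to."""
    groups = defaultdict(list)
    for r in records:
        groups[r["geo"].get("state", "unknown")].append(r)
    return groups

events = [
    {"eventType": "RECEIVE", "ip": "203.0.113.10"},
    {"eventType": "SEND", "ip": "198.51.100.7"},
]
groups = partition_by_state(enrich(e) for e in events)

# Only the California partition would be published to the Kafka topic.
california_payload = json.dumps(groups.get("California", []))
```

In the actual flow, the final step is a PublishKafka-style processor sending the California partition to the "California" topic created above.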

This article is the first of a two-part series. In it, we set up the demo environment, including the flows, controller services and reporting tasks. The second article walks through the main flow step by step.

Environment

This tutorial was tested using the following environment and components:

  • Mac OS X 10.11.6
  • Apache NiFi 1.3.0
  • Apache Kafka 0.10.2.1

Environment Configuration

Kafka Setup

In the bin directory of your Kafka install:

Start ZooKeeper: ./zookeeper-server-start.sh ../config/zookeeper.properties

Start Kafka: ./kafka-server-start.sh ../config/server.properties

Create Kafka Topic: ./kafka-topics.sh --create --zookeeper localhost:2181 --replication-factor 1 --partitions 1 --topic California

Start Kafka Consumer: ./kafka-console-consumer.sh --zookeeper localhost:2181 --topic California --from-beginning

NiFi Instances Configuration

For this tutorial, you need two NiFi instances running. One instance generates and sends provenance data to the other via the SiteToSiteProvenanceReportingTask.

Instructions on how to setup both instances can be found in the HCC article "Extracting NiFi Provenance Data using SiteToSiteProvenanceReportingTask".

Main Dataflow Instance Setup

In the instance that will use the provenance data (http://localhost:8088/nifi), import the following template:

lookuprecord-geoenrich.xml

You should see the following flow on your NiFi canvas:

40645-1-lookuprecord-geoenrich-canvas.png

First, let's get the MaxMind Database file that is used to enrich the data. This is done by the flow contained within the "Gather Enrichment Data" process group.

40646-2-gatherenrichmentdata-flow.png

Run the flow and the file GeoLite2-City.mmdb should be downloaded locally into a directory named "enrichment-db" within your NiFi installation.

Now, let's enable the flow's controller services. Select the gear icon from the Operate Palette:

40647-3-operatepalette-configuration.png

This opens the NiFi Flow Configuration window. Select the Controller Services tab:

40648-4-controllerservices-disabled.png

Enable the AvroSchemaRegistry by selecting its lightning bolt icon. This allows you to enable the JsonTreeReader and JSONRecordSetWriter controller services. Select the lightning bolt icons for both of these services, as well as for the IPLookupService controller service. All the controller services should now be enabled:

40649-5-controllerservices-enabled.png
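For orientation, the schema registered in the AvroSchemaRegistry for the readers and writers might look roughly like the simplified sketch below. This is not the template's actual schema; the field names are illustrative, chosen to show how a nested geolocation record can be modeled.

```json
{
  "type": "record",
  "name": "provenanceEvent",
  "fields": [
    { "name": "eventId", "type": "string" },
    { "name": "eventType", "type": "string" },
    { "name": "timestampMillis", "type": "long" },
    { "name": "geo", "type": ["null", {
        "type": "record",
        "name": "geoEnrichment",
        "fields": [
          { "name": "city",  "type": ["null", "string"] },
          { "name": "state", "type": ["null", "string"] }
        ]
      }]
    }
  ]
}
```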

We will step through the main flow in detail in the second article. For now, start only the "Provenance In" input port.

40650-6-provenancein-started.png

Provenance Event Generating Instance

In the instance that generates provenance data (http://localhost:8080/nifi), import the following template:

fetchsites.xml

The following flow should be on the NiFi canvas:

40652-7-proveventflow.png

The two GetHTTP processors are configured as follows:

40653-8a-gethttp-properties.png

40654-8b-gethttp-properties.png

The UpdateAttribute processor is configured with all default settings:

40655-8c-updateattribute-properties.png

Now, let's create the SiteToSiteProvenance reporting task.

Select the Global menu and choose "Controller Settings":

40658-9a-global-controllersettings.png

Select the Reporting Tasks tab and click the "+" icon:

40659-9b-reportingtask-add.png

Select SiteToSiteProvenanceReportingTask and click the "Add" button:

40660-10-s2sprovrpttask-add.png

Configure the reporting task as follows:

40661-11-s2sprovrpttask-properties.png

On the Settings tab, set "Run Schedule" to 5 seconds:

40662-12-s2sprovrpttask-settings.png

(Note: Some of these settings are for demo purposes only and may need to be adjusted if run in a production environment.)
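For readers who cannot see the screenshots, the key reporting task properties in this demo are along these lines. The destination URL and port name follow from this setup (the 8088 instance consumes the events through its "Provenance In" port); the batch size is an assumed demo value, so verify all values against your own instance.

```
Destination URL     : http://localhost:8088/nifi   (the instance that consumes the events)
Input Port Name     : Provenance In                (the input port started earlier)
Batch Size          : 1000                         (assumed demo value; tune for production)
```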

Start the reporting task:

40663-13-s2sprovrpttask-start.png

Return to the NiFi canvas and start the flow to generate provenance data:

40664-14-proveventflow-start.png

Run the flow for about 30 seconds to generate sufficient provenance events, then stop the flow.

Switch to your other NiFi instance. You should see flowfiles queued after the "Provenance In" input port:

40665-15-lookuprecordflow-queue.png
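Each queued flowfile holds a JSON array of provenance events. An individual event has roughly the shape below; the field names are typical of SiteToSiteProvenanceReportingTask output, but the values here are invented for illustration, so inspect a real flowfile via List queue to see your actual data.

```json
{
  "eventId": "a0f4e2d1-0000-1000-8000-000000000001",
  "eventType": "RECEIVE",
  "timestampMillis": 1502901234567,
  "componentType": "GetHTTP",
  "transitUri": "https://example.com/"
}
```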

We are now ready to geo-enrich the provenance data.

Continue to the second article for a detailed walk through of the LookupRecord flow.

Version history: Revision 2 of 2. Last update: ‎08-17-2019 10:43 AM.