Community Articles

TimothySpann · ‎11-03-2016

Starting My Hadoop Tools

NiFi can interface directly with Hive, HDFS, HBase, Flume and Phoenix. And I can also trigger Spark and Flink through Kafka and Site-To-Site. Sometimes I need to run some Pig scripts. Apache Pig is very stable and has a lot of functions and tools that make for some smart processing. You can easily augment and add this piece to a larger pipeline or part of the process.

Pig Setup

I like to use Ambari to install the HDP 2.5 clients on my NiFi box to have access to all the tools I may need.

Then I can just do:

yum install pig

Pig to Apache NiFi 1.0.0

ExecuteProcess

We call a shell script that wraps the Pig script.

Output of script is stored to HDFS: hdfs dfs -ls /nifi-logs

Shell Script

export JAVA_HOME=/opt/jdk1.8.0_101/ 
pig -x local -l /tmp/pig.log -f /opt/demo/pigscripts/test.pig

You can run in different Pig modes like local, mapreduce and tez. You can also pass in parameters or the script.

Pig Script

messages = LOAD '/opt/demo/HDF/centos7/tars/nifi/nifi-1.0.0.2.0.0.0-579/logs/nifi-app.log';
warns = FILTER messages BY $0 MATCHES '.*WARN+.*';
DUMP warns
store warns into 'warns.out'

This is a basic example from the internet, with the NIFI 1.0 log used as the source.

As an aside, I run a daily script with the schedule 1 * * * * ? to clean up my logs.

Simply: /bin/rm -rf /opt/demo/HDF/centos7/tars/nifi/nifi-1.0.0.2.0.0.0-579/logs/*2016*

PutHDFS

Hadoop Configuration: /etc/hadoop/conf/core-site.xml

Pick a directory and store away.

Results

HadoopVersionPigVersionUserIdStartedAtFinishedAtFeatures
2.7.3.2.5.0.0-12450.16.0.2.5.0.0-1245root2016-11-03 19:53:572016-11-03 19:53:59FILTER
Success!
Job Stats (time in seconds):
JobIdMapsReducesMaxMapTimeMinMapTimeAvgMapTimeMedianMapTimeMaxReduceTimeMinReduceTimeAvgReduceTimeMedianReducetimeAliasFeatureOutputs
job_local72884441_000110n/an/an/an/a0000messages,warnsMAP_ONLYfile:/tmp/temp1540654561/tmp-600070101,
Input(s):
Successfully read 30469 records from: "/opt/demo/HDF/centos7/tars/nifi/nifi-1.0.0.2.0.0.0-579/logs/nifi-app.log"
Output(s):
Successfully stored 1347 records in: "file:/tmp/temp1540654561/tmp-600070101"
Counters:
Total records written : 1347
Total bytes written : 0
Spillable Memory Manager spill count : 0
Total bags proactively spilled: 0
Total records proactively spilled: 0
Job DAG:
job_local72884441_0001

Reference:

Cloudera Community

Community Articles

Running Apache Pig Scripts from Apache NiFi and Storing the Results in HDFS

Apache NiFi

Apache Pig

How to use query result in another query Apache Ni...

Integrating Apache Spark 2.x Jobs with Apache NiFi...

Versioned DataFlows with Apache NiFi 1.5 and Apach...

Apache NiFi 1.1.0 on Docker

DevOps Tips: Using the Apache NiFi Toolkit with A...

Record based processors in Apache NiFi 1.2

Apache Deep Learning 101: Processing Apache MXNet...

Executing TensorFlow Classifications from Apache N...

Detecting Language with Apache NiFi

Apache NiFi - Part 2 (Twitter Flow)