Community Articles

TimothySpann · ‎01-11-2017

There are a number of command-line tools that I like to call from NiFi as part of a Big Data flow.

The first tool I wanted to use was Ghostscript.

On CentOS/RHEL, you can install it with:

yum install ghostscript

I use Ghostscript to extract text from PDFs; the PDF can be passed in from an existing flow using ExecuteStreamCommand. Ghostscript then writes the text from those files to standard output.

run.sh

# Extract the text of pages 1-500 from the given PDF(s) and write it to standard output.
gs -dBATCH -dNOPAUSE -sDEVICE=txtwrite -dFirstPage=1 -dLastPage=500 -sOutputFile=- "$@"
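ExecuteStreamCommand can also hand the flow file content to the command on standard input rather than as a file path. The following is a minimal sketch of a variant for that case, not part of the original flow: the function name pdf_stdin_to_text is hypothetical, and it assumes mktemp is available. It spools standard input to a temporary file so that Ghostscript gets a regular file to read, then uses the same switches as run.sh.

```shell
# Hypothetical variant of run.sh: read one PDF from standard input
# instead of taking a file path as an argument.
pdf_stdin_to_text() {
  # Spool stdin to a temporary file so Ghostscript can seek within the PDF.
  tmp=$(mktemp) || return 1
  cat > "$tmp"

  # Same switches as run.sh; the extracted text goes to standard output.
  gs -dBATCH -dNOPAUSE -sDEVICE=txtwrite -dFirstPage=1 -dLastPage=500 -sOutputFile=- "$tmp"
  status=$?

  # Clean up the temporary file and report Ghostscript's exit status.
  rm -f "$tmp"
  return "$status"
}
```

In a script called by ExecuteStreamCommand, the last line would simply be a call to pdf_stdin_to_text with no arguments.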

Output from running it on the Hadoop documentation PDF:

GPL Ghostscript 9.07 (2013-02-14)
Copyright (C) 2012 Artifex Software, Inc. All rights reserved.
This software comes with NO WARRANTY: see the file PUBLIC for details.
Processing pages 1 through 8.
Page 1
Can't find (or can't open) font file NimbusSanL-ReguItal.
Querying operating system for font files...
Loading NimbusSanL-ReguItal font from /usr/share/fonts/default/Type1/n019023l.pfb... 3984660 2473586 2498328 1197484 3 done.
Loading NimbusSanL-Bold font from /usr/share/fonts/default/Type1/n019004l.pfb... 4025652 2573865 2498328 1199061 3 done.
Loading NimbusRomNo9L-Regu font from /usr/share/fonts/default/Type1/n021003l.pfb... 4072164 2714750 2518512 1214631 3 done.
Welcome to Apache™ Hadoop®!
Table of contents
1 What Is Apache Hadoop?.................................................................................................. 2
2 Getting Started .................................................................................................................. 3
3 Download Hadoop..............................................................................................................3
4 Who Uses Hadoop?............................................................................................................3
5 News................................................................................................................................... 3
Copyright © 2014 The Apache Software Foundation. All rights reserved.
Page 2
Loading NimbusSanL-Regu font from /usr/share/fonts/default/Type1/n019003l.pfb... 4243344 2902744 2478144 1170708 3 done.
Loading NimbusRomNo9L-Medi font from /usr/share/fonts/default/Type1/n021004l.pfb... 4410848 3063836 2518512 1208517 3 done.
Welcome to Apache™ Hadoop®!
1 What Is Apache Hadoop?
The Apache™ Hadoop® project develops open-source software for reliable, scalable,
distributed computing.
The Apache Hadoop software library is a framework that allows for the distributed
processing of large data sets across clusters of computers using simple programming models.
It is designed to scale up from single servers to thousands of machines, each offering local
computation and storage. Rather than rely on hardware to deliver high-availability, the
library itself is designed to detect and handle failures at the application layer, so delivering
a highly-available service on top of a cluster of computers, each of which may be prone to
failures.
The project includes these modules:
• Hadoop Common: The common utilities that support the other Hadoop modules.
• Hadoop Distributed File System (HDFS™): A distributed file system that provides
high-throughput access to application data.
• Hadoop YARN: A framework for job scheduling and cluster resource management.
• Hadoop MapReduce: A YARN-based system for parallel processing of large data sets.
Other Hadoop-related projects at Apache include:

You will probably want to clean this up a little and remove some of the formatting. This can be done in NiFi, or later in Hive or Phoenix. Alternatively, you could send it as a message through Kafka and process it with Apache Storm, Apache Spark or other streaming tools.
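As one possible first pass at that cleanup, here is a rough sketch of a shell filter that could sit in the same script, after the gs command. The function name clean_gs_output is hypothetical, and the patterns are taken only from the sample output above, so other PDFs or Ghostscript versions may need different ones. It drops the Ghostscript banner, the page markers and the font-loading messages, and strips the dot leaders and page numbers from table-of-contents lines.

```shell
# Hypothetical filter: reads Ghostscript txtwrite output on stdin and
# writes a cleaned copy to stdout.
# Caveat: a line of real document text that is exactly "Page <number>"
# would also be removed.
clean_gs_output() {
  sed -E \
    -e '/^GPL Ghostscript /d' \
    -e '/^Copyright \(C\) .*Artifex/d' \
    -e '/^This software comes with NO WARRANTY/d' \
    -e '/^Processing pages /d' \
    -e '/^Page [0-9]+$/d' \
    -e "/^Can't find \(or can't open\) font file/d" \
    -e '/^Querying operating system for font files/d' \
    -e '/^Loading .* font from /d' \
    -e 's/\.{5,} *[0-9]+$//'
}
```

With the filter defined in run.sh, the gs line would be piped through it, for example by appending "| clean_gs_output" to the command.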

Fans of old UNIX will remember fortune and its fortunes. It is still available to install on CentOS:

yum install fortune-mod.x86_64

Below are the results of a flow calling fortune. It requires no parameters; just enter fortune as the command. It writes its output to the console, which I extract to an attribute using the regular expression (.*).+ and then convert to a JSON file for storage in HDFS.

Output JSON

{"fortune":"My little brother got this fortune"}

Reference:

ghost.xml (NiFi template)

Basic Image Processing and Linux Utilities As Part of a Big Data Flow