NiFi/HDF Dataflow Optimization
Who is This Guide For?
This guide is for Dataflow Managers (DFMs) who are very comfortable
with NiFi and have a good knowledge of the underlying system on which NiFi is
running. It also assumes that the user has applied the recommended changes in the
Configuration Best Practices section of the Admin Guide (https://nifi.apache.org/docs.html) and the
appropriate changes recommended in the setup best practices here:
https://community.hortonworks.com/content/kbentry/7882/hdfnifi-best-practices-for-setting-up-a-high-.... This guide
is not intended to be an exhaustive manual for dataflow optimization, but rather a
set of points to consider when optimizing your own dataflow.
What is meant by dataflow optimization?
Dataflow optimization isn't an exact science with hard and
fast rules that always apply. It is more of a balance between system resources
(memory, network, disk space, disk speed, and CPU), the number and size of the files
being processed, the types of processors used, the size of the dataflow that has been
designed, and the underlying configuration of NiFi on the system.
Group common functionality when and where it makes sense
A simple approach to dataflow optimization is to group
repeated operations into a Process Group. This optimizes the flow by removing
redundant operations: the data passes through the group once and then continues
through the flow. When the same processing is repeated in multiple places on the
graph, try to consolidate that functionality into a single group.
Use the fewest number of processors
The simplest approach to dataflow optimization is to use the
fewest processors possible to accomplish the task. How does this translate into
better NiFi performance? Many NiFi processors support batch processing per thread,
so splitting a large dataset across multiple processors that each work on a smaller
subset does not take advantage of that batch processing capability. For example,
4 GetSFTP processors, each pulling from a separate directory under the same parent
directory, with 20 files will use 4 threads, while 1 GetSFTP processor will get a
listing from the parent directory for all files in those subdirectories and then
pull them all under 1 thread.
Here is a flow snippet
illustrating this example:
In this case, with a single GetSFTP processor, make sure the
Search Recursively property is set to true:
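As a rough sketch (the hostname, parent directory, and batch size below are illustrative assumptions, not values from the original figure), the single GetSFTP processor might be configured along these lines:

    Hostname: sftp.example.com      (assumed remote host)
    Remote Path: /data/parent       (assumed parent directory containing the subdirectories)
    Search Recursively: true        (required so files in the subdirectories are listed)
    Max Selects: 100                (maximum number of files pulled per thread execution)

With this configuration a single thread lists the parent directory recursively and pulls the files from all of the subdirectories in one batch, rather than four threads each pulling from one subdirectory.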
Configuration of a RouteOnAttribute processor used to
separate data pulled by one GetSFTP processor from four different directories,
using the standard FlowFile attribute ${path}:
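A minimal sketch of that RouteOnAttribute configuration, assuming the four subdirectories are named dir1 through dir4 (placeholder names, not taken from the figure), adds one dynamic property per route and sets the Routing Strategy to Route to Property name:

    Routing Strategy: Route to Property name
    dir1: ${path:contains('dir1')}
    dir2: ${path:contains('dir2')}
    dir3: ${path:contains('dir3')}
    dir4: ${path:contains('dir4')}

Each dynamic property becomes a relationship, and a FlowFile is routed to whichever relationship's expression evaluates to true based on its ${path} attribute.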
If data were being pushed to NiFi, a similar way to identify
the data would be to use a user-supplied attribute. This method works for data
coming in from any type of processor that receives data into the flow. In the
example below, the data is coming in via a ListenHTTP processor from another NiFi
instance with a user-added ${remote-sensor} attribute. The receiving NiFi uses
that attribute to make a routing decision.
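A hedged sketch of that routing decision, assuming the upstream NiFi sets remote-sensor to values such as sensor1 and sensor2 (illustrative values only), would again use a RouteOnAttribute processor:

    Routing Strategy: Route to Property name
    sensor1: ${remote-sensor:equals('sensor1')}
    sensor2: ${remote-sensor:equals('sensor2')}

FlowFiles whose remote-sensor attribute matches neither expression follow the unmatched relationship.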
In addition, you can use an UpdateAttribute processor to tag data that is already
combined, so it can be managed through a shared downstream data path that makes
use of batch processing. Using a very similar example, say the data is coming in
from multiple sensor sources without the user-supplied tag already added. How
would NiFi identify the data? In the flow figure below, NiFi uses the
restlistener.remote.user.dn attribute to identify the source of the data and add
the appropriate tag:
Configuration of an UpdateAttribute processor to use the
restlistener.remote.user.dn attribute added by the ListenHTTP processor; the first
figure shows the properties tab of the processor, and the second figure shows the
rules inside the advanced tab:
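As a sketch of those rules (the DN values shown are assumptions; substitute the actual certificate DNs of your sensor sources), the advanced tab of the UpdateAttribute processor might contain one rule per source:

    Rule: tag-sensor1
        Condition: ${restlistener.remote.user.dn:contains('CN=sensor1')}
        Action: set attribute remote-sensor to sensor1
    Rule: tag-sensor2
        Condition: ${restlistener.remote.user.dn:contains('CN=sensor2')}
        Action: set attribute remote-sensor to sensor2

Each rule checks the client DN that ListenHTTP records when two-way SSL is used and adds the remote-sensor tag, so the data can then share the same downstream routing path shown earlier.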