NiFi/HDF Dataflow Optimization
Who is This Guide For?
This guide is for Dataflow Managers (DFMs) who are very comfortable
with NiFi and have a good knowledge of the underlying system on which NiFi is
running. It also assumes that the user has applied the recommended changes in the
Configuration Best Practices section of the Admin Guide (https://nifi.apache.org/docs.html) and the
appropriate changes recommended in the setup best practices here:
https://community.hortonworks.com/content/kbentry/7882/hdfnifi-best-practices-for-setting-up-a-high-.... This guide
is not intended to be an exhaustive manual for dataflow optimization, but rather a
set of points to consider when optimizing your own dataflow.
What is meant by dataflow optimization?
Dataflow optimization isn't an exact science with hard and
fast rules that always apply. It is more of a balance between system resources
(memory, network, disk space, disk speed, and CPU), the number and size of the files
being processed, the types of processors used, the size of the dataflow that has been
designed, and the underlying configuration of NiFi on the system.
Group common functionality when and where it makes sense
A simple approach to dataflow optimization is to group
repeated operations into a Process Group. This optimizes the flow by removing
redundant operations: the data passes through the group once and then continues
through the flow. When the same processing is repeated in multiple places on the
graph, try to consolidate that functionality into a single group.
Use the fewest number of processors
The simplest approach to dataflow optimization is to use the
fewest processors possible to accomplish the task. How does this translate into
better NiFi performance? Many NiFi processors support batch processing per thread,
so splitting a large dataset across multiple processors that each work on a smaller
subset does not take advantage of that batch processing capability. For example,
4 GetSFTP processors, each pulling from a separate directory under the same parent
directory, with 20 files will use 4 threads, while 1 GetSFTP processor will get a
listing from the parent directory for all files in those subdirectories and then
pull them all under 1 thread.
Here is a flow snippet
illustrating this example:
In this case, with a single GetSFTP processor, make sure the
Search Recursively property is set to true:
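As a rough sketch (the hostname, parent directory, and batch size below are illustrative assumptions, not values from the original figure), the single GetSFTP processor might be configured along these lines:

    Hostname: sftp.example.com      (assumed remote host)
    Remote Path: /data/parent       (assumed parent directory containing the subdirectories)
    Search Recursively: true        (required so files in the subdirectories are listed)
    Max Selects: 100                (maximum number of files pulled per thread execution)

With this configuration a single thread lists the parent directory recursively and pulls the files from all of the subdirectories in one batch, rather than four threads each pulling from one subdirectory.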
Configuration of a RouteOnAttribute processor used to
separate data pulled by one GetSFTP processor from four different directories,
using the standard FlowFile attribute ${path}:
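A minimal sketch of that RouteOnAttribute configuration, assuming the four subdirectories are named dir1 through dir4 (placeholder names, not taken from the figure), adds one dynamic property per route and sets the Routing Strategy to Route to Property name:

    Routing Strategy: Route to Property name
    dir1: ${path:contains('dir1')}
    dir2: ${path:contains('dir2')}
    dir3: ${path:contains('dir3')}
    dir4: ${path:contains('dir4')}

Each dynamic property becomes a relationship, and a FlowFile is routed to whichever relationship's expression evaluates to true based on its ${path} attribute.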
If data were being pushed to NiFi, a similar way to identify
the data would be to use a user-supplied attribute. This method works for data
coming in from any type of processor that receives data into the flow. In the
example below, the data is coming in via a ListenHTTP processor from another NiFi
instance with a user-added ${remote-sensor} attribute. The receiving NiFi uses
that attribute to make a routing decision.
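A hedged sketch of that routing decision, assuming the upstream NiFi sets remote-sensor to values such as sensor1 and sensor2 (illustrative values only), would again use a RouteOnAttribute processor:

    Routing Strategy: Route to Property name
    sensor1: ${remote-sensor:equals('sensor1')}
    sensor2: ${remote-sensor:equals('sensor2')}

FlowFiles whose remote-sensor attribute matches neither expression follow the unmatched relationship.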
In addition, you can use an UpdateAttribute processor to tag data that is already
combined, so it can be managed through a shared downstream data path that makes
use of batch processing. Using a very similar example, say the data is coming in
from multiple sensor sources without the user-supplied tag already added. How
would NiFi identify the data? In the flow figure below, NiFi uses the
restlistener.remote.user.dn attribute to identify the source of the data and add
the appropriate tag:
Configuration of an UpdateAttribute processor to use the
restlistener.remote.user.dn attribute added by the ListenHTTP processor; the first
figure shows the properties tab of the processor, and the second figure shows the
rules inside the advanced tab:
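As a sketch of those rules (the DN values shown are assumptions; substitute the actual certificate DNs of your sensor sources), the advanced tab of the UpdateAttribute processor might contain one rule per source:

    Rule: tag-sensor1
        Condition: ${restlistener.remote.user.dn:contains('CN=sensor1')}
        Action: set attribute remote-sensor to sensor1
    Rule: tag-sensor2
        Condition: ${restlistener.remote.user.dn:contains('CN=sensor2')}
        Action: set attribute remote-sensor to sensor2

Each rule checks the client DN that ListenHTTP records when two-way SSL is used and adds the remote-sensor tag, so the data can then share the same downstream routing path shown earlier.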