Community Articles

shjelmfelt · ‎10-04-2018

Most data movement use cases do not require a “shuffle phase” for redistributing FlowFiles across a NiFi cluster, but there are a few cases where it is useful. For example:

ListFile -> FetchFile
ListHDFS -> FetchHDFS
ListFTP -> FetchFTP
GenerateTableFetch -> ExecuteSQL
GetSQS -> FetchS3Object

In each case, the flow starts with a processor that generates tasks to run (e.g. filenames) followed by the actual execution of those tasks. To scale, tasks need to run on each node in the NiFi cluster, but for consistency, the task generation should only run on the primary node. The solution is to introduce a shuffle (aka load balancing) step in between task generation and task execution.
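
As a rough illustration of why the split matters, the sketch below compares running task generation on every node with running it once and spreading the tasks out. It is plain Python, not NiFi code: the node names, the file names, and the round-robin assignment are all assumptions made for the example, and it assumes a shared source (such as HDFS or an FTP server) that every node would see identically.

```python
from collections import Counter

def list_tasks(listing):
    """Task generation: emit one task (here, a filename) per item."""
    return list(listing)

def run_everywhere(nodes, listing):
    """Naive scaling: every node lists the shared source, so each task is produced once per node."""
    produced = Counter()
    for _node in nodes:
        for task in list_tasks(listing):
            produced[task] += 1
    return produced

def run_with_shuffle(nodes, listing):
    """Shuffle pattern: list once (primary node only), then spread tasks across all nodes."""
    assignments = {node: [] for node in nodes}
    tasks = list_tasks(listing)  # stands in for the primary-node-only processor
    for i, task in enumerate(tasks):
        # Round-robin is an assumption; it stands in for the load balancing step.
        assignments[nodes[i % len(nodes)]].append(task)
    return assignments

nodes = ["node-1", "node-2", "node-3"]           # hypothetical cluster
files = [f"file-{i}.csv" for i in range(6)]      # hypothetical listing

duplicates = run_everywhere(nodes, files)
balanced = run_with_shuffle(nodes, files)

print(duplicates["file-0.csv"])  # number of times the same task was generated
print(balanced)                  # each task appears on exactly one node
```

In the first variant every file is handled once per node; in the second each file is handled exactly once, and the work is still spread over the whole cluster.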

Processors can be configured to run on the primary node by going to “View Configuration” -> “Scheduling” and selecting “Primary node only” under “Execution”.

The shuffle step is not an explicit component on the NiFi canvas, but rather the combination of a Remote Input Port and a Remote Process Group pointing at the local cluster. FlowFiles that are sent to the Remote Process Group will be load balanced over Site-to-Site and come back into the flow via the Remote Input Port. Under “Manage Remote Ports” on the Remote Process Group there are batch settings that help control the load balancing.
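
To see why the batch settings matter, here is a small simulation of how the batch size can affect how evenly FlowFiles land on the nodes. It is a simplified model, not NiFi's actual Site-to-Site peer selection: it assumes each batch goes to a single node and that nodes are picked in strict rotation. The `distribute` function and its `batch_count` parameter are hypothetical names used only for this sketch.

```python
def distribute(flowfiles, nodes, batch_count):
    """Send FlowFiles in batches of `batch_count`, rotating the target node per batch."""
    received = {node: 0 for node in nodes}
    for batch_index, start in enumerate(range(0, len(flowfiles), batch_count)):
        batch = flowfiles[start:start + batch_count]
        # Assumption: one whole batch is delivered to one node.
        target = nodes[batch_index % len(nodes)]
        received[target] += len(batch)
    return received

nodes = ["node-1", "node-2", "node-3"]   # hypothetical cluster
flowfiles = list(range(90))              # 90 queued FlowFiles

small = distribute(flowfiles, nodes, batch_count=1)
large = distribute(flowfiles, nodes, batch_count=60)

print(small)  # small batches spread the FlowFiles evenly
print(large)  # large batches can leave a node with nothing to do
```

Under these assumptions, small batches spread a short queue evenly, while a batch larger than the queue's fair share per node leaves some nodes idle. Larger batches reduce per-transfer overhead, so the setting is a trade-off between throughput and evenness.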

Here are two example flows that use this design pattern:

Cloudera Community

NiFi Shuffle - Design Pattern

Apache NiFi