Created on 10-04-201808:11 PM - edited 08-17-201906:18 AM
Most
data movement use cases do not require a “shuffle phase” for redistributing
FlowFiles across a NiFi cluster, but there are few cases where it is useful.
For example:
ListFile
-> FetchFile
ListHDFS
-> FetchHDFS
ListFTP
-> FetchFTP
GenerateTableFetch
-> ExecuteSQL
GetSQS
-> FetchS3
In
each case, the flow starts with a processor that generates tasks to run (e.g.
filenames) followed by the actual execution of those tasks. To scale, tasks
need to run on each node in the NiFi cluster, but for consistency, the task
generation should only run on the primary node. The solution is to introduce a
shuffle (aka load balancing) step in between task generation and task
execution.
Processors
can be configured to run on the primary node by going to “View
Configuration”-> “Scheduling” and selecting “Primary node only” under
“Execution”.
The
shuffle step is not an explicit component on the NiFi canvas, but rather the
combination of a Remote Input Port and a Remote Process Group pointing at the
local cluster. FlowFiles that are sent to the Remote Process Group will be load
balanced over Site-to-Site and come back into the flow via the Remote Input
Port. Under “Manage Remote Ports” on the Remote Process Group there are batch
settings that help control the load balancing.
Here
are two example flows that use this design pattern: