
Most data movement use cases do not require a “shuffle phase” for redistributing FlowFiles across a NiFi cluster, but there are a few cases where it is useful. For example:

  • ListFile -> FetchFile
  • ListHDFS -> FetchHDFS
  • ListFTP -> FetchFTP
  • GenerateTableFetch -> ExecuteSQL
  • GetSQS -> FetchS3

In each case, the flow starts with a processor that generates tasks to run (e.g. filenames) followed by the actual execution of those tasks. To scale, tasks need to run on each node in the NiFi cluster, but for consistency, the task generation should only run on the primary node. The solution is to introduce a shuffle (aka load balancing) step in between task generation and task execution.
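The pattern can be illustrated outside NiFi with a small simulation. This is only a conceptual sketch: the node names and the functions `list_tasks`, `shuffle`, `fetch`, and `run_flow` are hypothetical stand-ins, and round-robin assignment is an assumed policy, not a description of how NiFi distributes FlowFiles internally.

```python
from collections import defaultdict
from itertools import cycle

NODES = ["node-1", "node-2", "node-3"]  # hypothetical cluster members
PRIMARY = "node-1"                      # hypothetical primary node


def list_tasks(node, filenames):
    """Task generation: only the primary node emits tasks."""
    return list(filenames) if node == PRIMARY else []


def shuffle(tasks, nodes):
    """Spread tasks across nodes round-robin (an assumed policy)."""
    assignment = defaultdict(list)
    for task, node in zip(tasks, cycle(nodes)):
        assignment[node].append(task)
    return dict(assignment)


def fetch(node, tasks):
    """Task execution: every node works on the tasks it received."""
    return [f"{node} fetched {task}" for task in tasks]


def run_flow(filenames):
    # Every node runs the listing step, but only the primary produces tasks.
    tasks = [t for node in NODES for t in list_tasks(node, filenames)]
    assignment = shuffle(tasks, NODES)
    return {node: fetch(node, assigned) for node, assigned in assignment.items()}
```

Calling `run_flow` with six filenames gives each of the three nodes two fetches, while the listing itself is produced exactly once. Without the shuffle step, all six fetches would stay on the primary node.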

Processors can be configured to run only on the primary node by opening “View Configuration”, going to the “Scheduling” tab, and selecting “Primary node only” under “Execution”.

The shuffle step is not an explicit component on the NiFi canvas, but rather the combination of a Remote Input Port and a Remote Process Group pointing at the local cluster. FlowFiles that are sent to the Remote Process Group will be load balanced over Site-to-Site and come back into the flow via the Remote Input Port. Under “Manage Remote Ports” on the Remote Process Group there are batch settings that help control the load balancing.
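To see why the batch settings matter, consider a toy model of batched distribution. It assumes that each batch is sent whole to a single node and that nodes are chosen round-robin; both are simplifying assumptions, and `distribute` and its parameters are hypothetical names rather than NiFi settings.

```python
from collections import Counter
from itertools import cycle


def distribute(num_flowfiles, batch_count, nodes):
    """Count how many FlowFiles each node receives in the toy model."""
    if batch_count <= 0:
        raise ValueError("batch_count must be positive")
    counts = Counter({node: 0 for node in nodes})
    targets = cycle(nodes)
    remaining = num_flowfiles
    while remaining > 0:
        sent = min(batch_count, remaining)  # last batch may be smaller
        counts[next(targets)] += sent       # whole batch goes to one node
        remaining -= sent
    return dict(counts)


nodes = ["node-1", "node-2", "node-3"]
print(distribute(100, 100, nodes))  # one large batch
print(distribute(100, 10, nodes))   # ten smaller batches
```

In this model, a batch as large as the whole listing lands on a single node, whereas smaller batches spread the same 100 FlowFiles much more evenly. The trade-off is more transfers for the same amount of data.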

Here are two example flows that use this design pattern:

[Image: 91630-nifi-shuffle.png]

Version history: revision 2 of 2, last updated 08-17-2019 06:18 AM