Each node in a NiFi cluster has its own copy of the dataflow and executes it independently of the other nodes. Some NiFi components can write cluster state to ZooKeeper to avoid ingesting the same data on multiple nodes. Ingest-type components that support this should be configured to execute on the primary node only.
In a NiFi cluster, one node is elected as the primary node (and which node holds that role can change at any time). If a primary node change happens, the same component on the newly elected node gets scheduled and retrieves the cluster state, so it does not ingest the same data again. Typically, the component that records state does not retrieve the content itself. It simply generates the metadata/attributes needed to fetch the content later, with the expectation that your flow design distributes those FlowFiles across all nodes before content retrieval. For example:
- ListSFTP (primary node execution) --> success relationship (connection configured with Round Robin load balancing) --> FetchSFTP (all node execution)
ListSFTP creates a zero-byte FlowFile for each source file to be fetched. The FetchSFTP processor then uses that FlowFile's metadata/attributes to retrieve the actual source content and add it to the FlowFile.
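To illustrate the list/fetch split described above, here is a minimal Python sketch (not NiFi code) of the pattern: the listing step records state (the latest timestamp it has seen) and emits metadata-only records, while a separate fetch step uses that metadata to pull the content. The function names and data shapes here are hypothetical, chosen only to mirror how ListSFTP stores cluster state and FetchSFTP retrieves content.

```python
def list_new_files(remote_index, state):
    """Listing step (ListSFTP's role): emit metadata-only records for files
    newer than the stored timestamp, then advance the stored state.
    remote_index: dict of filename -> (mtime, content); state: mutable dict.
    """
    last = state.get("last_mtime", 0)
    new = [
        {"filename": name, "mtime": mtime}  # metadata only, no content (a "0-byte FlowFile")
        for name, (mtime, _content) in sorted(remote_index.items(), key=lambda kv: kv[1][0])
        if mtime > last
    ]
    if new:
        # In NiFi this state would be persisted to ZooKeeper as cluster state.
        state["last_mtime"] = max(f["mtime"] for f in new)
    return new

def fetch_file(remote_index, flowfile_attrs):
    """Fetch step (FetchSFTP's role): use the metadata to get the actual content."""
    _mtime, content = remote_index[flowfile_attrs["filename"]]
    return {**flowfile_attrs, "content": content}
```

Because the listing emits only metadata, those records are cheap to load-balance across nodes, and each node fetches content only for the records it receives.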
Another example of this pattern:
- GenerateTableFetch (primary node execution) --> LB connection --> ExecuteSQL (all node execution)
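GenerateTableFetch works the same way: it does not query the table contents itself, it generates one SQL statement per "page" of the table, and each statement travels as a small FlowFile that any node's ExecuteSQL can run. A rough Python sketch of that page generation (the function name and the exact SQL shape are illustrative assumptions, not GenerateTableFetch's actual output):

```python
def generate_table_fetches(table, key_column, max_value, page_size):
    """Emit one range-bounded SQL statement per page of the table.
    Each statement would ride in its own FlowFile, so the queries (not the
    rows) are what get load-balanced across the cluster."""
    queries = []
    for lower in range(0, max_value, page_size):
        upper = min(lower + page_size, max_value)
        queries.append(
            f"SELECT * FROM {table} "
            f"WHERE {key_column} > {lower} AND {key_column} <= {upper}"
        )
    return queries
```

Each node then executes only the queries routed to it, so the row data itself is pulled from the database in parallel rather than funneled through one node.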
The goal of these dataflows is to avoid having one node ingest all of the content (adding network and disk I/O) only to then add yet more network and disk I/O redistributing that content across all nodes. Instead, we simply gather details about the data to be fetched so those details can be distributed across all nodes, and each node then fetches only its specific portion of the source data.
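The distribution step itself can be pictured as simple round-robin assignment of the metadata-only FlowFiles to nodes, which is what a Round Robin load-balanced connection does conceptually. A small sketch (hypothetical function, not NiFi's internal implementation):

```python
from itertools import cycle

def round_robin_distribute(flowfiles, nodes):
    """Assign metadata-only FlowFiles to cluster nodes in turn, so each
    node later fetches only its share of the source content."""
    assignment = {node: [] for node in nodes}
    for flowfile, node in zip(flowfiles, cycle(nodes)):
        assignment[node].append(flowfile)
    return assignment
```

Because only small attribute-bearing FlowFiles cross the load-balanced connection, this redistribution is cheap; the heavy content transfer happens once, on the node that will actually process it.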
If you found this response assisted with your query, please take a moment to login and click on "Accept as Solution" below this post.