Support Questions

DGaboleiro · 10-20-2022

Hi,

I want to run a flow on two NiFi nodes. I'm fetching data from Cassandra on the primary node only, and then I use a DistributeLoad processor with the "round robin" strategy to distribute the data across the two nodes.

The rest of the processors are set to Execution on "All Nodes". Is the DistributeLoad redundant? Meaning, if I don't use the DistributeLoad but still use Execution on "All Nodes" for the rest of the processors, will I still get parallel processing across the nodes? Or are the FlowFiles only processed on the node they are currently on?

Thanks

MattWho · 10-21-2022

@DGaboleiro

I am a bit confused by your dataflow design.

In a NiFi multi-node cluster, each node is only aware of, and can only execute upon, the FlowFiles present on that one node. In your dataflow you have the QueryCassandra processor executing on "primary node" only, as you should (having it execute on all nodes would result in both of your nodes performing the same query and returning the same data). You then split that JSON and use a DistributeLoad processor, which appears to me to be a means of sending half of the FlowFiles to node 1 and the other half to node 2. This is not the best way to do this. You are running Apache NiFi 1.17, which means load-balanced connections are available and can accomplish the same thing without all these additional processors.
https://nifi.apache.org/docs/nifi-docs/html/user-guide.html#settings
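In the UI, the load balance strategy is chosen on the connection's Settings tab. For anyone scripting this instead, here is a minimal sketch of the request body for updating a connection through the NiFi REST API. It assumes the update goes to PUT /nifi-api/connections/{id} and that the connection entity carries a "loadBalanceStrategy" field with the values listed below; check both against the REST API documentation for your NiFi version. The function name and the connection id are hypothetical.

```python
import json

# Strategy names as I understand them from the NiFi REST API (assumption;
# verify against the API docs for your version).
VALID_STRATEGIES = {
    "DO_NOT_LOAD_BALANCE",
    "PARTITION_BY_ATTRIBUTE",
    "ROUND_ROBIN",
    "SINGLE_NODE",
}


def build_load_balance_update(connection_id, revision_version, strategy="ROUND_ROBIN"):
    """Build a JSON body for updating a connection's load balance strategy.

    Hypothetical helper: it only assembles the payload, it does not call NiFi.
    """
    if strategy not in VALID_STRATEGIES:
        raise ValueError(f"unknown load balance strategy: {strategy}")
    return {
        # NiFi uses the revision version for optimistic locking on updates.
        "revision": {"version": revision_version},
        "component": {
            "id": connection_id,
            "loadBalanceStrategy": strategy,
        },
    }


if __name__ == "__main__":
    body = build_load_balance_update("hypothetical-connection-id", 3)
    print(json.dumps(body, indent=2))
```

The payload would be sent for the connection that leaves the primary-node-only processor, so the FlowFiles are spread across the nodes before the "All Nodes" processors pick them up.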

After your FlowFiles (these are what move from processor to processor on your canvas) have been distributed, I see that you are using a MergeContent processor. The MergeContent processor can only merge the FlowFiles present on the same node. It will not merge FlowFiles from multiple nodes into a single FlowFile. So if your goal is one merge of all FlowFiles, distributing them across multiple nodes will not give you that outcome.
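To make the effect concrete, here is a toy simulation in plain Python (not NiFi code). The node names, FlowFile labels, and helper functions are all made up for illustration. It distributes four FlowFiles round robin over two nodes and then lets each node merge only what sits in its own queue.

```python
from itertools import cycle


def round_robin(flowfiles, nodes):
    """Assign each FlowFile to the next node in turn."""
    queues = {node: [] for node in nodes}
    for flowfile, node in zip(flowfiles, cycle(nodes)):
        queues[node].append(flowfile)
    return queues


def merge_per_node(queues):
    """Each node merges only the FlowFiles in its own queue."""
    return {node: "+".join(items) for node, items in queues.items() if items}


flowfiles = ["a", "b", "c", "d"]
merged = merge_per_node(round_robin(flowfiles, ["node1", "node2"]))
print(merged)  # {'node1': 'a+c', 'node2': 'b+d'}
```

Two nodes produce two merged results, never one. In this simulation a single merged result only appears when every FlowFile ends up on the same node before the merge step.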

You should never configure any processor that accepts an inbound connection for "primary node" only execution. This is important because which node is elected as primary node can change at any time. Execution strategy has nothing to do with the availability of FlowFiles on each node on which to execute. What is important to understand is that each node in your NiFi cluster has its own copy of the flow, its own Content and FlowFile repositories containing unique data, and each node executes the processors in its flow with no regard for the existence of other nodes. A node simply learns from ZooKeeper whether it has been elected as the cluster coordinator and/or primary node. If it is elected primary node, it will execute both "primary node" and "all nodes" components. If it is not the primary node, it will only execute the "all nodes" components.
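The scheduling rule in the last two sentences can be written down as a small Python sketch. This is an illustration of the rule only, not how NiFi is implemented; the function, the "PRIMARY"/"ALL" labels, and the example flow are assumptions made for the example.

```python
def components_to_run(components, is_primary):
    """Return the names of the components a node schedules.

    components: list of (name, execution) pairs, where execution is
    "PRIMARY" or "ALL" (labels invented for this sketch).
    """
    allowed = {"ALL"}
    if is_primary:
        allowed.add("PRIMARY")
    return [name for name, execution in components if execution in allowed]


# Illustrative flow loosely based on the one described in this thread.
flow = [
    ("QueryCassandra", "PRIMARY"),
    ("SplitJson", "ALL"),
    ("MergeContent", "ALL"),
]

print(components_to_run(flow, is_primary=True))
print(components_to_run(flow, is_primary=False))
```

Note that the rule says nothing about where the FlowFiles are. A non-primary node schedules its "all nodes" processors, but they have nothing to work on unless FlowFiles have been moved to that node.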

If you found that the provided solution(s) assisted you with your query, please take a moment to log in and click Accept as Solution below each response that helped.

Thank you,

Matt

Cloudera Community

Distribute Loading with Execution on "All Nodes"