Support Questions

GrazittiAPI · ‎06-28-2022

Hi,

I have a 3-node NiFi cluster.

I would like to ask whether each node of the cluster can output different results (i.e. no duplicates) from one another.

SCENARIO:

Let's say I have 10 records in a DB.

If node1 outputs the first record, I do not want node2 or node3 to output the first record again.

Thank you.

SAMSAL · ‎06-28-2022

Hi,

It all depends on how the output is generated. You probably need the processor that fetches the DB records to run on the primary node first, and then do load balancing downstream so that each record is processed on a different node.

rafy · ‎06-28-2022

Thank you sir.

I thought as much. I was just wondering whether there might be a better way to go about it.

Please, do you have any other method?

MattWho · ‎06-29-2022

@rafy
Each node in a NiFi cluster has its own copy of the dataflow and executes it independently of the other nodes. Some NiFi components can write cluster state to ZooKeeper to avoid data duplication across nodes. Ingest-type components that support this should be configured to execute on the primary node only.

In a NiFi cluster, one node is elected as the primary node (which node is elected can change at any time). If the primary node changes, the same component on a different node gets scheduled and retrieves the cluster state to avoid ingesting the same data again. Often, the component that records state does not retrieve the content itself. It simply generates the metadata/attributes needed to get the content later, with the expectation that your flow design distributes those FlowFiles across all nodes before content retrieval.
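As a rough illustration of the state handoff described above, here is a toy Python model. It is not NiFi code: the dict stands in for the cluster state held in ZooKeeper, and the key name `last_listed_id`, the node names, and the batch sizes are all hypothetical.

```python
# Toy model only: a dict stands in for the cluster state NiFi keeps in ZooKeeper.
cluster_state = {"last_listed_id": 0}  # hypothetical state key

SOURCE_IDS = list(range(1, 11))  # ten source records, ids 1..10


def list_new_ids(node, batch_size):
    """Run one listing pass on `node`, resuming from the shared state."""
    start = cluster_state["last_listed_id"]
    new_ids = [i for i in SOURCE_IDS if i > start][:batch_size]
    if new_ids:
        # Record how far the listing got so any later primary can resume here.
        cluster_state["last_listed_id"] = new_ids[-1]
    return [(node, i) for i in new_ids]


# node1 is primary and lists the first four records...
listed = list_new_ids("node1", batch_size=4)
# ...then the primary role moves to node2, which resumes from the stored state.
listed += list_new_ids("node2", batch_size=10)

listed_ids = [record_id for _, record_id in listed]
print(listed)
```

Because node2 reads the stored position before listing, it picks up at record 5 and no record is listed twice.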

For example:
- ListSFTP (primary node execution) --> success connection (with round robin LB configuration) --> FetchSFTP (all node execution)

ListSFTP creates a 0-byte FlowFile for each source file to be fetched. The FetchSFTP processor uses that metadata/attributes to get the actual source content and add it to the FlowFile.

Another example, for your query, might be:
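The list-then-fetch pattern can be pictured with a small Python sketch. This only models the idea of round robin distribution of metadata-only FlowFiles; the node names and filenames are made up, and NiFi's real load balancing also accounts for things like back pressure and node availability.

```python
from itertools import cycle

NODES = ["node1", "node2", "node3"]  # assumed 3-node cluster


def round_robin(flowfiles, nodes):
    """Assign each listed FlowFile to the next node in turn."""
    assignment = {node: [] for node in nodes}
    for flowfile, node in zip(flowfiles, cycle(nodes)):
        assignment[node].append(flowfile)
    return assignment


# Hypothetical listing output from the primary node: metadata only, no content.
listing = [{"filename": f"file_{n}.csv"} for n in range(1, 11)]
assignment = round_robin(listing, NODES)

# Each node would then fetch content only for the files assigned to it.
for node, files in assignment.items():
    print(node, [f["filename"] for f in files])
```

Every filename lands on exactly one node, so the fetch step running on all nodes never retrieves the same file twice.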
GenerateTableFetch (primary node execution) --> LB connection --> ExecuteSQL

The goal with these dataflows is to avoid having one node ingest all the content (added network and disk I/O) only to then add more network and disk I/O to spread that content across all nodes. Instead, we simply get details about the data to be fetched so that those details can be distributed across all nodes, and each node gets only a specific portion of the source data.

If you found this response assisted with your query, please take a moment to log in and click on "Accept as Solution" below this post.
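For the database case, a minimal sketch of the same idea using Python's built-in sqlite3 module: one step produces paged queries (the role GenerateTableFetch plays), and each query is then run by a different node (the role of ExecuteSQL). The table name `records`, the key column `id`, and the partition size of 4 are assumptions for illustration, and the SQL NiFi actually generates depends on the configured database type and processor properties.

```python
import sqlite3

NODES = ["node1", "node2", "node3"]  # assumed 3-node cluster


def page_queries(table, key_column, row_count, partition_size):
    """Build one paged SELECT per partition (illustrative SQL only)."""
    return [
        f"SELECT * FROM {table} ORDER BY {key_column} "
        f"LIMIT {partition_size} OFFSET {offset}"
        for offset in range(0, row_count, partition_size)
    ]


# In-memory stand-in for the source database with 10 records.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE records (id INTEGER PRIMARY KEY, payload TEXT)")
conn.executemany(
    "INSERT INTO records (id, payload) VALUES (?, ?)",
    [(i, f"row-{i}") for i in range(1, 11)],
)

# Primary node: generate the queries only, without reading any rows.
queries = page_queries("records", "id", row_count=10, partition_size=4)

# All nodes: each query is handed to one node, which fetches just that page.
fetched = {node: [] for node in NODES}
for index, query in enumerate(queries):
    node = NODES[index % len(NODES)]
    fetched[node] += [row[0] for row in conn.execute(query)]

print(fetched)
```

Only the small query strings cross the cluster; the rows themselves are read once, by the node that ends up processing them.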

Thank you,

Matt

rafy · ‎06-29-2022

Thank you so much, Mr @MattWho, for the comprehensive explanation of what Mr @SAMSAL proposed earlier.

I really learned a lot and appreciate it.

Cloudera Community

Can NIFI nodes access different records on a Database