
Can NiFi nodes access different records in a database?

Rising Star

Hi,

I have a 3-node NiFi cluster.

I would like to ask whether each node of the cluster can output different results from the others (i.e. no duplicates).

SCENARIO:

Let's say I have 10 records in a DB.

If node1 outputs the first record, I don't want node2 or node3 to output the first record again.

 

Thank you.

2 ACCEPTED SOLUTIONS

Expert Contributor

Hi,

It all depends on how the output is generated. You probably need the processor that fetches the DB records to run on the primary node only, and then load balance downstream so that each record is processed on a different node.


Master Guru

@rafy 
Each node in a NiFi cluster has its own copy of the dataflow and executes it independently of the other nodes. Some NiFi components can write cluster state to ZooKeeper to avoid data duplication across nodes. Ingest-type components that support this should be configured to execute on the primary node only.

In a NiFi cluster, one node is elected as the primary node (which node is elected can change at any time). If a primary node change happens, the same component on a different node gets scheduled and retrieves the cluster state, so it avoids ingesting the same data again. Often with these components, the processor that records state does not itself retrieve the content. It simply generates the metadata/attributes needed to get the content later, with the expectation that your flow design distributes those FlowFiles across all nodes before content retrieval.

For example:
- ListSFTP (primary node execution) --> success connection (with round robin LB configuration) --> FetchSFTP (all node execution)

ListSFTP creates a 0-byte FlowFile for each source file that will be fetched. The FetchSFTP processor uses that metadata/attributes to get the actual source content and add it to the FlowFile.
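The list-then-fetch pattern can be sketched outside NiFi as a small simulation. This is only an illustration of the idea, not NiFi code: the function names, the `source` dictionary, and the node names are all hypothetical, and a real load-balanced connection is configured in the NiFi UI rather than written by hand.

```python
from itertools import cycle


def list_on_primary(record_ids):
    """'List' step (primary node only): one lightweight descriptor per record, no content."""
    return [{"record_id": rid} for rid in record_ids]


def round_robin(descriptors, nodes):
    """Stand-in for a round robin load-balanced connection: hand descriptors out in turn."""
    assignment = {node: [] for node in nodes}
    for descriptor, node in zip(descriptors, cycle(nodes)):
        assignment[node].append(descriptor)
    return assignment


def fetch_on_node(descriptors, source):
    """'Fetch' step (all nodes): retrieve content only for the descriptors this node received."""
    return [source[d["record_id"]] for d in descriptors]


# Hypothetical source with 10 records, mirroring the scenario in the question.
source = {i: f"row-{i}" for i in range(1, 11)}
nodes = ["node1", "node2", "node3"]

assignment = round_robin(list_on_primary(sorted(source)), nodes)
fetched = {node: fetch_on_node(descs, source) for node, descs in assignment.items()}

for node, rows in fetched.items():
    print(node, rows)
```

Because only one node produces the descriptors and each descriptor is handed to exactly one node, every record is fetched once and no two nodes output the same record.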

Another example, closer to your query, might be:
GenerateTableFetch (primary node execution)  --> LB connection --> ExecuteSQL
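To make the database case more concrete, here is a rough sketch of the idea behind splitting a table into page-sized queries that can then be spread across nodes. It is an assumption-laden illustration: the table name `records`, the column `id`, the page size of 4, and the `LIMIT`/`OFFSET` syntax are all hypothetical choices, and the SQL that GenerateTableFetch actually emits depends on the database type and the processor's properties.

```python
def generate_page_queries(table, order_column, row_count, page_size):
    """Build one SELECT per page so that each statement covers a disjoint slice of rows."""
    queries = []
    for offset in range(0, row_count, page_size):
        queries.append(
            f"SELECT * FROM {table} ORDER BY {order_column} "
            f"LIMIT {page_size} OFFSET {offset}"
        )
    return queries


# Hypothetical table with 10 rows, split into pages of 4 rows each.
queries = generate_page_queries("records", "id", row_count=10, page_size=4)

# Each query would travel in its own FlowFile; round robin decides which node runs it.
nodes = ["node1", "node2", "node3"]
for i, query in enumerate(queries):
    print(nodes[i % len(nodes)], "->", query)
```

Each node then executes only the statements it received, so the row ranges it reads do not overlap with those read by the other nodes.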

 

The goal with these dataflows is to avoid having one node ingest all the content (added network and disk I/O) only to then add more network and disk I/O to spread that content across all nodes. Instead, we get only the details about the data to be fetched and distribute those across all nodes, so that each node retrieves only a specific portion of the source data.

If you found that this response helped with your query, please take a moment to log in and click "Accept as Solution" below this post.

Thank you,

Matt

 


4 REPLIES 4

Rising Star

Thank you sir.

I thought much the same. I was just wondering whether there might be a better way to go about it.

Do you have any other method?

Rising Star

Thank you so much, Mr @MattWho, for the comprehensive explanation of what Mr @SAMSAL proposed earlier.

I really learned a lot and appreciate it.
