NiFi Flow Design recommendation

I am new to NiFi. As I understand it, the primary node is used to run isolated processors. For example, if we have a processor that gets data from an FTP directory, it is better to run it on the primary node only, so that there is no extra load on the FTP server. In a scenario where I have multiple Get processors, e.g. one for FTP and another that gets data from a database, won't the primary node's performance become a bottleneck, since we cannot have more than one primary node? Below are my queries:

1. Are there any NiFi flow design patterns for the above scenario?

2. If we do not run the Get processors on the primary node only, is there a possibility of fetching duplicate data? Please clarify.

1 ACCEPTED SOLUTION

@riyer

For question 1, the best practice depends on how data is pulled or pushed into the cluster, which leads into the answer for question 2.

For question 2, if you configure the Get processor to run on the primary node only, there will not be duplicate data, because a cluster has only one primary node at any time.

A better solution that takes advantage of the cluster is to use the ListFTP and FetchFTP processors. Run the ListFTP processor on the primary node only, then distribute the list of files to all of the nodes in the cluster, via a remote process group that points back to the same cluster, and feed that list to a FetchFTP processor. That way each node pulls its own share of the files from the source, with no risk of duplicate data.
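To make the difference concrete, here is a minimal sketch in plain Python. It is a toy simulation, not NiFi code or a NiFi API: the node names, file names, and helper functions are all hypothetical, and the round-robin split is only an assumption standing in for however the remote process group actually spreads the listing. The first function models the simplified worst case where every node runs the same Get-style processor and sees the same files; the second models the list-once, fetch-everywhere pattern.

```python
from collections import Counter
from itertools import cycle


def get_on_all_nodes(nodes, remote_files):
    """Simplified worst case: every node sees and pulls the same files."""
    return {node: list(remote_files) for node in nodes}


def list_then_fetch(nodes, remote_files):
    """One node lists once; the listing is spread across nodes (round-robin
    here, as an assumption); each node fetches only its own share."""
    assignments = {node: [] for node in nodes}
    for name, node in zip(remote_files, cycle(nodes)):
        assignments[node].append(name)
    return assignments


def duplicates(assignments):
    """Return the file names that more than one node ended up with."""
    counts = Counter(name for files in assignments.values() for name in files)
    return sorted(name for name, seen in counts.items() if seen > 1)


# Hypothetical three-node cluster and five files on the FTP server.
nodes = ["node-1", "node-2", "node-3"]
files = ["a.csv", "b.csv", "c.csv", "d.csv", "e.csv"]

print(duplicates(get_on_all_nodes(nodes, files)))  # every file is duplicated
print(duplicates(list_then_fetch(nodes, files)))   # no file is duplicated
```

In the second case each file name appears in exactly one node's share, which is why the listing has to come from a single place (the primary node) before it is distributed.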


@Wynner But if there are multiple sources to get data from, doesn't the primary node become a bottleneck? Is there a solution or pattern for that?

@riyer

In the scenario I described, only the listing runs on the primary node. You could run the FetchFTP processor on all of the nodes, so the primary node is not a bottleneck.
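As a rough illustration of why this holds even with several sources, the sketch below (plain Python again, not NiFi code) takes listings from two hypothetical FTP sources, as if both had been produced on the primary node, and spreads the fetch work across the cluster. The source names, file sizes, and the round-robin assignment are all assumptions made for the example.

```python
from itertools import cycle


def fetch_load_per_node(nodes, listings):
    """Spread listed files over the nodes and total the bytes each node fetches.

    listings maps a source name to (file_name, size_bytes) pairs, standing in
    for the listing output produced on the primary node.
    """
    load = {node: 0 for node in nodes}
    targets = cycle(nodes)  # assumed round-robin distribution
    for entries in listings.values():
        for _name, size in entries:
            load[next(targets)] += size
    return load


nodes = ["node-1", "node-2", "node-3"]  # node-1 plays the primary node
listings = {
    "ftp-source-a": [("a1.csv", 100), ("a2.csv", 100), ("a3.csv", 100)],
    "ftp-source-b": [("b1.csv", 100), ("b2.csv", 100), ("b3.csv", 100)],
}

load = fetch_load_per_node(nodes, listings)
print(load)
```

Under these assumptions the primary node transfers the same share of the data as every other node. The only extra work it does is the listing, which deals in file names rather than file contents.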