ListFile primary node change

The documentation for ListFile states:

If the primary node changes, the new Primary Node will pick up where the previous node left off without duplicating all of the data.

How does the "new" primary node pick up where the previous node left off without flow file duplication? I ask because the previous primary node may still hold the flow file. When a new primary node is elected, how does it get the flow file without duplicating or cloning it?

1 ACCEPTED SOLUTION

I believe that statement was referring specifically to the listing operation performed by ListFile, and not the overall state of the primary node.

For example, if ListFile ran and listed files 1 and 2, and then the primary node changes and file 3 becomes available, it won't list files 1 and 2 again; it will start with file 3. However, if files 1 and 2 were still in progress in the rest of the flow on the original primary node when it went down, they are stuck there until that node comes back up.

The state tracking is done through the state management API, which uses ZooKeeper when clustered. I believe in this case it uses timestamps to track the last time the processor ran and the timestamp of the most recent file it saw, and then looks for files newer than that on the next execution.
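To make the idea concrete, here is a rough Python sketch of timestamp-based listing with shared state. It is not NiFi's actual code: `SharedState`, `Node`, and `list_new_files` are hypothetical names, the shared state is a plain object standing in for ZooKeeper, and the directory is a dict of file name to last-modified timestamp. It only illustrates why the new primary node skips files 1 and 2 while the flow files already queued on the old node stay where they are.

```python
from dataclasses import dataclass, field


@dataclass
class SharedState:
    """Hypothetical stand-in for cluster-wide state (ZooKeeper in a real cluster)."""
    latest_listed_ts: int = -1


@dataclass
class Node:
    """Hypothetical cluster node; `queued` models flow files held locally."""
    name: str
    queued: list = field(default_factory=list)

    def list_new_files(self, directory: dict, state: SharedState) -> list:
        # Keep only files strictly newer than the last timestamp recorded
        # in the shared state, ordered by timestamp.
        new_files = [
            fname
            for fname, ts in sorted(directory.items(), key=lambda kv: kv[1])
            if ts > state.latest_listed_ts
        ]
        if new_files:
            # Record the newest timestamp seen so any node can resume from it.
            state.latest_listed_ts = max(directory[f] for f in new_files)
        # The resulting flow files live on this node only.
        self.queued.extend(new_files)
        return new_files


state = SharedState()
shared_dir = {"file1": 100, "file2": 200}  # assumed visible to every node

node_a = Node("A")  # original primary node
node_b = Node("B")  # elected primary after A goes down

first_listing = node_a.list_new_files(shared_dir, state)

shared_dir["file3"] = 300  # a new file arrives after the primary node change
second_listing = node_b.list_new_files(shared_dir, state)

print(first_listing)   # files listed by the original primary node
print(second_listing)  # files listed by the new primary node
print(node_a.queued)   # flow files still sitting on the original node
```

In this sketch node B lists only `file3`, because the shared timestamp already covers `file1` and `file2`, while those two remain queued on node A. A real implementation would also have to deal with details this sketch ignores, such as several files sharing the same timestamp.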

Also keep in mind that this whole scenario only makes sense when listing a remote directory that all nodes in the NiFi cluster can access. It doesn't apply when listing a local directory that exists on only one node.
