Support Questions

sunile_manjee · ‎03-13-2017

The documentation for ListFile states:

If the primary node changes, the new Primary Node will pick up where the previous node left off without duplicating all of the data.

How does the "new" primary node pick up where the previous node left off without flow file duplication? I ask because the previous primary node may still hold the flow file. When a new primary node is elected, how does it get the flow file without duplicating or cloning it?

bbende · ‎03-13-2017

I believe that statement was referring specifically to the listing operation performed by ListFile, and not the overall state of the primary node.

For example, suppose ListFile ran and listed files 1 and 2, and then the primary node changes and file 3 becomes available. The new primary node won't list files 1 and 2 again; it will start with file 3. However, if files 1 and 2 were still in progress in the rest of the flow on the original primary node when it went down, they are stuck there until that node comes back up.

The state tracking is done through the state management API, which uses ZooKeeper when clustered. I believe in this case it uses timestamps to track the last time the processor ran and the timestamp of the most recent file it saw, and then looks for files newer than that on the next execution.
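To make the idea concrete, here is a minimal Python sketch of timestamp-based listing against shared state. It is not NiFi's actual code: `SharedState`, `Node`, and `list_files` are hypothetical names, the shared object only stands in for cluster state kept in ZooKeeper, and the directory is modeled as a dict of file name to modification timestamp.

```python
from dataclasses import dataclass, field


@dataclass
class SharedState:
    """Hypothetical stand-in for cluster-wide state (ZooKeeper in a real cluster)."""
    latest_listed_ts: int = -1


@dataclass
class Node:
    """Hypothetical cluster node; in_flight models flow files held only on this node."""
    name: str
    in_flight: list = field(default_factory=list)

    def list_files(self, directory: dict, state: SharedState) -> list:
        # Only files newer than the timestamp recorded in shared state are listed.
        new_files = sorted(
            (f for f, mtime in directory.items() if mtime > state.latest_listed_ts),
            key=directory.get,
        )
        if new_files:
            # Record the newest timestamp seen so any later primary node skips these.
            state.latest_listed_ts = max(directory[f] for f in new_files)
            # The resulting flow files live on this node, not in shared state.
            self.in_flight.extend(new_files)
        return new_files


state = SharedState()
shared_dir = {"file1": 100, "file2": 200}  # name -> modification timestamp

node_a = Node("node-a")
first_listing = node_a.list_files(shared_dir, state)

# node-a goes down, node-b is elected primary, and a new file arrives.
shared_dir["file3"] = 300
node_b = Node("node-b")
second_listing = node_b.list_files(shared_dir, state)

print(first_listing, second_listing, node_a.in_flight)
```

In this sketch only the timestamp is shared. The flow files for file1 and file2 stay in `node_a.in_flight`, which mirrors the point above that in-progress data remains on the original node until it comes back.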

Also keep in mind that this whole scenario only makes sense when listing a remote directory that all nodes in the NiFi cluster can access. It doesn't apply when listing a local directory that exists on only one node.

Cloudera Community

ListFile primary node change