Support Questions

ask_bill_brooks · ‎03-22-2020

Hi Guys

I need to know the following, thanks in advance:

i) Do List processors w/ timestamp tracking store state locally?

ii) Does this state survive NiFi restarts?

iii) If running on primary node only, would this mean when another primary node is chosen, the List processor would list any files it hasn't tracked (and re-ingress a large backlog of files if still there)?

iv) What out-of-box solutions can help to get around the issue of non-persisted non-distributed listing, or do we need custom auditing triggering individual listings?

MattWho · ‎03-24-2020

@domR

i) Do List processors w/ timestamp tracking store state locally?
--- If you are running a standalone NiFi and not a NiFi cluster, all state will be stored locally on disk.

--- If clustered, this depends on the list processor and how it is configured. The ListFile processor can be configured to store state locally or remotely, depending on your use case. For example, if a ListFile is added to a NiFi cluster and every node lists from a local path that is not shared across nodes, you would want each node to store the ListFile state locally, since it is unique per node and the other nodes have no access to the directory on each node. If your ListFile is listing against a directory that is mounted on every node in the cluster, the ListFile should be configured for remote state and configured to run on the primary node only.
--- Other list-based processors store state locally ONLY when it is a standalone NiFi. In a clustered NiFi install, their state is stored in ZooKeeper.

ii) Does this state survive NiFi restarts?
--- Yes, local state is stored on disk in NiFi's local state directory. Cluster/remote state is stored in ZooKeeper. State configuration is handled by the state-management.xml configuration file.

iii) If running on primary node only, would this mean when another primary node is chosen, the List processor would list any files it hasn't tracked (and re-ingress a large backlog of files if still there)?
--- When a primary node change occurs, the primary-node-only processors on the previous primary node are asked to stop executing, and the same processors on the newly elected primary node are asked to start. On the new node, the processor retrieves the last known state stored in ZooKeeper for that component before executing. There is a small chance of some limited data duplication: asking the processors on the old primary node to stop does not kill their active threads. If a processor is in the middle of an execution and does not complete it (that is, update the cluster state in ZK) before the newly elected primary node pulls the cluster state and starts to execute, some files may be listed again by the newly elected node, but it will not list from the beginning.
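
To make the duplication window concrete, here is a toy Python model of the handover described above. It is only a sketch and not NiFi's actual implementation: the names ClusterState, ListingNode and list_new are hypothetical, and the real processors track more than a single timestamp. The shared object stands in for the component state kept in ZooKeeper.

```python
from dataclasses import dataclass, field


@dataclass
class ClusterState:
    """Hypothetical stand-in for the per-component state kept in ZooKeeper."""
    last_listed_ts: int = 0


@dataclass
class ListingNode:
    """Hypothetical model of one node running a timestamp-tracking List processor."""
    name: str
    emitted: list = field(default_factory=list)

    def list_new(self, files, state, commit=True):
        """Emit files newer than the stored timestamp.

        commit=False models a run that was interrupted by a primary node
        change before it could write its updated state back.
        """
        new = sorted((ts, f) for f, ts in files.items() if ts > state.last_listed_ts)
        names = [f for _, f in new]
        self.emitted.extend(names)
        if new and commit:
            state.last_listed_ts = new[-1][0]
        return names


# Directory contents as {file name: modification timestamp}.
files = {"a.csv": 1, "b.csv": 2, "c.csv": 3}
state = ClusterState()
old_primary = ListingNode("node-1")
new_primary = ListingNode("node-2")

# Normal run on the old primary: lists everything and commits timestamp 3.
old_primary.list_new(files, state)

# A new file arrives; the old primary lists it, but the primary node
# changes before the state update is written.
files["d.csv"] = 4
old_primary.list_new(files, state, commit=False)

# The new primary starts from the last committed state (timestamp 3).
files["e.csv"] = 5
relisted = new_primary.list_new(files, state)
duplicates = set(old_primary.emitted) & set(new_primary.emitted)

print("new primary listed:", relisted)
print("listed by both nodes:", sorted(duplicates))
```

In this model the new primary picks up from the last committed timestamp, so only the file from the interrupted run is listed twice; the earlier backlog is not re-listed.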

iv) What out-of-box solutions can help to get around the issue of non-persisted non-distributed listing, or do we need custom auditing triggering individual listings?
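--- NiFi does persist state through node restarts, and in a cluster that state is stored in ZooKeeper, where a newly elected primary node can retrieve it as described above.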

Note: You can right-click on a processor that stores state and select "View state" to see what has been stored. You can also right-click on a processor and select "View usage" to open the embedded documentation for that component. The embedded documentation contains a "State Management:" section that tells you whether the component stores state and whether that state is stored locally or in the cluster (ZK).

Hope this helps,

Matt

domR · ‎03-24-2020

Thanks for clearing this up Matt, was a big help.

Cheers,
Dom

Cloudera Community

NiFi - List SFTP / HDFS Processors - State