Created 06-03-2021 03:17 AM
Hi,
According to the official NiFi documentation, state allows NiFi processors to "resume from the place where it left off after NiFi is restarted. Additionally, it allows for a Processor to store some piece of information so that the Processor can access that information from all of the different nodes in the cluster".
If I understand correctly, when a ZooKeeper provider is configured, state is not persisted locally; instead, the data is sent to ZooKeeper.
I've explored the ZooKeeper znodes and could not find any data related to state; all I can find is information about the Coordinator and Primary nodes. However, the local state directory is still being filled.
The configuration is very simple: I have 3 external ZK nodes and 3 NiFi instances.
Here is an excerpt of the nifi.properties file:
nifi.cluster.is.node=true
nifi.zookeeper.connect.string=zk-node1:2181,zk-node2:2181,zk-node3:2181
nifi.state.management.embedded.zookeeper.start=false
nifi.state.management.provider.cluster=zk-provider
And here is an excerpt of the state-management.xml file:
<cluster-provider>
    <id>zk-provider</id>
    <class>org.apache.nifi.controller.state.providers.zookeeper.ZooKeeperStateProvider</class>
    <property name="Connect String">zk-node1:2181,zk-node2:2181,zk-node3:2181</property>
    <property name="Root Node">/nifi</property>
    <property name="Session Timeout">10 seconds</property>
    <property name="Access Control">Open</property>
</cluster-provider>
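One way to sanity-check a setup like this is to confirm that the provider id referenced in nifi.properties matches the <id> in state-management.xml, and to work out where under the Root Node component state would be expected. The following is only a minimal Python sketch: it embeds the two excerpts above as strings, the function names are hypothetical, and it assumes component state lives under the "components" znode beneath the Root Node.

```python
import xml.etree.ElementTree as ET

# The two excerpts from this post, embedded as strings for illustration.
NIFI_PROPERTIES = """\
nifi.cluster.is.node=true
nifi.zookeeper.connect.string=zk-node1:2181,zk-node2:2181,zk-node3:2181
nifi.state.management.embedded.zookeeper.start=false
nifi.state.management.provider.cluster=zk-provider
"""

CLUSTER_PROVIDER_XML = """\
<cluster-provider>
    <id>zk-provider</id>
    <class>org.apache.nifi.controller.state.providers.zookeeper.ZooKeeperStateProvider</class>
    <property name="Connect String">zk-node1:2181,zk-node2:2181,zk-node3:2181</property>
    <property name="Root Node">/nifi</property>
    <property name="Session Timeout">10 seconds</property>
    <property name="Access Control">Open</property>
</cluster-provider>
"""


def parse_properties(text):
    """Parse key=value lines, skipping blanks and # comments."""
    props = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        key, _, value = line.partition("=")
        props[key.strip()] = value.strip()
    return props


def summarize_cluster_provider(properties_text, provider_xml):
    """Hypothetical helper: cross-check the two config excerpts."""
    props = parse_properties(properties_text)
    provider = ET.fromstring(provider_xml)
    settings = {
        p.get("name"): (p.text or "").strip()
        for p in provider.findall("property")
    }
    root_node = settings.get("Root Node", "")
    return {
        "ids_match": props.get("nifi.state.management.provider.cluster")
        == provider.findtext("id"),
        "root_node": root_node,
        # Assumption: component state is kept under <Root Node>/components.
        "components_path": root_node.rstrip("/") + "/components",
    }


summary = summarize_cluster_provider(NIFI_PROPERTIES, CLUSTER_PROVIDER_XML)
print(summary)
```

With the excerpts above, the ids match and the path to inspect is the "components" znode under /nifi.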
When I ls the ZooKeeper root node, I see only 2 znodes: "components", which is empty, and "leaders", which contains some data about the NiFi Coordinator and Primary nodes.
Also, when I explore the transaction logs, even after using some load-balanced connections, I cannot find anything related to NiFi state.
Could somebody explain what data goes to ZooKeeper and why the local state directory is still filled even when the ZK provider is configured?
Thanks.
Created 06-03-2021 09:45 AM
@khaldoune
NiFi state is used only by the NiFi components and framework bits that are built to use it.
Some select components can be configured to use local state even if you are set up with a NiFi cluster. Other select components will use cluster state if NiFi is clustered, but use local state in a standalone NiFi.
You can refer to the embedded documentation for each component (processor, controller service, or reporting task) to see if it uses the state provider.
For example, look at the embedded docs for ListFile and you will see a "state management:" section with:
In that section you see that this processor can use the local or cluster state provider, along with a description of how state is utilized by this component. For this specific processor, the "Input Directory Location" configuration controls which provider is used.
For components that do not use state (the bulk of components do not), the same section in their embedded docs will reflect:
Load-balanced connections do not record state. Load-balanced connections copy FlowFiles from one node to another and, on confirmation of success, the local copies are removed. So if NiFi is shut down or dies while data is being copied by a load-balanced connection, the source NiFi will simply start over distributing the FlowFiles again when it is back online in the cluster.
If you found this addressed your query, please take a moment to log in and click "Accept" on this solution.
Thank you,
Matt
Created 06-04-2021 12:50 PM
Hi @MattWho ,
Thanks for your clear answer.
I'd like to understand what happens if we use a processor that supports only the "cluster" state on one (or many) standalone NiFi instance(s).
Will the state be persisted locally or will it be ignored?
Thanks.
Created on 06-04-2021 01:33 PM - edited 06-04-2021 01:33 PM
For a standalone NiFi (meaning that "nifi.cluster.is.node" is set to false in the nifi.properties file), components (processors, controller services, and reporting tasks) that write state will use the local state directory to record state.
The problem here is that if you switch to being clustered later, there is no way to move the components' state from local to ZooKeeper.
NOTE: It is possible to have a 1-node NiFi cluster (which offers no HA control plane), but it will still require a ZooKeeper quorum.
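To make the rule concrete, here is a minimal Python sketch of how the effective state store could be reasoned about. It is only an illustration of the behavior described in this thread, not NiFi's actual code; the function name and the return strings are hypothetical.

```python
def effective_state_store(component_scope, is_cluster_node):
    """Return where a component's state ends up.

    component_scope: "LOCAL" or "CLUSTER", the scope the component asks for.
    is_cluster_node: value of nifi.cluster.is.node for this instance.
    """
    if component_scope not in ("LOCAL", "CLUSTER"):
        raise ValueError(f"unknown scope: {component_scope!r}")
    # Cluster scope is only honored when the instance is a cluster node.
    if component_scope == "CLUSTER" and is_cluster_node:
        return "cluster provider (ZooKeeper)"
    # Standalone instances, and components using local scope, write locally.
    return "local provider (local state directory)"


# A cluster-scoped processor on a standalone instance still records state,
# just in the local state directory.
print(effective_state_store("CLUSTER", is_cluster_node=False))
print(effective_state_store("CLUSTER", is_cluster_node=True))
```

So in the standalone case the state is not ignored; it is persisted locally.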
Hope this helps,
Matt
Created 06-04-2021 01:44 PM
Some components that maintain state do so because they were developed with the intent of being used in a NiFi cluster setup to support non-cluster-friendly protocols.
Example (getting data from an SFTP server):
In a standalone NiFi you would use the GetSFTP processor (does not record state).
In a clustered NiFi you would use the ListSFTP (records state) and FetchSFTP processors to do the same task.
The ListSFTP processor would be configured to execute on "primary node" only. That way you do not have every node in your cluster trying to list the same files on your target SFTP server. Then the success relationship from ListSFTP, which simply carries FlowFiles with no content and only metadata/attributes, is connected to a FetchSFTP processor. The connection between those two processors would be configured to load balance those FlowFiles to all nodes. Now the heavy work of ingesting the actual content for each of those listed FlowFiles is spread across all nodes in the cluster.
Even if you use the above processors in a standalone NiFi, they will still record state.
Cluster state is generally stored to help when a primary node change occurs. That way the newly elected primary node, which now starts executing the processors configured for primary node only, will have those processors fetch the last known state from ZK so that they do not list the same files already listed by the previous primary node.
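As a rough illustration of why shared state matters on a primary node change, here is a small Python simulation. It is a sketch under simplifying assumptions: the ClusterState class is a hypothetical in-memory stand-in for ZooKeeper, the state key "last_listed_timestamp" is made up, and a real listing processor tracks more than a single timestamp.

```python
class ClusterState:
    """Hypothetical in-memory stand-in for ZooKeeper-backed cluster state."""

    def __init__(self):
        self._data = {}

    def get(self, component_id):
        return dict(self._data.get(component_id, {}))

    def set(self, component_id, state):
        self._data[component_id] = dict(state)


def list_new_files(component_id, remote_files, cluster_state):
    """List only files newer than the last recorded listing.

    remote_files maps file name -> modification timestamp.
    """
    state = cluster_state.get(component_id)
    last_seen = state.get("last_listed_timestamp", -1)
    new_files = sorted(
        name for name, mtime in remote_files.items() if mtime > last_seen
    )
    if new_files:
        newest = max(remote_files[name] for name in new_files)
        cluster_state.set(component_id, {"last_listed_timestamp": newest})
    return new_files


zk = ClusterState()
files = {"a.csv": 100, "b.csv": 200}

# node1 is primary and runs the listing.
first = list_new_files("ListSFTP-1", files, zk)

# A new file arrives, node1 goes away, node2 is elected primary.
files["c.csv"] = 300
second = list_new_files("ListSFTP-1", files, zk)

print(first)   # ['a.csv', 'b.csv']
print(second)  # ['c.csv']
```

Because the second listing reads the state written by the first, the new primary node picks up only the file that arrived after the last listing.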
Just some more context for you on how state is used primarily by components and why.
Matt