Created 05-12-2018 10:27 PM
Hey Guys,
We are running NiFi on a cluster of two nodes. In the last 5 weeks, the flow.xml.gz file has gotten out of sync twice. What are some situations that will defeat NiFi's replication of this file? Is there anything we can do to reduce the likelihood of this happening again?
Created 05-15-2018 01:19 PM
It is likely the flow is becoming out of sync as a result of one of the following scenarios:
-
1. A change replication request was made to all nodes in the cluster, and one or more nodes failed to process that request within the configured nifi.cluster.node.connection.timeout and/or nifi.cluster.node.read.timeout. By default these values are set very low; I recommend setting them to a minimum of 30 secs. Other reasons a replication request may fail range from network issues (high latency or packet loss) to the size of the replication request (snippets), for example copying/pasting a large selection of canvas components or instantiating a large template. Each node must finish adding every component, connection, controller service, etc. in that snippet before it responds to the replication request.
-
2. A node was in a disconnected state. Every cluster has a cluster coordinator to which every node sends heartbeats. If a node is disconnected from the cluster (through manual user action, or because of a request failure as described above), that node's UI can still be accessed. It will display the word "disconnected" on the status bar, but it will still allow a user who is connected directly to that node's UI to make changes. If an attempt is then made to reconnect this node to the cluster after such a change, it will fail because the flow on this node no longer matches the cluster's flow. Users who are connected to any of the other nodes still in the cluster will see a missing node in the status bar (for example: 1/2). Changes will not be allowed from those nodes' UIs because a node is missing.
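To check the two timeouts from scenario 1 on each node, something like the following rough sketch could help. It only reads the text of nifi.properties and reports which of the two properties are set below 30 secs. The function names and the list of accepted time units are my own assumptions for illustration, not anything NiFi ships, and the unit list is only a subset of what NiFi accepts.

```python
import re

# The two properties named in scenario 1.
TIMEOUT_KEYS = (
    "nifi.cluster.node.connection.timeout",
    "nifi.cluster.node.read.timeout",
)

# Assumed subset of time units; extend it if your file uses others.
_UNIT_SECONDS = {
    "ms": 0.001, "millis": 0.001,
    "s": 1, "sec": 1, "secs": 1, "second": 1, "seconds": 1,
    "m": 60, "min": 60, "mins": 60, "minute": 60, "minutes": 60,
}


def parse_duration(value):
    """Convert a value such as '5 sec' or '1 min' to seconds."""
    match = re.fullmatch(r"\s*(\d+(?:\.\d+)?)\s*([A-Za-z]+)\s*", value)
    if not match:
        raise ValueError(f"unrecognised duration: {value!r}")
    number, unit = match.groups()
    unit = unit.lower()
    if unit not in _UNIT_SECONDS:
        raise ValueError(f"unsupported unit in this sketch: {unit!r}")
    return float(number) * _UNIT_SECONDS[unit]


def read_properties(text):
    """Parse simple key=value lines, skipping blanks and # comments."""
    props = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        props[key.strip()] = value.strip()
    return props


def low_timeouts(text, minimum_seconds=30.0):
    """Return the timeout properties that are set below the minimum."""
    props = read_properties(text)
    low = {}
    for key in TIMEOUT_KEYS:
        if key in props and parse_duration(props[key]) < minimum_seconds:
            low[key] = props[key]
    return low
```

Pass it the contents of each node's nifi.properties (for example `low_timeouts(open(path).read())`, where `path` is wherever that file lives on your install). An empty result means both properties, if present, are already at 30 secs or higher.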
-
It is best to look at nifi-app.log or use the cluster UI in NiFi to see why a node was disconnected in the first place.
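If the log is large, a small filter can narrow down where to look. This is only a sketch: it does a case-insensitive search for the word "disconnect" and does not assume any particular NiFi log message format. The function name and the keyword are my own choices.

```python
def find_disconnect_lines(log_lines, keyword="disconnect"):
    """Return (line_number, text) pairs for lines containing the keyword.

    The match is a plain case-insensitive substring search; no specific
    log message wording is assumed.
    """
    needle = keyword.lower()
    return [
        (number, line.rstrip("\n"))
        for number, line in enumerate(log_lines, start=1)
        if needle in line.lower()
    ]
```

For example, `find_disconnect_lines(open(log_path, errors="replace"))` with `log_path` pointing at that node's nifi-app.log. The lines just before each hit are usually where the reason shows up.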
-
Users should be educated on how NiFi works as described above, and should avoid making changes to the canvas when "disconnected" is displayed on the NiFi UI status bar above the canvas.
-
NiFi allows changes to be made on disconnected nodes by design. The intent is to let users disconnect a node to troubleshoot a misbehaving node, or to perform some temporary one-off testing that they do not want to run across the entire cluster. In such cases, those changes must be backed out before the node can rejoin the cluster. Another option is to rename/remove flow.xml.gz on that disconnected node. On restart, in the absence of a flow.xml.gz file, the connecting node will inherit the flow.xml.gz from the cluster.
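The rename option can be scripted if you prefer. Here is a minimal sketch that moves flow.xml.gz aside under a timestamped name instead of deleting it, so the old flow is still there if you need it. The function name and the backup naming scheme are assumptions of mine, and the script assumes NiFi is already stopped on that node.

```python
import time
from pathlib import Path


def set_aside_flow(flow_path, suffix=None):
    """Rename flow.xml.gz to flow.xml.gz.bak-<suffix> and return the new path.

    Assumes NiFi is stopped on this node. The suffix defaults to a
    timestamp so repeated runs do not overwrite earlier backups.
    """
    flow_path = Path(flow_path)
    if not flow_path.is_file():
        raise FileNotFoundError(f"no flow file at {flow_path}")
    if suffix is None:
        suffix = time.strftime("%Y%m%d-%H%M%S")
    backup = flow_path.with_name(f"{flow_path.name}.bak-{suffix}")
    if backup.exists():
        raise FileExistsError(f"backup already exists: {backup}")
    flow_path.rename(backup)
    return backup
```

Call it with the path to the flow.xml.gz of the disconnected node, then start NiFi on that node so it can pick up the flow from the cluster as described above.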
-
Thanks,
Matt
Created 01-17-2024 01:15 AM
Hi Matt, good explanation here. Just a quick follow-up: is there a way to force a sync between the nodes in the cluster, let's say from the primary node?
Thanks in advance.
Greetz,
Dave