Created 04-07-2017 05:18 AM
Hi, we are running NiFi 1.0.0.2.0.0.0-579 (HDF) with 3 nodes. The cluster worked fine for almost 4 months until 3 days ago. Since then, nodes suddenly disconnect and reconnect, and the Primary Node and Cluster Coordinator change frequently.
What could be the cause? Inability to connect to ZooKeeper? Heavy load? Something else?
Below are the most common errors:
--------------------------------------
2017-04-07 11:22:44,055 ERROR [Leader Election Notification Thread-4] o.a.c.f.recipes.leader.LeaderSelector The leader threw an exception
java.lang.InterruptedException: null
        at java.util.concurrent.locks.AbstractQueuedSynchronizer.tryAcquireSharedNanos(AbstractQueuedSynchronizer.java:1326) ~[na:1.8.0_92]
        at java.util.concurrent.CountDownLatch.await(CountDownLatch.java:277) ~[na:1.8.0_92]
        at org.apache.curator.CuratorZookeeperClient.internalBlockUntilConnectedOrTimedOut(CuratorZookeeperClient.java:325) ~[curator-client-2.11.0.jar:na]
-----------------------------------------
2017-04-07 10:34:33,416 ERROR [Leader Election Notification Thread-1] o.a.c.f.recipes.leader.LeaderSelector The leader threw an exception
java.lang.IllegalMonitorStateException: You do not own the lock: /nifi/leaders/Cluster Coordinator
        at org.apache.curator.framework.recipes.locks.InterProcessMutex.release(InterProcessMutex.java:140) ~[curator-recipes-2.11.0.jar:na]
---------------------------------------------
2017-04-07 09:14:26,386 ERROR [Leader Election Notification Thread-4] o.a.c.f.recipes.leader.LeaderSelector The leader threw an exception
java.lang.IllegalMonitorStateException: You do not own the lock: /nifi/leaders/Primary Node
-----------------------------------
2017-04-06 16:28:31,555 ERROR [Curator-Framework-0] o.a.c.f.imps.CuratorFrameworkImpl Background operation retry gave up
org.apache.zookeeper.KeeperException$ConnectionLossException: KeeperErrorCode = ConnectionLoss
        at org.apache.zookeeper.KeeperException.create(KeeperException.java:99) ~[zookeeper-3.4.6.jar:3.4.6-1569965]
        at org.apache.curator.framework.imps.CuratorFrameworkImpl.checkBackgroundRetry(CuratorFrameworkImpl.java:728) [curator-framework-2.11.0.jar:na]
Created 04-07-2017 01:10 PM
The election of the Primary Node and the Cluster Coordinator occurs through ZooKeeper. Once a Cluster Coordinator is elected, all nodes begin sending heartbeats directly to the elected Cluster Coordinator. If a heartbeat is not received within the configured threshold, that node will be disconnected.
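For reference, the heartbeat-related settings live in nifi.properties on each node. The values below are the NiFi 1.0 defaults; check your own files, since your thresholds may differ:

# nifi.properties (per node) - defaults shown
nifi.cluster.protocol.heartbeat.interval=5 sec   # how often each node heartbeats to the coordinator
nifi.cluster.node.connection.timeout=5 sec       # cluster protocol connect timeout
nifi.cluster.node.read.timeout=5 sec             # cluster protocol read timeout

A node is disconnected when the coordinator goes a few consecutive heartbeat intervals without hearing from it, so long pauses on a node show up as disconnections.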
A single node disconnecting and reconnecting may indicate a problem with just that node: network latency between the node and the Cluster Coordinator, a garbage collection "stop the world" event that prevents the node from heartbeating to the Cluster Coordinator, and so on. Check the NiFi app log on your nodes to make sure they are sending heartbeats regularly; a quick check is shown below.
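One quick way to check is to grep the app log for heartbeat entries on each node (the log path below assumes a typical HDF install; adjust it for yours):

grep -i "heartbeat" /var/log/nifi/nifi-app.log | tail -20

Gaps between consecutive heartbeat timestamps that are much longer than the configured interval usually point to GC pauses or an overloaded node.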
In your case you mention the Cluster Coordinator changes nodes frequently. This means that a new node is being elected as the Cluster Coordinator by ZooKeeper, which happens when the current Cluster Coordinator has trouble communicating with ZooKeeper. Again, garbage collection can be the cause.
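To see whether GC pauses line up with the coordinator changes, you can sample GC activity on each node's JVM. jstat ships with the JDK; the pid lookup below assumes a standard install, so adjust it as needed:

# find the NiFi JVM and sample GC statistics every 5 seconds
NIFI_PID=$(pgrep -f org.apache.nifi.NiFi)
jstat -gcutil $NIFI_PID 5000

If full GCs are frequent or long, review the heap settings in bootstrap.conf (in a default install the heap is set via the java.arg.2 / -Xms and java.arg.3 / -Xmx lines).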
There is a known bug in HDF 2.0 / NiFi 1.0 (https://issues.apache.org/jira/browse/NIFI-2999) that can result in all nodes being disconnected when the Cluster Coordinator changes hosts. Since nodes send heartbeats directly to the current Cluster Coordinator, whichever node is currently the Cluster Coordinator keeps track of when the last heartbeat was received from each node. Let's assume a 3-node cluster (Nodes A, B, and C): Node A is the current Cluster Coordinator and is receiving heartbeats. At some point later, Node B becomes the Cluster Coordinator and all nodes start sending heartbeats there. The bug (which has since been addressed) occurs if Node A later becomes the Cluster Coordinator again. When that happens, Node A looks at the last heartbeat times it recorded while it was previously the Cluster Coordinator; since they are all old, every node gets disconnected. The nodes then auto-reconnect on the next heartbeat.
You can upgrade to get past this bug (HDF 2.1 / NiFi 1.1), but ultimately you need to address whatever is causing the Cluster Coordinator to change nodes. This is typically a load issue where there are insufficient resources to maintain a connection with ZooKeeper, an overloaded ZooKeeper, a ZooKeeper ensemble that does not have quorum, a node garbage collection issue resulting in too long a lapse between ZooKeeper connections, etc.
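To rule out the ZooKeeper side, you can check each ensemble member with the standard four-letter-word commands (replace the host/port with your own) and review the ZooKeeper client timeouts in nifi.properties. The values shown are the NiFi 1.0 defaults; raising them can help on a busy cluster:

echo stat | nc zk-host-1 2181    # one member should report Mode: leader, the rest Mode: follower
echo ruok | nc zk-host-1 2181    # a healthy server answers "imok"

# nifi.properties - ZooKeeper client timeouts (defaults shown)
nifi.zookeeper.connect.timeout=3 secs
nifi.zookeeper.session.timeout=3 secs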
Thanks,
Matt
Created 05-04-2017 08:02 AM
Thanks.