Created 09-08-2017 08:44 PM
We are seeing the Primary Node change quite frequently. With DEBUG logging enabled, we noted the following error. Any ideas on how to resolve this or reduce how often it happens?
2017-09-08 16:24:48,326 DEBUG [CommitProcessor:2] o.a.z.server.FinalRequestProcessor Processing request:: sessionid:0x25e631231920001 type:getData cxid:0x552 zxid:0xfffffffffffffffe txntype:unknown reqpath:/nifi/leaders/Cluster Coordinator/_c_436facb3-d463-4782-ada5-48d11856bfdf-lock-0000000092
2017-09-08 16:24:48,326 DEBUG [CommitProcessor:2] o.a.z.server.FinalRequestProcessor sessionid:0x25e631231920001 type:getData cxid:0x552 zxid:0xfffffffffffffffe txntype:unknown reqpath:/nifi/leaders/Cluster Coordinator/_c_436facb3-d463-4782-ada5-48d11856bfdf-lock-0000000092
2017-09-08 16:24:48,661 INFO [Process Cluster Protocol Request-8] o.a.n.c.p.impl.SocketProtocolListener Finished processing request 65c0e064-6313-4fed-ae4d-57cbf0fec692 (type=HEARTBEAT, length=4809 bytes) from dcwipphnif005.edc.nam.gm.com:8443 in 331 millis
2017-09-08 16:24:48,771 INFO [main-EventThread] o.a.c.f.state.ConnectionStateManager State change: SUSPENDED
2017-09-08 16:24:48,773 INFO [Curator-ConnectionStateManager-0] o.a.n.c.l.e.CuratorLeaderElectionManager org.apache.nifi.controller.leader.election.CuratorLeaderElectionManager$ElectionListener@40da1bc7 Connection State changed to SUSPENDED
2017-09-08 16:24:48,773 DEBUG [Replicate Request Thread-1197] org.apache.curator.RetryLoop Retry-able exception received
org.apache.zookeeper.KeeperException$ConnectionLossException: KeeperErrorCode = ConnectionLoss for /nifi/leaders/Cluster Coordinator/_c_436facb3-d463-4782-ada5-48d11856bfdf-lock-0000000092
at org.apache.zookeeper.KeeperException.create(KeeperException.java:99)
at org.apache.zookeeper.KeeperException.create(KeeperException.java:51)
at org.apache.zookeeper.ZooKeeper.getData(ZooKeeper.java:1155)
at org.apache.curator.framework.imps.GetDataBuilderImpl$4.call(GetDataBuilderImpl.java:310)
at org.apache.curator.framework.imps.GetDataBuilderImpl$4.call(GetDataBuilderImpl.java:299)
at org.apache.curator.RetryLoop.callWithRetry(RetryLoop.java:108)
at org.apache.curator.framework.imps.GetDataBuilderImpl.pathInForeground(GetDataBuilderImpl.java:295)
at org.apache.curator.framework.imps.GetDataBuilderImpl.forPath(GetDataBuilderImpl.java:287)
at org.apache.curator.framework.imps.GetDataBuilderImpl.forPath(GetDataBuilderImpl.java:34)
at org.apache.curator.framework.recipes.leader.LeaderSelector.participantForPath(LeaderSelector.java:375)
at org.apache.curator.framework.recipes.leader.LeaderSelector.getLeader(LeaderSelector.java:346)
at org.apache.curator.framework.recipes.leader.LeaderSelector.getLeader(LeaderSelector.java:339)
at org.apache.nifi.controller.leader.election.CuratorLeaderElectionManager.getLeader(CuratorLeaderElectionManager.java:217)
at org.apache.nifi.cluster.coordination.node.NodeClusterCoordinator.getElectedActiveCoordinatorAddress(NodeClusterCoordinator.java:174)
at org.apache.nifi.cluster.coordination.node.NodeClusterCoordinator.getElectedActiveCoordinatorNode(NodeClusterCoordinator.java:460)
at org.apache.nifi.cluster.coordination.node.NodeClusterCoordinator.getElectedActiveCoordinatorNode(NodeClusterCoordinator.java:454)
at org.apache.nifi.cluster.coordination.node.NodeClusterCoordinator.isActiveClusterCoordinator(NodeClusterCoordinator.java:542)
at org.apache.nifi.cluster.coordination.node.NodeClusterCoordinator.afterRequest(NodeClusterCoordinator.java:965)
at org.apache.nifi.cluster.coordination.http.replication.ThreadPoolRequestReplicator.onCompletedResponse(ThreadPoolRequestReplicator.java:702)
at org.apache.nifi.cluster.coordination.http.replication.ThreadPoolRequestReplicator.lambda$replicate$19(ThreadPoolRequestReplicator.java:382)
at org.apache.nifi.cluster.coordination.http.replication.StandardAsyncClusterResponse.add(StandardAsyncClusterResponse.java:307)
at org.apache.nifi.cluster.coordination.http.replication.ThreadPoolRequestReplicator.lambda$replicate$21(ThreadPoolRequestReplicator.java:425)
at org.apache.nifi.cluster.coordination.http.replication.ThreadPoolRequestReplicator$NodeHttpRequest.run(ThreadPoolRequestReplicator.java:831)
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:748)
2017-09-08 16:24:48,775 DEBUG [Clustering Tasks Thread-1] org.apache.curator.RetryLoop Retry-able exception received
org.apache.zookeeper.KeeperException$ConnectionLossException: KeeperErrorCode = ConnectionLoss for /nifi/leaders/Cluster Coordinator
at org.apache.zookeeper.KeeperException.create(KeeperException.java:99)
at org.apache.zookeeper.KeeperException.create(KeeperException.java:51)
at org.apache.zookeeper.ZooKeeper.getChildren(ZooKeeper.java:1590)
at org.apache.curator.framework.imps.GetChildrenBuilderImpl$3.call(GetChildrenBuilderImpl.java:230)
at org.apache.curator.framework.imps.GetChildrenBuilderImpl$3.call(GetChildrenBuilderImpl.java:219)
at org.apache.curator.RetryLoop.callWithRetry(RetryLoop.java:108)
at org.apache.curator.framework.imps.GetChildrenBuilderImpl.pathInForeground(GetChildrenBuilderImpl.java:215)
at org.apache.curator.framework.imps.GetChildrenBuilderImpl.forPath(GetChildrenBuilderImpl.java:207)
at org.apache.curator.framework.imps.GetChildrenBuilderImpl.forPath(GetChildrenBuilderImpl.java:40)
at org.apache.curator.framework.recipes.locks.LockInternals.getSortedChildren(LockInternals.java:151)
at org.apache.curator.framework.recipes.locks.LockInternals.getParticipantNodes(LockInternals.java:133)
at org.apache.curator.framework.recipes.locks.InterProcessMutex.getParticipantNodes(InterProcessMutex.java:170)
at org.apache.curator.framework.recipes.leader.LeaderSelector.getLeader(LeaderSelector.java:338)
at org.apache.nifi.controller.leader.election.CuratorLeaderElectionManager.getLeader(CuratorLeaderElectionManager.java:217)
at org.apache.nifi.controller.cluster.ClusterProtocolHeartbeater.getHeartbeatAddress(ClusterProtocolHeartbeater.java:63)
at org.apache.nifi.controller.cluster.ClusterProtocolHeartbeater.send(ClusterProtocolHeartbeater.java:75)
at org.apache.nifi.controller.FlowController$HeartbeatSendTask.run(FlowController.java:4245)
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:308)
at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:180)
at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:294)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:748)
2017-09-08 16:24:48,775 INFO [Curator-ConnectionStateManager-0] o.a.n.c.l.e.CuratorLeaderElectionManager org.apache.nifi.controller.leader.election.CuratorLeaderElectionManager$ElectionListener@2df79081 Connection State changed to SUSPENDED
2017-09-08 16:24:48,775 INFO [Leader Election Notification Thread-1] o.a.n.c.l.e.CuratorLeaderElectionManager org.apache.nifi.controller.leader.election.CuratorLeaderElectionManager$ElectionListener@40da1bc7 has been interrupted; no longer leader for role 'Cluster Coordinator'
2017-09-08 16:24:48,776 INFO [Leader Election Notification Thread-2] o.a.n.c.l.e.CuratorLeaderElectionManager org.apache.nifi.controller.leader.election.CuratorLeaderElectionManager$ElectionListener@2df79081 has been interrupted; no longer leader for role 'Primary Node'
2017-09-08 16:24:48,776 INFO [Leader Election Notification Thread-2] o.a.n.c.l.e.CuratorLeaderElectionManager
Created 09-09-2017 07:22 PM
Check these properties. The default values are 3 seconds; change them to 30 seconds and see if it helps:
nifi.zookeeper.connect.timeout
nifi.zookeeper.session.timeout
I would also check these properties. The default values are 5 seconds; change them to 30 seconds as well:
nifi.cluster.node.connection.timeout
nifi.cluster.node.read.timeout
Finally, check this property and change it from the default of 10 to 40 or 50:
nifi.cluster.node.protocol.threads
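Put together, the suggested changes might look like the fragment below. This is only a sketch: it assumes the properties live in conf/nifi.properties on each node and that the same values are applied cluster-wide. The numbers are the starting points suggested above, not tuned values, so adjust them to your environment.

```properties
# conf/nifi.properties (assumed location; edit on every node in the cluster)

# ZooKeeper client timeouts (default 3 secs each)
nifi.zookeeper.connect.timeout=30 secs
nifi.zookeeper.session.timeout=30 secs

# Node-to-node cluster communication timeouts (default 5 sec each)
nifi.cluster.node.connection.timeout=30 sec
nifi.cluster.node.read.timeout=30 sec

# Threads handling cluster protocol requests such as heartbeats (default 10)
nifi.cluster.node.protocol.threads=40
```

NiFi reads nifi.properties at startup, so each node would need a restart before the new values take effect.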