Primary Node Changing Often / ConnectionLoss

We are seeing the primary node change quite frequently, and with DEBUG logging enabled we noted the following error. Any ideas on how to resolve this or reduce how often it happens?

2017-09-08 16:24:48,326 DEBUG [CommitProcessor:2] o.a.z.server.FinalRequestProcessor Processing request:: sessionid:0x25e631231920001 type:getData cxid:0x552 zxid:0xfffffffffffffffe txntype:unknown reqpath:/nifi/leaders/Cluster Coordinator/_c_436facb3-d463-4782-ada5-48d11856bfdf-lock-0000000092

2017-09-08 16:24:48,326 DEBUG [CommitProcessor:2] o.a.z.server.FinalRequestProcessor sessionid:0x25e631231920001 type:getData cxid:0x552 zxid:0xfffffffffffffffe txntype:unknown reqpath:/nifi/leaders/Cluster Coordinator/_c_436facb3-d463-4782-ada5-48d11856bfdf-lock-0000000092

2017-09-08 16:24:48,661 INFO [Process Cluster Protocol Request-8] o.a.n.c.p.impl.SocketProtocolListener Finished processing request 65c0e064-6313-4fed-ae4d-57cbf0fec692 (type=HEARTBEAT, length=4809 bytes) from dcwipphnif005.edc.nam.gm.com:8443 in 331 millis

2017-09-08 16:24:48,771 INFO [main-EventThread] o.a.c.f.state.ConnectionStateManager State change: SUSPENDED

2017-09-08 16:24:48,773 INFO [Curator-ConnectionStateManager-0] o.a.n.c.l.e.CuratorLeaderElectionManager org.apache.nifi.controller.leader.election.CuratorLeaderElectionManager$ElectionListener@40da1bc7 Connection State changed to SUSPENDED

2017-09-08 16:24:48,773 DEBUG [Replicate Request Thread-1197] org.apache.curator.RetryLoop Retry-able exception received

org.apache.zookeeper.KeeperException$ConnectionLossException: KeeperErrorCode = ConnectionLoss for /nifi/leaders/Cluster Coordinator/_c_436facb3-d463-4782-ada5-48d11856bfdf-lock-0000000092

at org.apache.zookeeper.KeeperException.create(KeeperException.java:99)

at org.apache.zookeeper.KeeperException.create(KeeperException.java:51)

at org.apache.zookeeper.ZooKeeper.getData(ZooKeeper.java:1155)

at org.apache.curator.framework.imps.GetDataBuilderImpl$4.call(GetDataBuilderImpl.java:310)

at org.apache.curator.framework.imps.GetDataBuilderImpl$4.call(GetDataBuilderImpl.java:299)

at org.apache.curator.RetryLoop.callWithRetry(RetryLoop.java:108)

at org.apache.curator.framework.imps.GetDataBuilderImpl.pathInForeground(GetDataBuilderImpl.java:295)

at org.apache.curator.framework.imps.GetDataBuilderImpl.forPath(GetDataBuilderImpl.java:287)

at org.apache.curator.framework.imps.GetDataBuilderImpl.forPath(GetDataBuilderImpl.java:34)

at org.apache.curator.framework.recipes.leader.LeaderSelector.participantForPath(LeaderSelector.java:375)

at org.apache.curator.framework.recipes.leader.LeaderSelector.getLeader(LeaderSelector.java:346)

at org.apache.curator.framework.recipes.leader.LeaderSelector.getLeader(LeaderSelector.java:339)

at org.apache.nifi.controller.leader.election.CuratorLeaderElectionManager.getLeader(CuratorLeaderElectionManager.java:217)

at org.apache.nifi.cluster.coordination.node.NodeClusterCoordinator.getElectedActiveCoordinatorAddress(NodeClusterCoordinator.java:174)

at org.apache.nifi.cluster.coordination.node.NodeClusterCoordinator.getElectedActiveCoordinatorNode(NodeClusterCoordinator.java:460)

at org.apache.nifi.cluster.coordination.node.NodeClusterCoordinator.getElectedActiveCoordinatorNode(NodeClusterCoordinator.java:454)

at org.apache.nifi.cluster.coordination.node.NodeClusterCoordinator.isActiveClusterCoordinator(NodeClusterCoordinator.java:542)

at org.apache.nifi.cluster.coordination.node.NodeClusterCoordinator.afterRequest(NodeClusterCoordinator.java:965)

at org.apache.nifi.cluster.coordination.http.replication.ThreadPoolRequestReplicator.onCompletedResponse(ThreadPoolRequestReplicator.java:702)

at org.apache.nifi.cluster.coordination.http.replication.ThreadPoolRequestReplicator.lambda$replicate$19(ThreadPoolRequestReplicator.java:382)

at org.apache.nifi.cluster.coordination.http.replication.StandardAsyncClusterResponse.add(StandardAsyncClusterResponse.java:307)

at org.apache.nifi.cluster.coordination.http.replication.ThreadPoolRequestReplicator.lambda$replicate$21(ThreadPoolRequestReplicator.java:425)

at org.apache.nifi.cluster.coordination.http.replication.ThreadPoolRequestReplicator$NodeHttpRequest.run(ThreadPoolRequestReplicator.java:831)

at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)

at java.util.concurrent.FutureTask.run(FutureTask.java:266)

at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)

at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)

at java.lang.Thread.run(Thread.java:748)

2017-09-08 16:24:48,775 DEBUG [Clustering Tasks Thread-1] org.apache.curator.RetryLoop Retry-able exception received

org.apache.zookeeper.KeeperException$ConnectionLossException: KeeperErrorCode = ConnectionLoss for /nifi/leaders/Cluster Coordinator

at org.apache.zookeeper.KeeperException.create(KeeperException.java:99)

at org.apache.zookeeper.KeeperException.create(KeeperException.java:51)

at org.apache.zookeeper.ZooKeeper.getChildren(ZooKeeper.java:1590)

at org.apache.curator.framework.imps.GetChildrenBuilderImpl$3.call(GetChildrenBuilderImpl.java:230)

at org.apache.curator.framework.imps.GetChildrenBuilderImpl$3.call(GetChildrenBuilderImpl.java:219)

at org.apache.curator.RetryLoop.callWithRetry(RetryLoop.java:108)

at org.apache.curator.framework.imps.GetChildrenBuilderImpl.pathInForeground(GetChildrenBuilderImpl.java:215)

at org.apache.curator.framework.imps.GetChildrenBuilderImpl.forPath(GetChildrenBuilderImpl.java:207)

at org.apache.curator.framework.imps.GetChildrenBuilderImpl.forPath(GetChildrenBuilderImpl.java:40)

at org.apache.curator.framework.recipes.locks.LockInternals.getSortedChildren(LockInternals.java:151)

at org.apache.curator.framework.recipes.locks.LockInternals.getParticipantNodes(LockInternals.java:133)

at org.apache.curator.framework.recipes.locks.InterProcessMutex.getParticipantNodes(InterProcessMutex.java:170)

at org.apache.curator.framework.recipes.leader.LeaderSelector.getLeader(LeaderSelector.java:338)

at org.apache.nifi.controller.leader.election.CuratorLeaderElectionManager.getLeader(CuratorLeaderElectionManager.java:217)

at org.apache.nifi.controller.cluster.ClusterProtocolHeartbeater.getHeartbeatAddress(ClusterProtocolHeartbeater.java:63)

at org.apache.nifi.controller.cluster.ClusterProtocolHeartbeater.send(ClusterProtocolHeartbeater.java:75)

at org.apache.nifi.controller.FlowController$HeartbeatSendTask.run(FlowController.java:4245)

at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)

at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:308)

at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:180)

at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:294)

at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)

at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)

at java.lang.Thread.run(Thread.java:748)

2017-09-08 16:24:48,775 INFO [Curator-ConnectionStateManager-0] o.a.n.c.l.e.CuratorLeaderElectionManager org.apache.nifi.controller.leader.election.CuratorLeaderElectionManager$ElectionListener@2df79081 Connection State changed to SUSPENDED

2017-09-08 16:24:48,775 INFO [Leader Election Notification Thread-1] o.a.n.c.l.e.CuratorLeaderElectionManager org.apache.nifi.controller.leader.election.CuratorLeaderElectionManager$ElectionListener@40da1bc7 has been interrupted; no longer leader for role 'Cluster Coordinator'

2017-09-08 16:24:48,776 INFO [Leader Election Notification Thread-2] o.a.n.c.l.e.CuratorLeaderElectionManager org.apache.nifi.controller.leader.election.CuratorLeaderElectionManager$ElectionListener@2df79081 has been interrupted; no longer leader for role 'Primary Node'

2017-09-08 16:24:48,776 INFO [Leader Election Notification Thread-2] o.a.n.c.l.e.CuratorLeaderElectionManager

1 Reply

@Wesley Bohannon

Check these properties. Their default value is 3 seconds; change them to 30 seconds and see if it helps:

nifi.zookeeper.connect.timeout 
nifi.zookeeper.session.timeout

I would also check these properties. Their default value is 5 seconds; change them to 30 seconds as well:

nifi.cluster.node.connection.timeout 
nifi.cluster.node.read.timeout

Finally, check this property and raise it from the default of 10 to 40 or 50:

nifi.cluster.node.protocol.threads
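
As a rough sketch, the suggested changes might look like this in nifi.properties on each node. The file location (conf/nifi.properties), the "30 secs" spelling of the durations, and the choice of 40 threads are assumptions for illustration, so compare against the values already in your own file. The file is read at startup, so each node would need a restart to pick up the new values.

```properties
# ZooKeeper client timeouts (default: 3 secs each)
nifi.zookeeper.connect.timeout=30 secs
nifi.zookeeper.session.timeout=30 secs

# Node-to-node cluster communication timeouts (default: 5 sec each)
nifi.cluster.node.connection.timeout=30 secs
nifi.cluster.node.read.timeout=30 secs

# Threads handling cluster protocol requests such as heartbeats (default: 10)
nifi.cluster.node.protocol.threads=40
```

If the SUSPENDED state changes in the log become less frequent after this, the leader re-elections for 'Cluster Coordinator' and 'Primary Node' should drop as well, since both were triggered by the lost ZooKeeper connection.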