Created on 11-12-2019 03:57 PM
I am running a Kubernetes cluster with three nodes, each running a NiFi pod (nifi-0, nifi-1, nifi-2) and a ZooKeeper pod (zk-0, zk-1, zk-2). Everything had been working fine. These are the relevant lines from nifi.properties:
nifi.state.management.embedded.zookeeper.start=false
nifi.zookeeper.connect.string=zk-0.nifi.svc.cluster.local:2181,zk-1.nifi.svc.cluster.local:2181,zk-2.nifi.svc.cluster.local:2181
Today, one of the nodes crashed, taking out nifi-0 and zk-1, which were both running on it. For completely unrelated reasons, Kubernetes was unable to spin up a new node to replace it. However, it is my understanding that with two NiFi and two ZooKeeper pods still running, this should not have been a problem.
ZooKeeper seems to be working fine: running /opt/zookeeper/bin/zkServer.sh status on zk-0 reports that it is the leader, and running it on zk-2 reports that it is a follower.
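As a client-side sanity check (just a sketch, not something taken from the running deployment: the connect string below deliberately lists only the two surviving servers, and the session timeout is arbitrary), a bare ZooKeeper client like this should be able to connect and list the root znode:

import org.apache.zookeeper.ZooKeeper;

public class ZkQuorumCheck {
    public static void main(String[] args) throws Exception {
        // Only the two surviving servers; zk-1 is deliberately left out.
        String connect = "zk-0.nifi.svc.cluster.local:2181,zk-2.nifi.svc.cluster.local:2181";

        // 10-second session timeout is arbitrary; the watcher simply ignores events.
        ZooKeeper zk = new ZooKeeper(connect, 10000, event -> {});
        try {
            Thread.sleep(3000); // give the client a moment to establish the session
            System.out.println("State: " + zk.getState());                     // expect CONNECTED
            System.out.println("Root children: " + zk.getChildren("/", false));
        } finally {
            zk.close();
        }
    }
}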
But NiFi is not working. Attempting to connect to the UI just returns the error "Action cannot be performed because there is currently no Cluster Coordinator elected. The request should be tried again after a moment, after a Cluster Coordinator has been automatically elected."
nifi-2 believes that it is the Cluster Coordinator (and Primary Node):
nifi-app_2019-11-12_19.0.log:2019-11-12 19:31:32,875 INFO [Leader Election Notification Thread-1] o.a.n.c.l.e.CuratorLeaderElectionManager org.apache.nifi.controller.leader.election.CuratorLeaderElectionManager$ElectionListener@59188519 This node has been elected Leader for Role 'Primary Node'
nifi-app_2019-11-12_19.0.log:2019-11-12 19:31:32,875 INFO [Leader Election Notification Thread-4] o.a.n.c.l.e.CuratorLeaderElectionManager org.apache.nifi.controller.leader.election.CuratorLeaderElectionManager$ElectionListener@1905a793 This node has been elected Leader for Role 'Cluster Coordinator'
nifi-app_2019-11-12_19.0.log:2019-11-12 19:31:32,876 INFO [Leader Election Notification Thread-4] o.apache.nifi.controller.FlowController This node elected Active Cluster Coordinator
nifi-app_2019-11-12_19.0.log:2019-11-12 19:31:32,876 INFO [Leader Election Notification Thread-1] o.apache.nifi.controller.FlowController This node has been elected Primary Node
nifi-1, though, believes that there is no cluster leader. It is stuck in a cycle of trying to connect to zk-1.nifi.svc.cluster.local; however, because that pod no longer exists (it crashed), the name is no longer resolvable (Kubernetes manages the DNS in this regard). The exact error is:
2019-11-12 23:15:18,232 ERROR [Leader Election Notification Thread-1] o.a.c.f.imps.CuratorFrameworkImpl Background exception was not retry-able or retry gave up
java.net.UnknownHostException: zk-1.nifi.svc.cluster.local
at java.net.InetAddress.getAllByName0(InetAddress.java:1281)
at java.net.InetAddress.getAllByName(InetAddress.java:1193)
at java.net.InetAddress.getAllByName(InetAddress.java:1127)
at org.apache.zookeeper.client.StaticHostProvider.<init>(StaticHostProvider.java:61)
at org.apache.zookeeper.ZooKeeper.<init>(ZooKeeper.java:445)
at org.apache.curator.utils.DefaultZookeeperFactory.newZooKeeper(DefaultZookeeperFactory.java:29)
at org.apache.curator.framework.imps.CuratorFrameworkImpl$2.newZooKeeper(CuratorFrameworkImpl.java:150)
at org.apache.curator.HandleHolder$1.getZooKeeper(HandleHolder.java:94)
at org.apache.curator.HandleHolder.getZooKeeper(HandleHolder.java:55)
at org.apache.curator.ConnectionState.getZooKeeper(ConnectionState.java:91)
at org.apache.curator.CuratorZookeeperClient.getZooKeeper(CuratorZookeeperClient.java:116)
at org.apache.curator.framework.imps.CuratorFrameworkImpl.performBackgroundOperation(CuratorFrameworkImpl.java:835)
at org.apache.curator.framework.imps.CuratorFrameworkImpl.processBackgroundOperation(CuratorFrameworkImpl.java:507)
at org.apache.curator.framework.imps.FindAndDeleteProtectedNodeInBackground.execute(FindAndDeleteProtectedNodeInBackground.java:60)
at org.apache.curator.framework.imps.CreateBuilderImpl.protectedPathInForeground(CreateBuilderImpl.java:496)
at org.apache.curator.framework.imps.CreateBuilderImpl.forPath(CreateBuilderImpl.java:474)
at org.apache.curator.framework.imps.CreateBuilderImpl.forPath(CreateBuilderImpl.java:44)
at org.apache.curator.framework.recipes.locks.StandardLockInternalsDriver.createsTheLock(StandardLockInternalsDriver.java:50)
at org.apache.curator.framework.recipes.locks.LockInternals.attemptLock(LockInternals.java:217)
at org.apache.curator.framework.recipes.locks.InterProcessMutex.internalLock(InterProcessMutex.java:232)
at org.apache.curator.framework.recipes.locks.InterProcessMutex.acquire(InterProcessMutex.java:89)
at org.apache.curator.framework.recipes.leader.LeaderSelector.doWork(LeaderSelector.java:386)
at org.apache.curator.framework.recipes.leader.LeaderSelector.doWorkLoop(LeaderSelector.java:441)
at org.apache.curator.framework.recipes.leader.LeaderSelector.access$100(LeaderSelector.java:64)
at org.apache.curator.framework.recipes.leader.LeaderSelector$2.call(LeaderSelector.java:245)
at org.apache.curator.framework.recipes.leader.LeaderSelector$2.call(LeaderSelector.java:239)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at org.apache.nifi.engine.FlowEngine$2.run(FlowEngine.java:110)
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:180)
at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
This is the only thing in nifi-app.log for hours; the node does not appear to attempt a connection to zk-0 or zk-2 at all. nifi-2 does not have these errors.
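From the stack trace, the failure appears to happen inside the ZooKeeper client's constructor (StaticHostProvider resolving every host in the connect string up front), before any server is contacted, so a single unresolvable name seems to fail the whole connection attempt rather than being skipped. Here is a minimal sketch that I would expect to reproduce the same path when run from inside one of the NiFi pods (assuming a comparable ZooKeeper client version; the connect string is copied from nifi.properties):

import java.net.UnknownHostException;
import org.apache.zookeeper.ZooKeeper;

public class ZkUnresolvableHostRepro {
    public static void main(String[] args) throws Exception {
        // The same three-server connect string NiFi uses; zk-1 no longer resolves
        // because its pod is gone.
        String connect = "zk-0.nifi.svc.cluster.local:2181,"
                       + "zk-1.nifi.svc.cluster.local:2181,"
                       + "zk-2.nifi.svc.cluster.local:2181";
        try {
            // If the client resolves all hosts in its constructor, this throws
            // UnknownHostException without ever contacting zk-0 or zk-2.
            ZooKeeper zk = new ZooKeeper(connect, 10000, event -> {});
            System.out.println("Client created, state: " + zk.getState());
            zk.close();
        } catch (UnknownHostException e) {
            // Mirrors the error nifi-1 keeps logging.
            System.out.println("Construction failed: " + e);
        }
    }
}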
So, the questions:
1. Should NiFi be able to handle the situation where it can't resolve or connect to one of the ZooKeeper addresses?
2. Is there any reason why a NiFi node might get "stuck" on a particular ZooKeeper instance, rather than trying the other instances?
Update (11-13-2019 08:07 AM): After killing the pod that was having the problem, both remaining NiFi pods now report the "Unable to resolve zk-1" error over and over, and neither one believes there is a Cluster Coordinator. ZooKeeper itself still seems to be working fine.