<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>question Nifi getting hung on invalid Zookeeper hostname in Support Questions</title>
    <link>https://community.cloudera.com/t5/Support-Questions/Nifi-getting-hung-on-invalid-Zookeeper-hostname/m-p/282804#M210207</link>
    <description>&lt;P&gt;I am running a kubernetes cluster with three nodes, each running a Nifi pod (nifi-0, nifi-1, nifi-2) and a Zookeeper pod (zk-0, zk-1, zk-2).&amp;nbsp; Everything worked.&amp;nbsp; These are the relevant lines from nifi.properties:&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;LI-CODE lang="markup"&gt;nifi.state.management.embedded.zookeeper.start=false
nifi.zookeeper.connect.string=zk-0.nifi.svc.cluster.local:2181,zk-1.nifi.svc.cluster.local:2181,zk-2.nifi.svc.cluster.local:2181&lt;/LI-CODE&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;Today, one of the nodes crashed, taking out nifi-0 and zk-1 which were both running on it.&amp;nbsp; For reasons completely unrelated, kubernetes was unable to spin up a new node to replace it.&amp;nbsp; However, with two Nifi and two Zookeeper pods still running, it is my understanding that this should not have been a problem.&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;Zookeeper seems to work fine.&amp;nbsp; "/opt/zookeeper/bin/zkServer.sh status" on zk-0 reports that it is the leader, and zk-2 reports that it is the follower.&lt;/P&gt;
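&lt;P&gt;To double-check beyond zkServer.sh, here is a minimal stand-alone sketch of my own (hostnames copied from the connect string above; not part of the nifi deployment) that resolves each entry and sends ZooKeeper's built-in "ruok" four-letter health command, to which a live server replies "imok":&lt;/P&gt;

```java
// My own diagnostic sketch (not from nifi): resolve each connect-string
// entry, then send ZooKeeper's "ruok" command; a live server replies "imok".
import java.io.InputStream;
import java.io.OutputStream;
import java.net.InetAddress;
import java.net.Socket;

public class ZkQuorumCheck {
    // Extract the host part of a "host:port" connect-string entry.
    static String hostOf(String hostPort) {
        return hostPort.split(":")[0];
    }

    public static void main(String[] args) {
        String connect = "zk-0.nifi.svc.cluster.local:2181,"
                + "zk-1.nifi.svc.cluster.local:2181,"
                + "zk-2.nifi.svc.cluster.local:2181";
        for (String hostPort : connect.split(",")) {
            String host = hostOf(hostPort);
            int port = Integer.parseInt(hostPort.split(":")[1]);
            try {
                InetAddress.getAllByName(host); // DNS lookup only
                try (Socket s = new Socket(host, port)) {
                    OutputStream out = s.getOutputStream();
                    out.write("ruok".getBytes());
                    out.flush();
                    s.shutdownOutput();
                    InputStream in = s.getInputStream();
                    byte[] buf = new byte[16];
                    int n = in.read(buf);
                    System.out.println(hostPort + " answered "
                            + new String(buf, 0, Math.max(n, 0)));
                }
            } catch (Exception e) {
                System.out.println(hostPort + " failed: " + e);
            }
        }
    }
}
```

&lt;P&gt;Run from a surviving pod, zk-0 and zk-2 should answer imok, while the zk-1 entry should fail at the DNS lookup. (Note that newer ZooKeeper releases require "ruok" to be enabled via 4lw.commands.whitelist.)&lt;/P&gt;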
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;But Nifi is not working.&amp;nbsp; Attempting to connect to the UI just returns the error message: &lt;SPAN&gt;"Action cannot be performed because there is currently no Cluster Coordinator elected. The request should be tried again after a moment, after a Cluster Coordinator has been automatically elected."&lt;/SPAN&gt;&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;&lt;SPAN&gt;nifi-2 believes that it is the cluster leader:&amp;nbsp;&lt;/SPAN&gt;&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;LI-CODE lang="markup"&gt;nifi-app_2019-11-12_19.0.log:2019-11-12 19:31:32,875 INFO [Leader Election Notification Thread-1] o.a.n.c.l.e.CuratorLeaderElectionManager org.apache.nifi.controller.leader.election.CuratorLeaderElectionManager$ElectionListener@59188519 This node has been elected Leader for Role 'Primary Node'
nifi-app_2019-11-12_19.0.log:2019-11-12 19:31:32,875 INFO [Leader Election Notification Thread-4] o.a.n.c.l.e.CuratorLeaderElectionManager org.apache.nifi.controller.leader.election.CuratorLeaderElectionManager$ElectionListener@1905a793 This node has been elected Leader for Role 'Cluster Coordinator'
nifi-app_2019-11-12_19.0.log:2019-11-12 19:31:32,876 INFO [Leader Election Notification Thread-4] o.apache.nifi.controller.FlowController This node elected Active Cluster Coordinator
nifi-app_2019-11-12_19.0.log:2019-11-12 19:31:32,876 INFO [Leader Election Notification Thread-1] o.apache.nifi.controller.FlowController This node has been elected Primary Node&lt;/LI-CODE&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;&lt;BR /&gt;nifi-1, though, believes that there is no cluster leader.&amp;nbsp; It is stuck in a cycle of trying to connect to &lt;SPAN&gt;zk-1.nifi.svc.cluster.local; however, because that pod no longer exists (it crashed), its hostname is no longer resolvable (kubernetes manages the DNS in this regard).&amp;nbsp; The exact error is:&lt;/SPAN&gt;&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;LI-CODE lang="java"&gt;2019-11-12 23:15:18,232 ERROR [Leader Election Notification Thread-1] o.a.c.f.imps.CuratorFrameworkImpl Background exception was not retry-able or retry gave up
java.net.UnknownHostException: zk-1.nifi.svc.cluster.local
        at java.net.InetAddress.getAllByName0(InetAddress.java:1281)
        at java.net.InetAddress.getAllByName(InetAddress.java:1193)
        at java.net.InetAddress.getAllByName(InetAddress.java:1127)
        at org.apache.zookeeper.client.StaticHostProvider.&amp;lt;init&amp;gt;(StaticHostProvider.java:61)
        at org.apache.zookeeper.ZooKeeper.&amp;lt;init&amp;gt;(ZooKeeper.java:445)
        at org.apache.curator.utils.DefaultZookeeperFactory.newZooKeeper(DefaultZookeeperFactory.java:29)
        at org.apache.curator.framework.imps.CuratorFrameworkImpl$2.newZooKeeper(CuratorFrameworkImpl.java:150)
        at org.apache.curator.HandleHolder$1.getZooKeeper(HandleHolder.java:94)
        at org.apache.curator.HandleHolder.getZooKeeper(HandleHolder.java:55)
        at org.apache.curator.ConnectionState.getZooKeeper(ConnectionState.java:91)
        at org.apache.curator.CuratorZookeeperClient.getZooKeeper(CuratorZookeeperClient.java:116)
        at org.apache.curator.framework.imps.CuratorFrameworkImpl.performBackgroundOperation(CuratorFrameworkImpl.java:835)
        at org.apache.curator.framework.imps.CuratorFrameworkImpl.processBackgroundOperation(CuratorFrameworkImpl.java:507)
        at org.apache.curator.framework.imps.FindAndDeleteProtectedNodeInBackground.execute(FindAndDeleteProtectedNodeInBackground.java:60)
        at org.apache.curator.framework.imps.CreateBuilderImpl.protectedPathInForeground(CreateBuilderImpl.java:496)
        at org.apache.curator.framework.imps.CreateBuilderImpl.forPath(CreateBuilderImpl.java:474)
        at org.apache.curator.framework.imps.CreateBuilderImpl.forPath(CreateBuilderImpl.java:44)
        at org.apache.curator.framework.recipes.locks.StandardLockInternalsDriver.createsTheLock(StandardLockInternalsDriver.java:50)
        at org.apache.curator.framework.recipes.locks.LockInternals.attemptLock(LockInternals.java:217)
        at org.apache.curator.framework.recipes.locks.InterProcessMutex.internalLock(InterProcessMutex.java:232)
        at org.apache.curator.framework.recipes.locks.InterProcessMutex.acquire(InterProcessMutex.java:89)
        at org.apache.curator.framework.recipes.leader.LeaderSelector.doWork(LeaderSelector.java:386)
        at org.apache.curator.framework.recipes.leader.LeaderSelector.doWorkLoop(LeaderSelector.java:441)
        at org.apache.curator.framework.recipes.leader.LeaderSelector.access$100(LeaderSelector.java:64)
        at org.apache.curator.framework.recipes.leader.LeaderSelector$2.call(LeaderSelector.java:245)
        at org.apache.curator.framework.recipes.leader.LeaderSelector$2.call(LeaderSelector.java:239)
        at java.util.concurrent.FutureTask.run(FutureTask.java:266)
        at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
        at java.util.concurrent.FutureTask.run(FutureTask.java:266)
        at org.apache.nifi.engine.FlowEngine$2.run(FlowEngine.java:110)
        at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
        at java.util.concurrent.FutureTask.run(FutureTask.java:266)
        at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:180)
        at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
        at java.lang.Thread.run(Thread.java:748)&lt;/LI-CODE&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;This error repeats as the only entry in nifi-app.log for hours; nifi-1 does not appear to be attempting a connection to zk-0 or zk-2 at all.&amp;nbsp; nifi-2 does not have these errors.&lt;/P&gt;
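&lt;P&gt;If I read the trace right, the resolution happens up front for every host in the connect string (the stack shows InetAddress.getAllByName being called from StaticHostProvider's constructor), so a single dead DNS name can fail the client before any connection is even attempted. A tiny illustration of just that Java behavior (my own example; the hostnames here are placeholders, and the .invalid TLD is reserved per RFC 2606 so it never resolves):&lt;/P&gt;

```java
// Illustration only: InetAddress.getAllByName throws UnknownHostException
// for a name with no DNS record -- the same call my stack trace shows
// inside StaticHostProvider's constructor.
import java.net.InetAddress;
import java.net.UnknownHostException;

public class ResolveAllDemo {
    // True only if every host resolves; one dead name fails the whole
    // batch, loosely mirroring how one bad connect-string entry can
    // abort the client before it tries the healthy servers.
    static boolean allResolve(String... hosts) {
        for (String h : hosts) {
            try {
                InetAddress.getAllByName(h);
            } catch (UnknownHostException e) {
                System.out.println("cannot resolve: " + h);
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        // "localhost" always resolves; "zk-1.invalid" never does
        // (.invalid is a reserved TLD, RFC 2606).
        System.out.println(allResolve("localhost"));
        System.out.println(allResolve("localhost", "zk-1.invalid"));
    }
}
```

&lt;P&gt;The second call fails on the dead name even though the first host is perfectly reachable, which looks like exactly what nifi-1 is hitting.&lt;/P&gt;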
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;So, the questions:&lt;/P&gt;
&lt;P&gt;&lt;BR /&gt;1. Should Nifi be able to handle the situation where it can't resolve or connect to a Zookeeper address?&lt;/P&gt;
&lt;P&gt;2. Is there any reason why a Nifi node might get "stuck" on a particular Zookeeper instance, or not attempt to try other instances?&lt;/P&gt;
    <pubDate>Wed, 13 Nov 2019 06:36:36 GMT</pubDate>
    <dc:creator>cmcguigan</dc:creator>
    <dc:date>2019-11-13T06:36:36Z</dc:date>
  </channel>
</rss>

