Created 06-27-2017 10:08 PM
Hello,
I have a 3-node cluster, all running NiFi version nifi-1.2.0.3.0.0.0-453. The cluster has been working fine for the last couple of weeks, but today one of the nodes suddenly disconnected from the cluster and won't rejoin. I checked the logs and the error I see is the following:
ERROR [Curator-Framework-0] o.a.c.f.imps.CuratorFrameworkImpl Background retry gave up
org.apache.curator.CuratorConnectionLossException: KeeperErrorCode = ConnectionLoss
    at org.apache.curator.framework.imps.CuratorFrameworkImpl.performBackgroundOperation(CuratorFrameworkImpl.java:838)
    at org.apache.curator.framework.imps.CuratorFrameworkImpl.backgroundOperationsLoop(CuratorFrameworkImpl.java:809)
    at org.apache.curator.framework.imps.CuratorFrameworkImpl.access$300(CuratorFrameworkImpl.java:64)
    at org.apache.curator.framework.imps.CuratorFrameworkImpl$4.call(CuratorFrameworkImpl.java:267)
    at java.util.concurrent.FutureTask.run(FutureTask.java:266)
    at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:180)
    at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
    at java.lang.Thread.run(Thread.java:745)
2017-06-27 17:54:40,179 ERROR [Curator-Framework-0] o.a.c.f.imps.CuratorFrameworkImpl Background operation retry gave up
org.apache.zookeeper.KeeperException$ConnectionLossException: KeeperErrorCode = ConnectionLoss
    at org.apache.zookeeper.KeeperException.create(KeeperException.java:99)
    at org.apache.curator.framework.imps.CuratorFrameworkImpl.checkBackgroundRetry(CuratorFrameworkImpl.java:728)
    at org.apache.curator.framework.imps.CuratorFrameworkImpl.performBackgroundOperation(CuratorFrameworkImpl.java:857)
    at org.apache.curator.framework.imps.CuratorFrameworkImpl.backgroundOperationsLoop(CuratorFrameworkImpl.java:809)
    at org.apache.curator.framework.imps.CuratorFrameworkImpl.access$300(CuratorFrameworkImpl.java:64)
    at org.apache.curator.framework.imps.CuratorFrameworkImpl$4.call(CuratorFrameworkImpl.java:267)
    at java.util.concurrent.FutureTask.run(FutureTask.java:266)
    at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:180)
    at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
    at java.lang.Thread.run(Thread.java:745)
The node itself appears to still be running, but it is disconnected from the cluster. I tried restarting it, but the same error keeps appearing over and over. The other two nodes in the cluster are working fine. Does anyone have any idea what could be causing this sudden issue? Any insight would be greatly appreciated.
Created 06-28-2017 11:49 PM
It looks like you are seeing this issue: Background retry falls into infinite loop of reconnection after connection loss
Are all of the zookeeper instances running? Are you seeing any messages in the zookeeper logs?
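If it helps, a quick way to check each ZooKeeper server from the NiFi node is with the four-letter-word commands. This is just a sketch; the hostname and port below are placeholders for your actual ensemble:

echo ruok | nc zk-host-1 2181    # a healthy server answers "imok"
echo stat | nc zk-host-1 2181    # prints the server mode (leader/follower) and connected clients

Run that against each ZooKeeper server in the connect string and compare what you see from a healthy NiFi node versus the disconnected one.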
Created 06-29-2017 02:47 PM
@Wynner yes, all of my ZooKeeper instances are running. We use an external ZooKeeper, not the NiFi embedded ZooKeeper, and all of the instances have been running fine. On the day this issue started, one of the instances apparently was having problems, but since yesterday all of the instances have been healthy and all of the services appear to be running. Even so, this node still cannot connect to ZooKeeper, while the other two nodes connect to ZooKeeper and join the cluster without any problem.
Created 06-29-2017 04:37 PM
What do the ZooKeeper logs show for the node that is having issues? Do they show the node trying to connect to ZooKeeper?
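For example, something along these lines on each ZooKeeper server (the log path and IP address are placeholders for your environment):

grep '10.0.0.3' /var/log/zookeeper/zookeeper.log | tail -n 20    # 10.0.0.3 = IP of the problem NiFi node

If the node never shows up, its connection attempts are not reaching ZooKeeper at all.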
Created 06-29-2017 07:11 PM
@Wynner, no, the node does not show up in the ZooKeeper logs at all. In the ZooKeeper logs I see entries for the other two nodes in the cluster, but nothing for the one that is having problems.
Created 06-30-2017 12:30 PM
Are you able to ping the zookeeper systems from the NiFi node that is having the issue?
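It may also be worth confirming that the problem node's nifi.properties still points at the same external ensemble as the healthy nodes, with matching timeouts. A quick sketch (the path and values below are examples, not your actual settings):

grep 'nifi.zookeeper' /path/to/nifi/conf/nifi.properties
# typical entries:
# nifi.zookeeper.connect.string=zk-host-1:2181,zk-host-2:2181,zk-host-3:2181
# nifi.zookeeper.connect.timeout=3 secs
# nifi.zookeeper.session.timeout=3 secs
# nifi.zookeeper.root.node=/nifi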
I found an article where another user is seeing this issue. They said they cleared the state in the state/zookeeper directory on all of the nodes (without removing the myid file) and then restarted all of the nodes at the same time. I don't know if this is an option for you or not. Here is a link to the article I found: Zookeeper error
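If you do go that route, here is a rough sketch of the clean-up they described, assuming the default state/zookeeper layout under the NiFi install directory. Stop NiFi on every node first, keep the myid file, then start all of the nodes back up together:

bin/nifi.sh stop
find ./state/zookeeper -mindepth 1 -not -name 'myid' -delete    # clears the state but keeps myid
bin/nifi.sh start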
Created 06-30-2017 06:04 PM
I just saw the same error in a cluster I have for testing. I was able to reproduce the error, and the only way it would clear was to restart all of the nodes in my cluster at the same time.
Have you tried that with your cluster yet?
Created 06-30-2017 09:43 PM
@Wynner
It took a few attempts, but after restarting all of the nodes at the same time a couple of times, the node was able to rejoin the cluster. Thanks for the help!
Created 11-04-2021 08:11 PM
Hi, may I know if you have managed to solve this problem? I was configuring a NiFi cluster on VMs with an external ZooKeeper and ran into this problem as well. I have been struggling with this issue for weeks and still have no idea how to solve it.