
zookeeper connection error in NiFi version nifi-1.2.0.3.0.0.0-453

Rising Star

Hello,

I have a 3-node cluster, with every node running NiFi version nifi-1.2.0.3.0.0.0-453. The cluster had been working fine for the last couple of weeks, but today one of the nodes suddenly disconnected from the cluster and won't rejoin it. I checked the logs on that node and see the following error:

ERROR [Curator-Framework-0] o.a.c.f.imps.CuratorFrameworkImpl Background retry gave up
org.apache.curator.CuratorConnectionLossException: KeeperErrorCode = ConnectionLoss
	at org.apache.curator.framework.imps.CuratorFrameworkImpl.performBackgroundOperation(CuratorFrameworkImpl.java:838)
	at org.apache.curator.framework.imps.CuratorFrameworkImpl.backgroundOperationsLoop(CuratorFrameworkImpl.java:809)
	at org.apache.curator.framework.imps.CuratorFrameworkImpl.access$300(CuratorFrameworkImpl.java:64)
	at org.apache.curator.framework.imps.CuratorFrameworkImpl$4.call(CuratorFrameworkImpl.java:267)
	at java.util.concurrent.FutureTask.run(FutureTask.java:266)
	at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:180)
	at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
	at java.lang.Thread.run(Thread.java:745)
2017-06-27 17:54:40,179 ERROR [Curator-Framework-0] o.a.c.f.imps.CuratorFrameworkImpl Background operation retry gave up
org.apache.zookeeper.KeeperException$ConnectionLossException: KeeperErrorCode = ConnectionLoss
	at org.apache.zookeeper.KeeperException.create(KeeperException.java:99)
	at org.apache.curator.framework.imps.CuratorFrameworkImpl.checkBackgroundRetry(CuratorFrameworkImpl.java:728)
	at org.apache.curator.framework.imps.CuratorFrameworkImpl.performBackgroundOperation(CuratorFrameworkImpl.java:857)
	at org.apache.curator.framework.imps.CuratorFrameworkImpl.backgroundOperationsLoop(CuratorFrameworkImpl.java:809)
	at org.apache.curator.framework.imps.CuratorFrameworkImpl.access$300(CuratorFrameworkImpl.java:64)
	at org.apache.curator.framework.imps.CuratorFrameworkImpl$4.call(CuratorFrameworkImpl.java:267)
	at java.util.concurrent.FutureTask.run(FutureTask.java:266)
	at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:180)
	at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
	at java.lang.Thread.run(Thread.java:745)

The node itself still appears to be running, but it is disconnected from the cluster. I tried restarting it, but the same error keeps appearing. The other two nodes in the cluster are working fine. Does anyone have any idea what could be causing this sudden issue? Any insights would be greatly appreciated.

1 ACCEPTED SOLUTION

@Adda Fuentes

It looks like you are seeing this issue: Background retry falls into infinite loop of reconnection after connection loss

Are all of the zookeeper instances running? Are you seeing any messages in the zookeeper logs?

7 REPLIES

Rising Star

@Wynner yes, all of my zookeeper instances are running. We use an external zookeeper, not the NiFi embedded zookeeper. On the day this issue started, one of the zookeeper instances was apparently having problems, but since yesterday all of the instances have been working fine and all the services seem to be running. Even so, the node still cannot connect to zookeeper, while the other two nodes connect to zookeeper and join the cluster without any trouble.

What do the zookeeper logs show for the node that is having issues? Does it show the node trying to connect to zookeeper?

Rising Star

@Wynner, no, the zookeeper logs show nothing for that node. I see log entries for the other two nodes in the cluster, but none for the one that is having problems.

@Adda Fuentes

Are you able to ping the zookeeper systems from the NiFi node that is having the issue?
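
Ping only proves the host is reachable, so it can also help to test the ZooKeeper client port directly from the problem node. Below is a minimal Python sketch, not an official NiFi or ZooKeeper tool. It assumes the servers accept ZooKeeper's "ruok" four-letter command (newer ZooKeeper releases require it to be whitelisted) and that the connect string has the same host:port,host:port form used for nifi.zookeeper.connect.string. The host names in the usage comment are hypothetical, and IPv6 literals are not handled.

```python
import socket


def parse_connect_string(connect_string):
    """Split 'host1:2181,host2:2181/chroot' into (host, port) pairs.

    A trailing chroot path is ignored and a missing port defaults to 2181.
    """
    hosts_part = connect_string.split("/", 1)[0]
    servers = []
    for entry in hosts_part.split(","):
        entry = entry.strip()
        if not entry:
            continue
        host, _, port = entry.partition(":")
        servers.append((host, int(port) if port else 2181))
    return servers


def zookeeper_is_ok(host, port, timeout=5.0):
    """Return True if the server replies 'imok' to the 'ruok' command.

    Any connection error or timeout is reported as False.
    """
    try:
        with socket.create_connection((host, port), timeout=timeout) as conn:
            conn.sendall(b"ruok")
            reply = conn.recv(16)
            return reply.strip() == b"imok"
    except OSError:
        return False


# Example usage (hypothetical host names, run on the NiFi node that fails):
#
# for host, port in parse_connect_string("zk1.example.com:2181,zk2.example.com:2181"):
#     print(host, port, "ok" if zookeeper_is_ok(host, port) else "NOT reachable")
```

If every server answers from the healthy nodes but not from the failing one, that points at a network or firewall problem on that node rather than at ZooKeeper itself.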

I found an article where another user saw this issue. They said they cleared the state in the state/zookeeper directory on all of the nodes, leaving the myid file in place, and then restarted all of the nodes at the same time. I don't know if this is an option for you or not. Here is a link to the article I found: Zookeeper error
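
For anyone who wants to script the cleanup described in that article, here is a rough Python sketch under stated assumptions. The function name and the example path are hypothetical, the directory is assumed to be the state/zookeeper directory mentioned above, and NiFi should be stopped on the node before anything is deleted. It defaults to a dry run that only lists what would be removed.

```python
import shutil
from pathlib import Path


def clear_zookeeper_state(state_dir, keep=("myid",), dry_run=True):
    """Remove every entry in state_dir except the names listed in keep.

    Returns the sorted list of paths that were (or, in a dry run, would be)
    removed. Nothing is deleted unless dry_run is False.
    """
    root = Path(state_dir)
    targets = sorted(p for p in root.iterdir() if p.name not in keep)
    if not dry_run:
        for path in targets:
            if path.is_dir() and not path.is_symlink():
                shutil.rmtree(path)  # e.g. a snapshot/transaction log directory
            else:
                path.unlink()
    return targets


# Example usage (hypothetical install path):
#
# for path in clear_zookeeper_state("/opt/nifi/state/zookeeper"):
#     print("would remove", path)
# clear_zookeeper_state("/opt/nifi/state/zookeeper", dry_run=False)
```

Reviewing the dry-run listing first is deliberate, since deleting the wrong directory or the myid file would leave the node unable to identify itself.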

@Adda Fuentes

I just saw the same error in a cluster I have for testing. I was able to make the error occur and the only way it would clear is if I restarted all of the nodes in my cluster at the same time.

Have you tried that with your cluster yet?

Rising Star

@Wynner

It took a couple of attempts, but after restarting all of the nodes at the same time the node was able to join the cluster. Thanks for the help!