Created 04-24-2017 02:48 PM
Hello,
I run a 3-node NiFi cluster on NiFi 1.1.0. The cluster had been running without issues for the last couple of months, but when I checked it today one of the nodes had suddenly disconnected and will not rejoin the cluster. I checked the node's logs and the following error keeps appearing nonstop:
ERROR [Curator-Framework-0] o.a.c.f.imps.CuratorFrameworkImpl Background operation retry gave up
org.apache.zookeeper.KeeperException$ConnectionLossException: KeeperErrorCode = ConnectionLoss
        at org.apache.zookeeper.KeeperException.create(KeeperException.java:99) ~[zookeeper-3.4.6.jar:3.4.6-1569965]
        at org.apache.curator.framework.imps.CuratorFrameworkImpl.checkBackgroundRetry(CuratorFrameworkImpl.java:728) [curator-framework-2.11.0.jar:na]
        at org.apache.curator.framework.imps.CuratorFrameworkImpl.performBackgroundOperation(CuratorFrameworkImpl.java:857) [curator-framework-2.11.0.jar:na]
        at org.apache.curator.framework.imps.CuratorFrameworkImpl.backgroundOperationsLoop(CuratorFrameworkImpl.java:809) [curator-framework-2.11.0.jar:na]
        at org.apache.curator.framework.imps.CuratorFrameworkImpl.access$300(CuratorFrameworkImpl.java:64) [curator-framework-2.11.0.jar:na]
        at org.apache.curator.framework.imps.CuratorFrameworkImpl$4.call(CuratorFrameworkImpl.java:267) [curator-framework-2.11.0.jar:na]
        at java.util.concurrent.FutureTask.run(FutureTask.java:266) [na:1.8.0_45]
        at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:180) [na:1.8.0_45]
        at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293) [na:1.8.0_45]
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) [na:1.8.0_45]
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) [na:1.8.0_45]
        at java.lang.Thread.run(Thread.java:745) [na:1.8.0_45]
2017-04-24 10:31:34,694 ERROR [Curator-Framework-0] o.a.c.f.imps.CuratorFrameworkImpl Background retry gave up
org.apache.curator.CuratorConnectionLossException: KeeperErrorCode = ConnectionLoss
        at org.apache.curator.framework.imps.CuratorFrameworkImpl.performBackgroundOperation(CuratorFrameworkImpl.java:838) [curator-framework-2.11.0.jar:na]
        at org.apache.curator.framework.imps.CuratorFrameworkImpl.backgroundOperationsLoop(CuratorFrameworkImpl.java:809) [curator-framework-2.11.0.jar:na]
        at org.apache.curator.framework.imps.CuratorFrameworkImpl.access$300(CuratorFrameworkImpl.java:64) [curator-framework-2.11.0.jar:na]
        at org.apache.curator.framework.imps.CuratorFrameworkImpl$4.call(CuratorFrameworkImpl.java:267) [curator-framework-2.11.0.jar:na]
        at java.util.concurrent.FutureTask.run(FutureTask.java:266) [na:1.8.0_45]
        at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:180) [na:1.8.0_45]
        at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293) [na:1.8.0_45]
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) [na:1.8.0_45]
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) [na:1.8.0_45]
        at java.lang.Thread.run(Thread.java:745) [na:1.8.0_45]
I have never encountered this issue before and was hoping someone could give me an idea of what might be causing it or how it can be fixed. I am a bit confused, as no changes have been made to the configuration of the node or the cluster, and the other two nodes are working completely fine. Any insight into this issue would be greatly appreciated.
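In case it helps narrow things down, I can run a minimal Curator client from the affected node to rule out basic ZooKeeper reachability; a sketch of what I have in mind is below (the connect string and timeouts are placeholders, not our real configuration):

import org.apache.curator.framework.CuratorFramework;
import org.apache.curator.framework.CuratorFrameworkFactory;
import org.apache.curator.retry.ExponentialBackoffRetry;
import java.util.List;
import java.util.concurrent.TimeUnit;

public class ZkConnectivityCheck {
    public static void main(String[] args) throws Exception {
        // Placeholder connect string and timeouts -- substitute the values
        // configured on the affected node (nifi.properties / state-management.xml).
        String connectString = "zk1.example.com:2181,zk2.example.com:2181,zk3.example.com:2181";
        CuratorFramework client = CuratorFrameworkFactory.newClient(
                connectString, 60000, 15000, new ExponentialBackoffRetry(1000, 3));
        client.start();
        try {
            if (!client.blockUntilConnected(30, TimeUnit.SECONDS)) {
                System.out.println("Could not connect to ZooKeeper within 30 seconds");
                return;
            }
            // A simple read to prove the session works end to end, not just the TCP connect.
            List<String> children = client.getChildren().forPath("/");
            System.out.println("Connected; root znodes: " + children);
        } finally {
            client.close();
        }
    }
}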
Created 04-24-2017 02:53 PM
Check all of your NiFi repositories and make sure none of your disk partitions are at 100%.
Are your ZooKeepers running embedded?
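If it helps, a quick way to eyeball how full the repository partitions are is something like the small check below; the three paths are the NiFi defaults, so adjust them to whatever your nifi.properties actually points at:

import java.nio.file.FileStore;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

public class RepoDiskCheck {
    public static void main(String[] args) throws Exception {
        // Default repository locations relative to the NiFi home directory;
        // adjust these to match the directories in your nifi.properties.
        String[] repos = {"./flowfile_repository", "./content_repository", "./provenance_repository"};
        for (String repo : repos) {
            Path path = Paths.get(repo);
            if (!Files.exists(path)) {
                System.out.println(repo + ": not found (check nifi.properties for the real location)");
                continue;
            }
            // Report how full the partition backing this repository is.
            FileStore store = Files.getFileStore(path);
            long usedPct = 100 - (store.getUsableSpace() * 100 / store.getTotalSpace());
            System.out.printf("%s -> partition %s is %d%% used%n", repo, store, usedPct);
        }
    }
}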
Created 04-24-2017 02:58 PM
@Wynner, for this cluster I do not use the embedded ZooKeeper; we have our own ZooKeeper cluster, and that is the one managing the NiFi cluster. However, I checked ZooKeeper and everything seems fine there. I will check the NiFi repositories and the space on the disk partitions.
Created 04-24-2017 03:20 PM
@Wynner, all of the disk partitions are at less than 10% usage, and I just tested disconnecting and reconnecting the other nodes, which use the exact same ZooKeeper connection string as the problem node; they rejoin the cluster with no issues or errors. Could something have corrupted a file on this node and be causing the error?
Created 04-24-2017 03:44 PM
When I have seen this error, it was caused either by a corrupt flowfile repository file (in that case the result of a partition filling to 100%), or by ZooKeeper servers running embedded in NiFi on systems that were unable to respond within the timeout period.
As a test/workaround, try dropping the node out of the cluster and see if it can run standalone. If that does not work, stop NiFi, move the current data out of the flowfile, content, and provenance repositories, and restart NiFi; it should then rejoin the cluster.
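If you do end up moving the repositories, with NiFi stopped it can be as simple as renaming the directories aside, roughly like the sketch below; the paths are again the defaults, so match them to your nifi.properties (NiFi recreates empty repositories on startup):

import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.time.LocalDateTime;
import java.time.format.DateTimeFormatter;

public class MoveNiFiRepos {
    public static void main(String[] args) throws Exception {
        // Default repository directory names under the NiFi home directory;
        // adjust to match the locations in your nifi.properties.
        String[] repos = {"flowfile_repository", "content_repository", "provenance_repository"};
        String suffix = "." + LocalDateTime.now().format(DateTimeFormatter.ofPattern("yyyyMMddHHmmss")) + ".bak";
        Path nifiHome = Paths.get(args.length > 0 ? args[0] : ".");
        for (String repo : repos) {
            Path source = nifiHome.resolve(repo);
            if (Files.isDirectory(source)) {
                // Rename the directory aside; run this only while NiFi is stopped.
                Path target = nifiHome.resolve(repo + suffix);
                Files.move(source, target);
                System.out.println("Moved " + source + " to " + target);
            }
        }
    }
}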
Created 04-24-2017 08:50 PM
@Wynner, I was able to rejoin the cluster by moving the data in the repositories and restarting the node like you said. Thanks for the help!
Created 10-27-2017 05:12 AM
Hi,
Have you solved this problem? I have run into the same issue.
Created 10-27-2017 11:47 AM
Hi @Xu Zhe
As mentioned in the comments by Wynner, the solution I used was to clean up the NiFi repositories and restart the cluster nodes at the same time.
Created 10-30-2017 05:12 AM
Thanks. Is your cluster stable now?
Created 10-30-2017 06:36 PM
@Xu Zhe, yes, the cluster is fully stable now.