Created 04-24-2017 02:48 PM
Hello,
I run a 3-node NiFi cluster on NiFi 1.1.0. The cluster had been running without issues for the last couple of months, but when I checked it today one of the nodes had suddenly disconnected and will not rejoin the cluster. I checked the node's logs and the following error keeps appearing nonstop:
ERROR [Curator-Framework-0] o.a.c.f.imps.CuratorFrameworkImpl Background operation retry gave up
org.apache.zookeeper.KeeperException$ConnectionLossException: KeeperErrorCode = ConnectionLoss
        at org.apache.zookeeper.KeeperException.create(KeeperException.java:99) ~[zookeeper-3.4.6.jar:3.4.6-1569965]
        at org.apache.curator.framework.imps.CuratorFrameworkImpl.checkBackgroundRetry(CuratorFrameworkImpl.java:728) [curator-framework-2.11.0.jar:na]
        at org.apache.curator.framework.imps.CuratorFrameworkImpl.performBackgroundOperation(CuratorFrameworkImpl.java:857) [curator-framework-2.11.0.jar:na]
        at org.apache.curator.framework.imps.CuratorFrameworkImpl.backgroundOperationsLoop(CuratorFrameworkImpl.java:809) [curator-framework-2.11.0.jar:na]
        at org.apache.curator.framework.imps.CuratorFrameworkImpl.access$300(CuratorFrameworkImpl.java:64) [curator-framework-2.11.0.jar:na]
        at org.apache.curator.framework.imps.CuratorFrameworkImpl$4.call(CuratorFrameworkImpl.java:267) [curator-framework-2.11.0.jar:na]
        at java.util.concurrent.FutureTask.run(FutureTask.java:266) [na:1.8.0_45]
        at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:180) [na:1.8.0_45]
        at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293) [na:1.8.0_45]
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) [na:1.8.0_45]
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) [na:1.8.0_45]
        at java.lang.Thread.run(Thread.java:745) [na:1.8.0_45]
2017-04-24 10:31:34,694 ERROR [Curator-Framework-0] o.a.c.f.imps.CuratorFrameworkImpl Background retry gave up
org.apache.curator.CuratorConnectionLossException: KeeperErrorCode = ConnectionLoss
        at org.apache.curator.framework.imps.CuratorFrameworkImpl.performBackgroundOperation(CuratorFrameworkImpl.java:838) [curator-framework-2.11.0.jar:na]
        at org.apache.curator.framework.imps.CuratorFrameworkImpl.backgroundOperationsLoop(CuratorFrameworkImpl.java:809) [curator-framework-2.11.0.jar:na]
        at org.apache.curator.framework.imps.CuratorFrameworkImpl.access$300(CuratorFrameworkImpl.java:64) [curator-framework-2.11.0.jar:na]
        at org.apache.curator.framework.imps.CuratorFrameworkImpl$4.call(CuratorFrameworkImpl.java:267) [curator-framework-2.11.0.jar:na]
        at java.util.concurrent.FutureTask.run(FutureTask.java:266) [na:1.8.0_45]
        at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:180) [na:1.8.0_45]
        at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293) [na:1.8.0_45]
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) [na:1.8.0_45]
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) [na:1.8.0_45]
        at java.lang.Thread.run(Thread.java:745) [na:1.8.0_45]
I have never encountered this issue before and was hoping someone could give me an idea of what might be causing it or how it can be fixed. I am a bit confused, as no changes have been made to the configuration of the node or the cluster, and the other two nodes are working completely fine. Any insight into this issue would be greatly appreciated.
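In case it helps narrow things down, I can run a minimal Curator client from the affected node to rule out basic ZooKeeper reachability; a sketch of what I have in mind is below (the connect string and timeouts are placeholders, not our real configuration):

import org.apache.curator.framework.CuratorFramework;
import org.apache.curator.framework.CuratorFrameworkFactory;
import org.apache.curator.retry.ExponentialBackoffRetry;
import java.util.List;
import java.util.concurrent.TimeUnit;

public class ZkConnectivityCheck {
    public static void main(String[] args) throws Exception {
        // Placeholder connect string and timeouts -- substitute the values
        // configured on the affected node (nifi.properties / state-management.xml).
        String connectString = "zk1.example.com:2181,zk2.example.com:2181,zk3.example.com:2181";
        CuratorFramework client = CuratorFrameworkFactory.newClient(
                connectString, 60000, 15000, new ExponentialBackoffRetry(1000, 3));
        client.start();
        try {
            if (!client.blockUntilConnected(30, TimeUnit.SECONDS)) {
                System.out.println("Could not connect to ZooKeeper within 30 seconds");
                return;
            }
            // A simple read to prove the session works end to end, not just the TCP connect.
            List<String> children = client.getChildren().forPath("/");
            System.out.println("Connected; root znodes: " + children);
        } finally {
            client.close();
        }
    }
}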
Created 04-24-2017 02:53 PM
Check all of your NiFi repositories and make sure none of your disk partitions are at 100%.
Are your ZooKeepers running embedded?
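If it helps, a quick way to eyeball how full the repository partitions are is something like the small check below; the three paths are the NiFi defaults, so adjust them to whatever your nifi.properties actually points at:

import java.nio.file.FileStore;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

public class RepoDiskCheck {
    public static void main(String[] args) throws Exception {
        // Default repository locations relative to the NiFi home directory;
        // adjust these to match the directories in your nifi.properties.
        String[] repos = {"./flowfile_repository", "./content_repository", "./provenance_repository"};
        for (String repo : repos) {
            Path path = Paths.get(repo);
            if (!Files.exists(path)) {
                System.out.println(repo + ": not found (check nifi.properties for the real location)");
                continue;
            }
            // Report how full the partition backing this repository is.
            FileStore store = Files.getFileStore(path);
            long usedPct = 100 - (store.getUsableSpace() * 100 / store.getTotalSpace());
            System.out.printf("%s -> partition %s is %d%% used%n", repo, store, usedPct);
        }
    }
}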
Created 04-24-2017 02:58 PM
@Wynner, for this cluster I do not use the embedded ZooKeeper; we have our own ZooKeeper cluster, and that is the one managing the NiFi cluster. However, I checked ZooKeeper and everything seems fine there. I will check the NiFi repositories and the space on the disk partitions.
Created 04-24-2017 03:20 PM
@Wynner, all of the disk partitions are at less than 10% usage, and I just tested disconnecting and reconnecting the other nodes, which use the exact same ZooKeeper connection string as the problem node; they rejoin the cluster with no issues or errors. Could something have corrupted a file on this node and be causing the error?
Created 04-24-2017 03:44 PM
When I have seen this error, it was caused either by a corrupt flowfile repository file (in that case the result of a partition filling to 100%), or by ZooKeeper servers running embedded in NiFi on systems that were unable to respond within the timeout period.
As a test/workaround, try dropping the node out of the cluster and see if it can run standalone. If that does not work, stop NiFi, move the current data out of the flowfile, content, and provenance repositories, and restart NiFi; it should then rejoin the cluster.
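If you do end up moving the repositories, with NiFi stopped it can be as simple as renaming the directories aside, roughly like the sketch below; the paths are again the defaults, so match them to your nifi.properties (NiFi recreates empty repositories on startup):

import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.time.LocalDateTime;
import java.time.format.DateTimeFormatter;

public class MoveNiFiRepos {
    public static void main(String[] args) throws Exception {
        // Default repository directory names under the NiFi home directory;
        // adjust to match the locations in your nifi.properties.
        String[] repos = {"flowfile_repository", "content_repository", "provenance_repository"};
        String suffix = "." + LocalDateTime.now().format(DateTimeFormatter.ofPattern("yyyyMMddHHmmss")) + ".bak";
        Path nifiHome = Paths.get(args.length > 0 ? args[0] : ".");
        for (String repo : repos) {
            Path source = nifiHome.resolve(repo);
            if (Files.isDirectory(source)) {
                // Rename the directory aside; run this only while NiFi is stopped.
                Path target = nifiHome.resolve(repo + suffix);
                Files.move(source, target);
                System.out.println("Moved " + source + " to " + target);
            }
        }
    }
}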
Created 04-24-2017 08:50 PM
@Wynner, I was able to rejoin the cluster by moving the data in the repositories and restarting the node like you said. Thanks for the help!
Created 10-27-2017 05:12 AM
Hi,
Have you solved this problem? I have run into the same issue.
Created 10-27-2017 11:47 AM
Hi @Xu Zhe
As mentioned in the comments by Wynner, the solution I used was to clean up the NiFi repositories and restart the cluster nodes at the same time.
Created 10-30-2017 05:12 AM
Thanks. Is your cluster stable now?
Created 10-30-2017 06:36 PM
@Xu Zhe, yes, the cluster is fully stable now.