NiFi-1.1.0 "zookeeper.KeeperException$ConnectionLossException" error
Labels: Apache NiFi
Created 04-24-2017 02:48 PM
Hello,
I run a three-node NiFi cluster on NiFi-1.1.0. The cluster had been running without issues for the last couple of months, but when I checked it today one of the nodes had suddenly disconnected and it will not rejoin the cluster. I checked that node's logs and the following error keeps appearing non-stop:
ERROR [Curator-Framework-0] o.a.c.f.imps.CuratorFrameworkImpl Background operation retry gave up
org.apache.zookeeper.KeeperException$ConnectionLossException: KeeperErrorCode = ConnectionLoss
    at org.apache.zookeeper.KeeperException.create(KeeperException.java:99) ~[zookeeper-3.4.6.jar:3.4.6-1569965]
    at org.apache.curator.framework.imps.CuratorFrameworkImpl.checkBackgroundRetry(CuratorFrameworkImpl.java:728) [curator-framework-2.11.0.jar:na]
    at org.apache.curator.framework.imps.CuratorFrameworkImpl.performBackgroundOperation(CuratorFrameworkImpl.java:857) [curator-framework-2.11.0.jar:na]
    at org.apache.curator.framework.imps.CuratorFrameworkImpl.backgroundOperationsLoop(CuratorFrameworkImpl.java:809) [curator-framework-2.11.0.jar:na]
    at org.apache.curator.framework.imps.CuratorFrameworkImpl.access$300(CuratorFrameworkImpl.java:64) [curator-framework-2.11.0.jar:na]
    at org.apache.curator.framework.imps.CuratorFrameworkImpl$4.call(CuratorFrameworkImpl.java:267) [curator-framework-2.11.0.jar:na]
    at java.util.concurrent.FutureTask.run(FutureTask.java:266) [na:1.8.0_45]
    at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:180) [na:1.8.0_45]
    at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293) [na:1.8.0_45]
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) [na:1.8.0_45]
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) [na:1.8.0_45]
    at java.lang.Thread.run(Thread.java:745) [na:1.8.0_45]
2017-04-24 10:31:34,694 ERROR [Curator-Framework-0] o.a.c.f.imps.CuratorFrameworkImpl Background retry gave up
org.apache.curator.CuratorConnectionLossException: KeeperErrorCode = ConnectionLoss
    at org.apache.curator.framework.imps.CuratorFrameworkImpl.performBackgroundOperation(CuratorFrameworkImpl.java:838) [curator-framework-2.11.0.jar:na]
    at org.apache.curator.framework.imps.CuratorFrameworkImpl.backgroundOperationsLoop(CuratorFrameworkImpl.java:809) [curator-framework-2.11.0.jar:na]
    at org.apache.curator.framework.imps.CuratorFrameworkImpl.access$300(CuratorFrameworkImpl.java:64) [curator-framework-2.11.0.jar:na]
    at org.apache.curator.framework.imps.CuratorFrameworkImpl$4.call(CuratorFrameworkImpl.java:267) [curator-framework-2.11.0.jar:na]
    at java.util.concurrent.FutureTask.run(FutureTask.java:266) [na:1.8.0_45]
    at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:180) [na:1.8.0_45]
    at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293) [na:1.8.0_45]
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) [na:1.8.0_45]
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) [na:1.8.0_45]
    at java.lang.Thread.run(Thread.java:745) [na:1.8.0_45]
I have never encountered this issue before and wanted to know if someone could give me an idea of what might be causing it or how it could be fixed. I am a bit confused, as no changes have been made to the configuration of the node or the cluster whatsoever, and the other two nodes are working completely fine. Any insight into this issue would be greatly appreciated.
Created 04-24-2017 02:53 PM
Check all of your NiFi repositories and make sure none of your disk partitions are at 100%.
Are your ZooKeeper servers running embedded in NiFi?
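If it helps, a quick way to spot a full partition is to check the usage of whatever paths hold the repositories. Here is a rough sketch in Python; the repository paths below are only placeholders for the defaults under the NiFi install directory, so substitute whatever your nifi.properties configures:

    import shutil

    # Placeholder repository locations (defaults relative to the NiFi install dir);
    # substitute the paths configured in your nifi.properties.
    repos = [
        "./flowfile_repository",
        "./content_repository",
        "./provenance_repository",
    ]

    for path in repos:
        usage = shutil.disk_usage(path)  # total/used/free bytes of the partition holding this path
        pct_used = usage.used / usage.total * 100
        print(f"{path}: {pct_used:.1f}% used, {usage.free / 1024**3:.1f} GiB free")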
Created 04-24-2017 02:58 PM
@Wynner, for this cluster I do not use embedded ZooKeeper; we have our own ZooKeeper cluster, and that is the one managing the NiFi cluster. However, I checked ZooKeeper and everything seems fine on it. I will check the NiFi repositories and the space on the disk partitions.
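For reference, one quick way to probe each ensemble member is ZooKeeper's ruok four-letter-word command. A minimal sketch; the hostnames are placeholders for your own ensemble members, and on newer ZooKeeper versions ruok may need to be whitelisted via 4lw.commands.whitelist:

    import socket

    # Placeholder host/port pairs for the external ZooKeeper ensemble members.
    ensemble = [("zk1.example.com", 2181), ("zk2.example.com", 2181), ("zk3.example.com", 2181)]

    for host, port in ensemble:
        try:
            with socket.create_connection((host, port), timeout=5) as sock:
                sock.sendall(b"ruok")            # four-letter-word health check
                reply = sock.recv(16).decode()   # a healthy server replies "imok"
            print(f"{host}:{port} -> {reply}")
        except OSError as exc:
            print(f"{host}:{port} -> unreachable ({exc})")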
Created 04-24-2017 03:20 PM
@Wynner, all of the disk partitions are at less than 10% usage. I also tested disconnecting and reconnecting the other nodes, which use the exact same ZooKeeper connection string as the problem node, and they rejoin the cluster with no issues or errors. Could a file have become corrupted on this node and be causing the error?
Created 04-24-2017 03:44 PM
When I have seen this error, it was caused by a corrupt flowfile repository file, which in turn was caused by a partition filling to 100%. I have also seen it when the ZooKeeper servers were running embedded in NiFi and the systems could not respond within the timeout period.
As a test/workaround, try dropping the node out of the cluster and see if it can run standalone. If that does not work, stop NiFi, move the current data out of the flowfile, content, and provenance repositories, and restart NiFi; it should then join back into the cluster. A rough sketch of the repository move is below.
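Something along these lines, assuming the default repository locations under the NiFi install directory (the paths are placeholders; use whatever your nifi.properties sets). Run it only while NiFi is stopped, and keep in mind that moving the flowfile and content repositories discards any data queued on that node:

    import shutil
    from datetime import datetime
    from pathlib import Path

    # Placeholder repository locations (defaults under the NiFi install directory);
    # use the paths configured in nifi.properties. Run only while NiFi is stopped.
    repos = ["flowfile_repository", "content_repository", "provenance_repository"]
    backup_root = Path(f"repo_backup_{datetime.now():%Y%m%d_%H%M%S}")

    for name in repos:
        repo = Path(name)
        if repo.is_dir():
            backup_root.mkdir(exist_ok=True)
            # Move the whole repository aside; NiFi recreates empty repositories at startup,
            # but any FlowFiles queued on this node are lost.
            shutil.move(str(repo), str(backup_root / name))
            print(f"moved {repo} -> {backup_root / name}")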
Created 04-24-2017 08:50 PM
@Wynner, the node was able to rejoin the cluster after I moved the data out of the repositories and restarted it, like you said. Thanks for the help!
Created 10-27-2017 05:12 AM
Hi,
Have you solved this problem? I am running into the same issue.
Created 10-27-2017 11:47 AM
Hi @Xu Zhe
As mentioned in the comments by Wynner, the solution I used was to clean up the NiFi repositories and restart the cluster nodes at the same time.
Created 10-30-2017 05:12 AM
Thanks. Is your cluster stable?
Created 10-30-2017 06:36 PM
@Xu Zhe, yes, the cluster is fully stable now.
