Created 08-07-2019 01:42 PM
Hello Everyone,
We have a NiFi production cluster made of 4 nodes from HDF 3.1.2.0, using our HDP ZooKeeper. We had a major breakdown and had to restart the whole cluster, and since the restart one of the nodes is unable to join the cluster again.
Here is the log I have on it:
2019-08-06 17:56:32,385 ERROR [Curator-Framework-0] o.a.c.f.imps.CuratorFrameworkImpl Background retry gave up
org.apache.curator.CuratorConnectionLossException: KeeperErrorCode = ConnectionLoss
at org.apache.curator.framework.imps.CuratorFrameworkImpl.performBackgroundOperation(CuratorFrameworkImpl.java:838)
at org.apache.curator.framework.imps.CuratorFrameworkImpl.backgroundOperationsLoop(CuratorFrameworkImpl.java:809)
at org.apache.curator.framework.imps.CuratorFrameworkImpl.access$300(CuratorFrameworkImpl.java:64)
at org.apache.curator.framework.imps.CuratorFrameworkImpl$4.call(CuratorFrameworkImpl.java:267)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:180)
at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
2019-08-06 17:56:32,394 INFO [main-EventThread] o.a.c.f.state.ConnectionStateManager State change: LOST
2019-08-06 17:56:32,394 INFO [Curator-ConnectionStateManager-0] o.a.n.c.l.e.CuratorLeaderElectionManager org.apache.nifi.controller.leader.election.CuratorLeaderElectionManager$ElectionListener@1ef0f788 Connection State changed to LOST
2019-08-06 17:56:32,394 INFO [Curator-ConnectionStateManager-0] o.a.n.c.l.e.CuratorLeaderElectionManager org.apache.nifi.controller.leader.election.CuratorLeaderElectionManager$ElectionListener@10371a26 Connection State changed to LOST
2019-08-06 17:56:32,398 INFO [main-EventThread] o.a.c.f.state.ConnectionStateManager State change: RECONNECTED
2019-08-06 17:56:32,398 INFO [Curator-ConnectionStateManager-0] o.a.n.c.l.e.CuratorLeaderElectionManager org.apache.nifi.controller.leader.election.CuratorLeaderElectionManager$ElectionListener@1ef0f788 Connection State changed to RECONNECTED
2019-08-06 17:56:32,398 INFO [Curator-ConnectionStateManager-0] o.a.n.c.l.e.CuratorLeaderElectionManager org.apache.nifi.controller.leader.election.CuratorLeaderElectionManager$ElectionListener@10371a26 Connection State changed to RECONNECTED
When trying a manual connection with zkCli I don't have any issue and can browse the znodes. I have tried restarting the whole cluster several times, and even rebooted the server, but so far no success: the node stays connected for a couple of hours and then gets disconnected again with the same log.
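For reference, the manual check I do from the failing node is simply the following (the ZooKeeper host and the /nifi root znode are from my setup; adjust to whatever is in your nifi.zookeeper.connect.string and nifi.zookeeper.root.node):
# connect to one of the HDP ZooKeeper servers from the failing NiFi node
/usr/hdp/current/zookeeper-client/bin/zkCli.sh -server zk-host-1:2181
# once connected, browse the NiFi znodes (default root is /nifi)
ls /nifi
ls /nifi/leaders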
Any idea would be wonderful!
Best regards
Created 08-09-2019 12:54 PM
Ok, so after increasing all the timeout parameters and the number of threads handling the cluster protocol, listed here:
nifi.zookeeper.connect.timeout
nifi.zookeeper.session.timeout
nifi.cluster.node.protocol.max.threads
nifi.cluster.node.protocol.threads
nifi.cluster.node.connection.timeout
nifi.cluster.node.read.timeout
nifi.cluster.protocol.heartbeat.interval
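For what it's worth, here is roughly what that looks like in my nifi.properties now (the exact values are just what I tried on my cluster, not recommendations; the defaults are only a few seconds for the timeouts):
# ZooKeeper client timeouts
nifi.zookeeper.connect.timeout=30 secs
nifi.zookeeper.session.timeout=30 secs
# threads handling the cluster protocol
nifi.cluster.node.protocol.threads=20
nifi.cluster.node.protocol.max.threads=100
# node-to-node socket timeouts and heartbeat interval
nifi.cluster.node.connection.timeout=30 secs
nifi.cluster.node.read.timeout=30 secs
nifi.cluster.protocol.heartbeat.interval=30 secs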
My node now starts and can join the cluster, but after a few minutes I get the following bulletin:
response time from was slow for each of the last 3 requests made
And the node starts to behave badly and I need to stop it to stabilize the cluster.
Also, it seems that this node has FlowFile metadata which does not match the content repository, and I get many logs like the following:
org.apache.nifi.processor.exception.FlowFileHandlingException: StandardFlowFileRecord[uuid=68f7abcf-fb70-4c8c-b0d1-4b6aaf64dc90,claim=,offset=0,name=91993057860031,size=0] is not known in this session (StandardProcessSession[id=336081 ])
Does someone know a solution to purge all the FlowFiles while the node is offline?
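What I had in mind is something like this on the stopped node (assuming the default repository directory names; check nifi.flowfile.repository.directory, nifi.content.repository.directory.default and nifi.provenance.repository.directory.default in nifi.properties before touching anything, and I have not validated this on a production node):
# stop NiFi on the failing node first
cd /var/lib/nifi            # data dir on my HDF install, adjust to yours
# move the repositories aside instead of deleting them outright
mv flowfile_repository flowfile_repository.bak
mv content_repository content_repository.bak
mv provenance_repository provenance_repository.bak
# restart NiFi: it should recreate empty repositories and pull the flow from the cluster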
Created 09-12-2019 12:34 AM
We end this by completely re-installing the failing node.