
NiFi node doesn't join the cluster anymore

Explorer

Hello Everyone,

We have a NiFi production cluster of 4 nodes from HDF 3.1.2.0, using our HDP ZooKeeper. We had a major breakdown and had to restart the whole cluster, and since that restart one of the nodes has been unable to rejoin the cluster.

Here is the log I have on that node:

2019-08-06 17:56:32,385 ERROR [Curator-Framework-0] o.a.c.f.imps.CuratorFrameworkImpl Background retry gave up
org.apache.curator.CuratorConnectionLossException: KeeperErrorCode = ConnectionLoss
at org.apache.curator.framework.imps.CuratorFrameworkImpl.performBackgroundOperation(CuratorFrameworkImpl.java:838)
at org.apache.curator.framework.imps.CuratorFrameworkImpl.backgroundOperationsLoop(CuratorFrameworkImpl.java:809)
at org.apache.curator.framework.imps.CuratorFrameworkImpl.access$300(CuratorFrameworkImpl.java:64)
at org.apache.curator.framework.imps.CuratorFrameworkImpl$4.call(CuratorFrameworkImpl.java:267)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:180)
at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
2019-08-06 17:56:32,394 INFO [main-EventThread] o.a.c.f.state.ConnectionStateManager State change: LOST
2019-08-06 17:56:32,394 INFO [Curator-ConnectionStateManager-0] o.a.n.c.l.e.CuratorLeaderElectionManager org.apache.nifi.controller.leader.election.CuratorLeaderElectionManager$ElectionListener@1ef0f788 Connection State changed to LOST
2019-08-06 17:56:32,394 INFO [Curator-ConnectionStateManager-0] o.a.n.c.l.e.CuratorLeaderElectionManager org.apache.nifi.controller.leader.election.CuratorLeaderElectionManager$ElectionListener@10371a26 Connection State changed to LOST
2019-08-06 17:56:32,398 INFO [main-EventThread] o.a.c.f.state.ConnectionStateManager State change: RECONNECTED
2019-08-06 17:56:32,398 INFO [Curator-ConnectionStateManager-0] o.a.n.c.l.e.CuratorLeaderElectionManager org.apache.nifi.controller.leader.election.CuratorLeaderElectionManager$ElectionListener@1ef0f788 Connection State changed to RECONNECTED
2019-08-06 17:56:32,398 INFO [Curator-ConnectionStateManager-0] o.a.n.c.l.e.CuratorLeaderElectionManager org.apache.nifi.controller.leader.election.CuratorLeaderElectionManager$ElectionListener@10371a26 Connection State changed to RECONNECTED


When I try a manual connection with zkCli, I don't have any issue and can browse the znodes. I have restarted the whole cluster several times and even rebooted the server, but with no lasting success: the node stays connected for a couple of hours, then disconnects again with the same log.
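For reference, the manual check I mean is something along these lines (the hostname and port are examples; the root znode defaults to /nifi but is whatever nifi.zookeeper.root.node points to in nifi.properties):

# connect to one of the HDP ZooKeeper servers from the HDP client install
/usr/hdp/current/zookeeper-client/bin/zkCli.sh -server zk-host1:2181
# inside the shell, browse the NiFi root znode
ls /nifi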


Any idea would be welcome!


Best regards

1 ACCEPTED SOLUTION

Explorer

We ended up completely re-installing the failing node.


2 REPLIES

Explorer

OK, so after increasing all the timeout parameters and the cluster-protocol thread settings listed here (an example snippet follows the list):

nifi.zookeeper.connect.timeout
nifi.zookeeper.session.timeout
nifi.cluster.node.protocol.max.threads
nifi.cluster.node.protocol.threads
nifi.cluster.node.connection.timeout
nifi.cluster.node.read.timeout
nifi.cluster.protocol.heartbeat.interval
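In nifi.properties that looks something like this (the values below are purely illustrative, not a recommendation; tune them for your own environment and load):

nifi.zookeeper.connect.timeout=30 secs
nifi.zookeeper.session.timeout=30 secs
nifi.cluster.node.protocol.threads=20
nifi.cluster.node.protocol.max.threads=40
nifi.cluster.node.connection.timeout=30 secs
nifi.cluster.node.read.timeout=30 secs
nifi.cluster.protocol.heartbeat.interval=15 secs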

My node now starts and can join the cluster, but after a few minutes I get the following bulletin:

response time from was slow for each of the last 3 requests made

Then the node starts to behave badly and I need to stop it to stabilize the cluster.

It also seems that this node has FlowFile metadata that does not match the content repository, and I see many log entries like the following:

org.apache.nifi.processor.exception.FlowFileHandlingException: StandardFlowFileRecord[uuid=68f7abcf-fb70-4c8c-b0d1-4b6aaf64dc90,claim=,offset=0,name=91993057860031,size=0] is not known in this session (StandardProcessSession[id=336081])

Does someone know a way to purge all the FlowFiles while the node is offline?
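What I have in mind is something along these lines while NiFi is stopped on the node (a rough sketch; the paths below are the defaults from my nifi.properties and may differ on your install):

# stop NiFi on the failing node first, then clear the repositories;
# the actual paths come from nifi.properties
# (nifi.flowfile.repository.directory, nifi.content.repository.directory.default,
# nifi.provenance.repository.directory.default)
rm -rf /var/lib/nifi/flowfile_repository/*
rm -rf /var/lib/nifi/content_repository/*
rm -rf /var/lib/nifi/provenance_repository/*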

Explorer

We ended up completely re-installing the failing node.