
Nifi node doesn't join the cluster anymore

New Contributor

Hello Everyone,

We have a production NiFi cluster of 4 nodes running HDF 3.1.2.0, using our HDP ZooKeeper. We had a major outage and had to restart the whole cluster, and since the restart one of the nodes has been unable to rejoin the cluster.

Here is the log from that node:

2019-08-06 17:56:32,385 ERROR [Curator-Framework-0] o.a.c.f.imps.CuratorFrameworkImpl Background retry gave up
org.apache.curator.CuratorConnectionLossException: KeeperErrorCode = ConnectionLoss
at org.apache.curator.framework.imps.CuratorFrameworkImpl.performBackgroundOperation(CuratorFrameworkImpl.java:838)
at org.apache.curator.framework.imps.CuratorFrameworkImpl.backgroundOperationsLoop(CuratorFrameworkImpl.java:809)
at org.apache.curator.framework.imps.CuratorFrameworkImpl.access$300(CuratorFrameworkImpl.java:64)
at org.apache.curator.framework.imps.CuratorFrameworkImpl$4.call(CuratorFrameworkImpl.java:267)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:180)
at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
2019-08-06 17:56:32,394 INFO [main-EventThread] o.a.c.f.state.ConnectionStateManager State change: LOST
2019-08-06 17:56:32,394 INFO [Curator-ConnectionStateManager-0] o.a.n.c.l.e.CuratorLeaderElectionManager org.apache.nifi.controller.leader.election.CuratorLeaderElectionManager$ElectionListener@1ef0f788 Connection State changed to LOST
2019-08-06 17:56:32,394 INFO [Curator-ConnectionStateManager-0] o.a.n.c.l.e.CuratorLeaderElectionManager org.apache.nifi.controller.leader.election.CuratorLeaderElectionManager$ElectionListener@10371a26 Connection State changed to LOST
2019-08-06 17:56:32,398 INFO [main-EventThread] o.a.c.f.state.ConnectionStateManager State change: RECONNECTED
2019-08-06 17:56:32,398 INFO [Curator-ConnectionStateManager-0] o.a.n.c.l.e.CuratorLeaderElectionManager org.apache.nifi.controller.leader.election.CuratorLeaderElectionManager$ElectionListener@1ef0f788 Connection State changed to RECONNECTED
2019-08-06 17:56:32,398 INFO [Curator-ConnectionStateManager-0] o.a.n.c.l.e.CuratorLeaderElectionManager org.apache.nifi.controller.leader.election.CuratorLeaderElectionManager$ElectionListener@10371a26 Connection State changed to RECONNECTED


When I connect manually with zkCli I don't have any issue and can browse the znodes. I have tried restarting the whole cluster several times and even rebooted the server. So far the best result is that the node stays connected for a couple of hours, then it is disconnected again and logs the same errors.


Any ideas would be very welcome.
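
To see how often the node loses its ZooKeeper connection, the Curator state changes can be pulled out of the application log. This is a minimal Python sketch, assuming the log lines have the same layout as the excerpt above; the function names `state_changes` and `summarize` are made up for illustration.

```python
import re
from collections import Counter

# Matches lines such as:
# 2019-08-06 17:56:32,394 INFO [main-EventThread] o.a.c.f.state.ConnectionStateManager State change: LOST
STATE_CHANGE = re.compile(
    r"^(?P<ts>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2},\d{3}) .*"
    r"ConnectionStateManager State change: (?P<state>[A-Z_]+)"
)


def state_changes(lines):
    """Return (timestamp, state) pairs for every Curator state change line."""
    events = []
    for line in lines:
        match = STATE_CHANGE.match(line)
        if match:
            events.append((match.group("ts"), match.group("state")))
    return events


def summarize(lines):
    """Count how many times each connection state was reported."""
    return Counter(state for _, state in state_changes(lines))
```

Passing the lines of the node's application log (for example `open(path).read().splitlines()`, where `path` is wherever that log lives on your node) to `summarize` gives a quick count of LOST and RECONNECTED events, which can then be compared with the healthy nodes.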


Best regards

1 ACCEPTED SOLUTION

Re: Nifi node doesn't join the cluster anymore

New Contributor

We ended up resolving this by completely reinstalling the failing node.

2 REPLIES

Re: Nifi node doesn't join the cluster anymore

New Contributor

OK, so I increased all the timeout parameters and the thread counts for handling the cluster protocol, listed here:

nifi.zookeeper.connect.timeout
nifi.zookeeper.session.timeout
nifi.cluster.node.protocol.max.threads
nifi.cluster.node.protocol.threads
nifi.cluster.node.connection.timeout
nifi.cluster.node.read.timeout
nifi.cluster.protocol.heartbeat.interval
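
To compare these settings between the failing node and the healthy ones, the values can be read straight from each node's nifi.properties. This is a minimal Python sketch; the property names are the ones listed above, while the helper names `read_properties` and `tuning_report` are invented for illustration and the parser only handles simple `key=value` lines.

```python
# Property names copied from the list above.
CLUSTER_TUNING_KEYS = [
    "nifi.zookeeper.connect.timeout",
    "nifi.zookeeper.session.timeout",
    "nifi.cluster.node.protocol.max.threads",
    "nifi.cluster.node.protocol.threads",
    "nifi.cluster.node.connection.timeout",
    "nifi.cluster.node.read.timeout",
    "nifi.cluster.protocol.heartbeat.interval",
]


def read_properties(text):
    """Parse simple key=value lines, skipping blank lines and comments."""
    props = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith(("#", "!")):
            continue
        key, sep, value = line.partition("=")
        if sep:
            props[key.strip()] = value.strip()
    return props


def tuning_report(text):
    """Return the listed cluster settings, marking any that are absent."""
    props = read_properties(text)
    return {key: props.get(key, "<not set>") for key in CLUSTER_TUNING_KEYS}
```

Running `tuning_report` on the contents of nifi.properties from each node and comparing the resulting dictionaries shows quickly whether one node ended up with different values.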

My node now starts and can join the cluster, but after a few minutes I get the following bulletin:

response time from was slow for each of the last 3 requests made

Then the node starts to misbehave and I need to stop it to stabilize the cluster.

It also seems that this node has FlowFile metadata that does not match the content repository, and I see many log entries like the following:

org.apache.nifi.processor.exception.FlowFileHandlingException: StandardFlowFileRecord[uuid=68f7abcf-fb70-4c8c-b0d1-4b6aaf64dc90,claim=,offset=0,name=91993057860031,size=0] is not known in this session (StandardProcessSession[id=336081])

Does anyone know a way to purge all the FlowFiles while the node is offline?
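
One approach that is often suggested, and which was not confirmed in this thread, is to stop NiFi on the node and clear its FlowFile and content repository directories so the node starts with empty queues; this discards every FlowFile queued on that node. The Python sketch below only locates those directories and deletes nothing. It assumes the standard `nifi.flowfile.repository.directory` and `nifi.content.repository.directory.*` properties, assumes relative paths are relative to the NiFi installation directory, and uses a hypothetical helper name `repository_dirs`.

```python
from pathlib import Path


def repository_dirs(properties_text, nifi_home):
    """List the FlowFile and content repository directories of one node.

    Nothing is deleted here; the result is only meant for review.
    """
    dirs = []
    for line in properties_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        key, sep, value = line.partition("=")
        key, value = key.strip(), value.strip()
        if not sep or not value:
            continue
        is_flowfile_repo = key == "nifi.flowfile.repository.directory"
        is_content_repo = key.startswith("nifi.content.repository.directory.")
        if is_flowfile_repo or is_content_repo:
            path = Path(value)
            # Assumption: relative paths are resolved against the NiFi home.
            dirs.append(path if path.is_absolute() else Path(nifi_home) / path)
    return dirs
```

Review the returned paths by hand before removing anything, and only do so while NiFi is stopped on that node.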
