
NiFi node doesn't join the cluster anymore

Explorer

Hello Everyone,

We have a NiFi production cluster of 4 nodes from HDF 3.1.2.0, using our HDP ZooKeeper. We had a major breakdown and had to restart the whole cluster, and since that restart one of the nodes has been unable to rejoin the cluster.

Here is the log I have on that node:

2019-08-06 17:56:32,385 ERROR [Curator-Framework-0] o.a.c.f.imps.CuratorFrameworkImpl Background retry gave up
org.apache.curator.CuratorConnectionLossException: KeeperErrorCode = ConnectionLoss
at org.apache.curator.framework.imps.CuratorFrameworkImpl.performBackgroundOperation(CuratorFrameworkImpl.java:838)
at org.apache.curator.framework.imps.CuratorFrameworkImpl.backgroundOperationsLoop(CuratorFrameworkImpl.java:809)
at org.apache.curator.framework.imps.CuratorFrameworkImpl.access$300(CuratorFrameworkImpl.java:64)
at org.apache.curator.framework.imps.CuratorFrameworkImpl$4.call(CuratorFrameworkImpl.java:267)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:180)
at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
2019-08-06 17:56:32,394 INFO [main-EventThread] o.a.c.f.state.ConnectionStateManager State change: LOST
2019-08-06 17:56:32,394 INFO [Curator-ConnectionStateManager-0] o.a.n.c.l.e.CuratorLeaderElectionManager org.apache.nifi.controller.leader.election.CuratorLeaderElectionManager$ElectionListener@1ef0f788 Connection State changed to LOST
2019-08-06 17:56:32,394 INFO [Curator-ConnectionStateManager-0] o.a.n.c.l.e.CuratorLeaderElectionManager org.apache.nifi.controller.leader.election.CuratorLeaderElectionManager$ElectionListener@10371a26 Connection State changed to LOST
2019-08-06 17:56:32,398 INFO [main-EventThread] o.a.c.f.state.ConnectionStateManager State change: RECONNECTED
2019-08-06 17:56:32,398 INFO [Curator-ConnectionStateManager-0] o.a.n.c.l.e.CuratorLeaderElectionManager org.apache.nifi.controller.leader.election.CuratorLeaderElectionManager$ElectionListener@1ef0f788 Connection State changed to RECONNECTED
2019-08-06 17:56:32,398 INFO [Curator-ConnectionStateManager-0] o.a.n.c.l.e.CuratorLeaderElectionManager org.apache.nifi.controller.leader.election.CuratorLeaderElectionManager$ElectionListener@10371a26 Connection State changed to RECONNECTED


When I try a manual connection with zkCli, I don't have any issue and can browse the znodes. I have restarted the whole cluster several times and even rebooted the server, but with no lasting success: the node stays connected for a couple of hours, then disconnects again with the same log.
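For reference, the manual check I mean is something along these lines (the hostname and port are examples; the root znode defaults to /nifi but is whatever nifi.zookeeper.root.node points to in nifi.properties):

# connect to one of the HDP ZooKeeper servers from the HDP client install
/usr/hdp/current/zookeeper-client/bin/zkCli.sh -server zk-host1:2181
# inside the shell, browse the NiFi root znode
ls /nifi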


Any idea would be welcome!


Best regards

1 ACCEPTED SOLUTION

Explorer

We ended up completely re-installing the failing node.


2 REPLIES

Explorer

OK, so after increasing all the timeout parameters and the cluster-protocol thread settings listed here (an example snippet follows the list):

nifi.zookeeper.connect.timeout
nifi.zookeeper.session.timeout
nifi.cluster.node.protocol.max.threads
nifi.cluster.node.protocol.threads
nifi.cluster.node.connection.timeout
nifi.cluster.node.read.timeout
nifi.cluster.protocol.heartbeat.interval
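In nifi.properties that looks something like this (the values below are purely illustrative, not a recommendation; tune them for your own environment and load):

nifi.zookeeper.connect.timeout=30 secs
nifi.zookeeper.session.timeout=30 secs
nifi.cluster.node.protocol.threads=20
nifi.cluster.node.protocol.max.threads=40
nifi.cluster.node.connection.timeout=30 secs
nifi.cluster.node.read.timeout=30 secs
nifi.cluster.protocol.heartbeat.interval=15 secs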

My node now starts and can join the cluster, but after a few minutes I get the following bulletin:

response time from was slow for each of the last 3 requests made

Then the node starts to behave badly and I need to stop it to stabilize the cluster.

It also seems that this node has FlowFile metadata that does not match the content repository, and I see many log entries like the following:

org.apache.nifi.processor.exception.FlowFileHandlingException: StandardFlowFileRecord[uuid=68f7abcf-fb70-4c8c-b0d1-4b6aaf64dc90,claim=,offset=0,name=91993057860031,size=0] is not known in this session (StandardProcessSession[id=336081])

Does someone know a way to purge all the FlowFiles while the node is offline?
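What I have in mind is something along these lines while NiFi is stopped on the node (a rough sketch; the paths below are the defaults from my nifi.properties and may differ on your install):

# stop NiFi on the failing node first, then clear the repositories;
# the actual paths come from nifi.properties
# (nifi.flowfile.repository.directory, nifi.content.repository.directory.default,
# nifi.provenance.repository.directory.default)
rm -rf /var/lib/nifi/flowfile_repository/*
rm -rf /var/lib/nifi/content_repository/*
rm -rf /var/lib/nifi/provenance_repository/*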

Explorer

We ended up completely re-installing the failing node.