Support Questions

p_vigreux · 08-07-2019

Hello Everyone,

We have a production NiFi cluster of 4 nodes running HDF 3.1.2.0, using our HDP ZooKeeper. We had a major outage and had to restart the whole cluster, and since the restart one of the nodes has been unable to rejoin the cluster.

Here is the log from that node:

2019-08-06 17:56:32,385 ERROR [Curator-Framework-0] o.a.c.f.imps.CuratorFrameworkImpl Background retry gave up
org.apache.curator.CuratorConnectionLossException: KeeperErrorCode = ConnectionLoss
    at org.apache.curator.framework.imps.CuratorFrameworkImpl.performBackgroundOperation(CuratorFrameworkImpl.java:838)
    at org.apache.curator.framework.imps.CuratorFrameworkImpl.backgroundOperationsLoop(CuratorFrameworkImpl.java:809)
    at org.apache.curator.framework.imps.CuratorFrameworkImpl.access$300(CuratorFrameworkImpl.java:64)
    at org.apache.curator.framework.imps.CuratorFrameworkImpl$4.call(CuratorFrameworkImpl.java:267)
    at java.util.concurrent.FutureTask.run(FutureTask.java:266)
    at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:180)
    at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
    at java.lang.Thread.run(Thread.java:748)
2019-08-06 17:56:32,394 INFO [main-EventThread] o.a.c.f.state.ConnectionStateManager State change: LOST
2019-08-06 17:56:32,394 INFO [Curator-ConnectionStateManager-0] o.a.n.c.l.e.CuratorLeaderElectionManager org.apache.nifi.controller.leader.election.CuratorLeaderElectionManager$ElectionListener@1ef0f788 Connection State changed to LOST
2019-08-06 17:56:32,394 INFO [Curator-ConnectionStateManager-0] o.a.n.c.l.e.CuratorLeaderElectionManager org.apache.nifi.controller.leader.election.CuratorLeaderElectionManager$ElectionListener@10371a26 Connection State changed to LOST
2019-08-06 17:56:32,398 INFO [main-EventThread] o.a.c.f.state.ConnectionStateManager State change: RECONNECTED
2019-08-06 17:56:32,398 INFO [Curator-ConnectionStateManager-0] o.a.n.c.l.e.CuratorLeaderElectionManager org.apache.nifi.controller.leader.election.CuratorLeaderElectionManager$ElectionListener@1ef0f788 Connection State changed to RECONNECTED
2019-08-06 17:56:32,398 INFO [Curator-ConnectionStateManager-0] o.a.n.c.l.e.CuratorLeaderElectionManager org.apache.nifi.controller.leader.election.CuratorLeaderElectionManager$ElectionListener@10371a26 Connection State changed to RECONNECTED

When I connect manually with zkCli I have no issues and can browse the znodes. I have restarted the whole cluster several times and even rebooted the server. So far nothing has worked: at best the node stays connected for a couple of hours, then it is disconnected again and logs the same errors.
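To quantify how often the node loses its ZooKeeper session, the Curator "State change" lines can be pulled out of the log and each LOST paired with the next RECONNECTED. Below is a minimal sketch in Python, assuming the log format shown above. The function names are made up for this example, and the two sample lines are copied from the log in this post.

```python
import re
from datetime import datetime

# Matches Curator lines such as:
# 2019-08-06 17:56:32,394 INFO [main-EventThread] o.a.c.f.state.ConnectionStateManager State change: LOST
STATE_CHANGE = re.compile(
    r"^(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2},\d{3}) \w+ \[[^\]]+\] "
    r"\S*ConnectionStateManager State change: (\w+)"
)


def state_changes(lines):
    """Return (timestamp, state) tuples for every Curator state-change line."""
    events = []
    for line in lines:
        match = STATE_CHANGE.match(line)
        if match:
            stamp = datetime.strptime(match.group(1), "%Y-%m-%d %H:%M:%S,%f")
            events.append((stamp, match.group(2)))
    return events


def outages(events):
    """Pair each LOST with the next RECONNECTED; return the gaps in seconds."""
    gaps = []
    lost_at = None
    for stamp, state in events:
        if state == "LOST":
            lost_at = stamp
        elif state == "RECONNECTED" and lost_at is not None:
            gaps.append((stamp - lost_at).total_seconds())
            lost_at = None
    return gaps


# Two lines taken from the log in this post.
sample = [
    "2019-08-06 17:56:32,394 INFO [main-EventThread] "
    "o.a.c.f.state.ConnectionStateManager State change: LOST",
    "2019-08-06 17:56:32,398 INFO [main-EventThread] "
    "o.a.c.f.state.ConnectionStateManager State change: RECONNECTED",
]
events = state_changes(sample)
print([state for _, state in events])
print(outages(events))
```

Against a real log, the same functions could be fed the lines of the node's application log to see whether the LOST events cluster around particular times.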

Any ideas would be much appreciated.

Best regards

p_vigreux · 08-09-2019

OK, so I increased all of the timeout and thread-count parameters for the cluster protocol listed here:

nifi.zookeeper.connect.timeout
nifi.zookeeper.session.timeout
nifi.cluster.node.protocol.max.threads
nifi.cluster.node.protocol.threads
nifi.cluster.node.connection.timeout
nifi.cluster.node.read.timeout
nifi.cluster.protocol.heartbeat.interval
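When tuning these, it can help to confirm that the failing node really carries the same values as a healthy one. Below is a minimal sketch in Python that extracts the seven properties above from the text of a nifi.properties file and reports any that differ between two nodes. The helper names are hypothetical, and the values in the sample are purely illustrative, not recommendations and not the values used in this thread.

```python
# The seven properties listed above.
CLUSTER_KEYS = [
    "nifi.zookeeper.connect.timeout",
    "nifi.zookeeper.session.timeout",
    "nifi.cluster.node.protocol.max.threads",
    "nifi.cluster.node.protocol.threads",
    "nifi.cluster.node.connection.timeout",
    "nifi.cluster.node.read.timeout",
    "nifi.cluster.protocol.heartbeat.interval",
]


def read_properties(text):
    """Parse simple key=value lines, skipping blanks and comments."""
    props = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith(("#", "!")) or "=" not in line:
            continue
        key, _, value = line.partition("=")
        props[key.strip()] = value.strip()
    return props


def cluster_settings(text):
    """Return only the cluster/ZooKeeper settings (None when a key is absent)."""
    props = read_properties(text)
    return {key: props.get(key) for key in CLUSTER_KEYS}


def differing(node_a, node_b):
    """Return {key: (value_a, value_b)} for settings that differ."""
    return {
        key: (node_a[key], node_b[key])
        for key in CLUSTER_KEYS
        if node_a[key] != node_b[key]
    }


# Illustrative values only (assumptions made for this example).
healthy = cluster_settings(
    "# excerpt of nifi.properties\n"
    "nifi.zookeeper.connect.timeout=30 secs\n"
    "nifi.zookeeper.session.timeout=30 secs\n"
    "nifi.cluster.node.read.timeout=30 secs\n"
)
failing = cluster_settings(
    "nifi.zookeeper.connect.timeout=30 secs\n"
    "nifi.zookeeper.session.timeout=3 secs\n"
    "nifi.cluster.node.read.timeout=30 secs\n"
)
print(differing(healthy, failing))
```

In practice the two strings would come from reading each node's conf/nifi.properties.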

My node now starts and can join the cluster, but after a few minutes I get the following bulletin:

response time from was slow for each of the last 3 requests made

The node then starts to misbehave, and I have to stop it to stabilize the cluster.

It also seems that this node has FlowFile metadata that does not match the content repository, and I get many log entries like the following:

org.apache.nifi.processor.exception.FlowFileHandlingException: StandardFlowFileRecord[uuid=68f7abcf-fb70-4c8c-b0d1-4b6aaf64dc90,claim=,offset=0,name=91993057860031,size=0] is not known in this session (StandardProcessSession[id=336081])

Does anyone know a way to purge all the FlowFiles while the node is offline?
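One commonly suggested approach, which this thread does not confirm, is to stop NiFi on the node and empty its FlowFile and content repositories, which discards every FlowFile queued on that node. The sketch below, in Python, only illustrates the idea. It assumes the repository locations come from the nifi.flowfile.repository.directory and nifi.content.repository.directory.default entries in nifi.properties and that relative paths are relative to the NiFi home directory. The helper names are hypothetical, it defaults to a dry run, and the demo runs against a throwaway temporary directory rather than a real installation.

```python
import shutil
import tempfile
from pathlib import Path

REPO_KEYS = [
    "nifi.flowfile.repository.directory",
    "nifi.content.repository.directory.default",
]


def repository_dirs(properties_text, nifi_home):
    """Resolve the repository directories named in nifi.properties."""
    props = {}
    for line in properties_text.splitlines():
        key, sep, value = line.strip().partition("=")
        if sep and not key.startswith("#"):
            props[key.strip()] = value.strip()
    # Assumption: relative paths are resolved against the NiFi home directory.
    return [Path(nifi_home, props[key]) for key in REPO_KEYS if key in props]


def purge(dirs, dry_run=True):
    """List (and, if dry_run is False, delete) everything inside each directory.

    Only run with dry_run=False while NiFi is stopped on the node.
    """
    affected = []
    for repo in dirs:
        if not repo.is_dir():
            continue
        for entry in sorted(repo.iterdir()):
            affected.append(entry)
            if dry_run:
                continue
            if entry.is_dir() and not entry.is_symlink():
                shutil.rmtree(entry)
            else:
                entry.unlink()
    return affected


# Demo against a temporary fake NiFi home, never a real installation.
home = Path(tempfile.mkdtemp())
(home / "flowfile_repository" / "journals").mkdir(parents=True)
(home / "content_repository").mkdir()
(home / "content_repository" / "claim.bin").write_bytes(b"x")

text = (
    "nifi.flowfile.repository.directory=./flowfile_repository\n"
    "nifi.content.repository.directory.default=./content_repository\n"
)
dirs = repository_dirs(text, home)
planned = purge(dirs)                 # dry run: nothing is deleted
removed = purge(dirs, dry_run=False)  # actually empties both directories
print(len(planned), len(removed), len(purge(dirs)))
shutil.rmtree(home)
```

Running the dry run first and reviewing the list is the point of the design: the deletion cannot be undone, so the data on that node must be expendable or already recovered elsewhere.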

p_vigreux · 09-12-2019

We ended up resolving this by completely reinstalling the failing node.

Cloudera Community

Nifi node doesn't join the cluster anymore
