Support Questions


Zombie Nodes: Disconnected State but Can't Delete

Explorer

PRELUDE: the zombie-node state arises during testing, where we are purposefully knocking over a node in bad ways to test our scalability and resiliency. Although these are test cases, the same thing could happen in the real world, and we want to understand it and be able to accommodate it.

We are running NiFi in a containerized cluster. If a node falls over gracefully, it will allow itself to be disconnected / deleted from the cluster. If a node does not fall over gracefully and is simply terminated, the cluster heartbeat monitoring detects the missing node and flags it as disconnected.
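For anyone following along, checking node state from the outside looks roughly like this. A minimal sketch against the REST API, assuming an unsecured cluster at a placeholder URL (a secured cluster would also need an auth token):

import requests

NIFI_API = "http://localhost:8080/nifi-api"  # placeholder: any live node / the coordinator

# The cluster endpoint lists every node the coordinator knows about,
# including ones flagged DISCONNECTED after missed heartbeats.
cluster = requests.get(f"{NIFI_API}/controller/cluster", timeout=10).json()

for node in cluster["cluster"]["nodes"]:
    print(node["nodeId"], node["address"], node["status"])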

 

In this case, neither a UI delete nor a toolkit/API delete will remove the node; the request fails with

java.net.SocketTimeoutException: timeout

Looking through the logs, when the delete call comes in it appears that the coordinator is trying to announce the delete to all the nodes, including the zombie node, and that is what is causing the timeout.
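For reference, the API delete we are issuing is essentially the following (placeholder URL and node ID; the node ID comes from the /controller/cluster listing, and a secured cluster would also need auth headers). This is the call that comes back with the timeout:

import requests

NIFI_API = "http://localhost:8080/nifi-api"       # placeholder
ZOMBIE_NODE_ID = "REPLACE-WITH-NODE-UUID"         # nodeId from /controller/cluster

# Ask the coordinator to drop the disconnected node. In our case this hangs
# while the coordinator apparently tries to notify the dead node, then fails
# with java.net.SocketTimeoutException: timeout.
resp = requests.delete(f"{NIFI_API}/controller/cluster/nodes/{ZOMBIE_NODE_ID}", timeout=60)
print(resp.status_code, resp.text)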

 

This could be because the cluster thinks this node has running jobs/data and is trying to do the right thing by protecting us from ourselves. However, two things:

  1. We don't even have running flows in this scenario. We've stood up the cluster, made some users, hooked it up to the Registry, and added an empty, non-running process group.
  2. Unfortunately, even if that node did have some lingering jobs/data on it, those are lost. The node is gone, there is no way to offload or otherwise drain it (the normal disconnect/offload/delete sequence is sketched just below), and really that node just needs to be deleted from the cluster.
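For completeness, the decommission path we would normally use on a healthy node is sketched here, under the same placeholder URL/ID assumptions as above and with the status strings as I understand the cluster-nodes endpoint expects them. With the node hard-down, the offload step can never finish, which is why we only care about forcing the delete:

import requests

NIFI_API = "http://localhost:8080/nifi-api"   # placeholder
NODE_ID = "REPLACE-WITH-NODE-UUID"

def set_node_status(status):
    # PUT a NodeEntity asking the coordinator to move the node into a new state.
    body = {"node": {"nodeId": NODE_ID, "status": status}}
    return requests.put(f"{NIFI_API}/controller/cluster/nodes/{NODE_ID}", json=body, timeout=60)

# Graceful decommission: disconnect, offload queued flowfiles to the rest of
# the cluster, then issue the same DELETE shown earlier. A node that was
# killed outright never completes OFFLOADING, so this path is closed to us.
set_node_status("DISCONNECTING")
set_node_status("OFFLOADING")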

An additional observation: if the UI is left open during this state, the UI itself (even when idle) will often raise the same timeout message.

 

Occasionally, if we bounce the cluster coordinator, we are then able to remove the zombie node. But this isn't always the case, and more often we end up in a state where the whole cluster needs to be bounced to rectify it. We would like to avoid needing to bounce the whole cluster to get out of this state.

 

Best-case solution: we are just doing something wonky, someone can call us out on it and point out the correction, and we can stop getting into the zombie/jammed state.

 

Worst-case solution: as stated in 2 above, the node is toast and gone; there would be no data recovery. If there is a way to force-remove a node in a 'last hope' scenario, that would be better than the state the cluster is left in with the zombie node.

 

4 REPLIES

Explorer

Update for those playing along at home: what appears to be happening is that when a zombie node is disconnected via a heartbeat failure and can't be deleted, the cluster coordinator, for whatever reason, is still trying to talk to it about the delete. I've tricked it into working by adding the zombie's hostname to the coordinator's /etc/hosts and pointing it at itself (the coordinator). It's clear this is not the desired/anticipated path, but it's unclear why it's happening and whether it's something in how we've configured things. Cluster joins are done purely via NiFi internals (we don't do anything different); the only thing we are doing here is forcibly knocking over the node so it goes unhealthy in the cluster and is flagged as disconnected.
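Concretely, the trick is just a hosts entry on the coordinator container so the zombie's hostname resolves back to the coordinator (the hostname below is made up):

# /etc/hosts on the cluster coordinator container
# nifi-node-3.cluster.local is a placeholder for the dead node's hostname
127.0.0.1   nifi-node-3.cluster.local

With that in place the delete goes through, presumably because the coordinator's announcement is no longer waiting on an unreachable host.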

New Contributor

Hi,

 

I'm also running the same NiFi-cluster-on-ECS configuration and facing the zombie node issue: nodes are disconnected but I'm not able to delete them. Did you find any solution?

Super Mentor

@cgmckeever 

It would be helpful if you shared the specific NiFi version you are testing with and the environment on which you are testing it.

Thanks,

Matt

Explorer

Dockerized nifi:1.13.2, testing in AWS ECS