PRELUDE: The zombie node state arises during testing, where we purposefully knock over a node in bad ways to test our scalability and resiliency. Although these are test cases, the same thing could happen in a real-world situation that we want to understand and be able to accommodate.
We are running NiFi in a containerized cluster. If a node falls over gracefully, it allows itself to be disconnected and deleted from the cluster. If a node does not fall over gracefully and is terminated, the cluster heartbeat detects the missing node and flags it as disconnected.
In this case, neither a UI delete nor a toolkit/API delete will remove the node; both return an error.
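For anyone trying to reproduce this, the API delete can be sketched roughly as follows. This is a minimal sketch, assuming an unsecured NiFi reachable at a hypothetical base URL (`nifi-coordinator:8080` is a placeholder, not our real host) and using the `GET /controller/cluster` and `DELETE /controller/cluster/nodes/{id}` REST endpoints. A secured cluster would also need authentication, which is left out here.

```python
import json
import urllib.request

# Hypothetical coordinator address; replace with your own.
BASE_URL = "http://nifi-coordinator:8080/nifi-api"


def disconnected_node_ids(cluster_entity):
    """Return the IDs of nodes the cluster reports as DISCONNECTED."""
    nodes = cluster_entity.get("cluster", {}).get("nodes", [])
    return [n["nodeId"] for n in nodes if n.get("status") == "DISCONNECTED"]


def fetch_cluster(base_url=BASE_URL, timeout=30):
    """Fetch the cluster summary (GET /controller/cluster)."""
    with urllib.request.urlopen(f"{base_url}/controller/cluster", timeout=timeout) as resp:
        return json.load(resp)


def delete_node(node_id, base_url=BASE_URL, timeout=30):
    """Ask the cluster to remove a node (DELETE /controller/cluster/nodes/{id})."""
    req = urllib.request.Request(
        f"{base_url}/controller/cluster/nodes/{node_id}", method="DELETE"
    )
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        return resp.status


def main():
    # Try to delete every node currently flagged as disconnected.
    for node_id in disconnected_node_ids(fetch_cluster()):
        print(node_id, delete_node(node_id))
```

Calling `main()` against a cluster in the zombie state is where we would expect the delete to fail rather than return a success status.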
Looking through the logs, when the delete call comes in, it almost seems like the coordinator is trying to announce the delete to all the nodes, including the zombie node, and that is what causes the timeout.
This could be because the cluster thinks this node still has running job data and is trying to do the right thing by protecting us from ourselves. However, two things:
An additional observation: if the UI is left running during this state, the UI itself (when left idle) will often raise the same timeout message.
Occasionally, if we bounce the cluster coordinator, we are able to then disconnect the zombie node. But this isn't always the case, and more often gets into a state where the whole cluster needs to bounce in order to rectify. We would like to try to avoid needing to bounce the whole cluster to get out of the state.
Best case solution: we are just doing something wonky, someone can call us out on it and point us to the correction, and we can stop getting into the zombie/jammed state.
Worst case solution: As stated in 2 above, the node is toast and gone. There would be no data recovery. If there is a way to force-remove a node in a 'last hope' scenario, this would be better than the state we are left in with the zombie node.
Update for those playing along at home: What appears to be happening is that when a zombie node is disconnected via a heartbeat failure and can't be deleted, the cluster coordinator, for whatever reason, is still trying to talk to it about the delete. I've tricked it into working by adding the zombie's hostname to the coordinator's /etc/hosts and pointing it at the coordinator itself. It's clear that this is not the desired or anticipated path, but it's unclear why this is happening and whether it is something in how we've configured this. Cluster joins are just done via NiFi internals (we don't do anything different). The only thing we are doing here is forcibly knocking over the node so it goes unhealthy in the cluster and is flagged as disconnected.
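To make the workaround concrete, here is a small sketch of the hosts entry involved, plus a quick check of whether a name resolves from the coordinator. The hostname and IP used in the usage note are hypothetical placeholders, and the helper names are only for illustration. This is a stopgap, not a fix.

```python
import socket


def hosts_entry(coordinator_ip, zombie_hostname):
    """Build the /etc/hosts line that points the zombie's name at the coordinator."""
    return f"{coordinator_ip}\t{zombie_hostname}"


def resolves(hostname):
    """Return True if the name (or IP literal) resolves from this machine."""
    try:
        socket.getaddrinfo(hostname, None)
        return True
    except socket.gaierror:
        return False
```

For example, `hosts_entry("10.0.0.5", "nifi-node-2")` produces the line to append to the coordinator's /etc/hosts, and `resolves("nifi-node-2")` can be run on the coordinator before and after to confirm the name now resolves.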
I'm also trying the same NiFi cluster configuration on ECS and am facing the zombie node issue: nodes are disconnected, but I'm not able to delete them. Did you find any solution?