<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>question Zombie Nodes; Disconnected State but can't delete in Support Questions</title>
    <link>https://community.cloudera.com/t5/Support-Questions/Zombie-Nodes-Disconnected-State-but-can-t-delete/m-p/315146#M226362</link>
    <description>&lt;P&gt;&lt;STRONG&gt;PRELUDE:&lt;/STRONG&gt;&amp;nbsp;the zombie-node state arises during testing, where we purposely knock a node over in bad ways to test our scalability and resiliency. Although these are test cases, the same situation could occur in the real world, and we want to understand and be able to accommodate it.&lt;BR /&gt;&lt;BR /&gt;We are running NiFi in a containerized cluster. If a node falls over gracefully, it allows itself to be disconnected/deleted from the cluster. If a node does not fall over gracefully and is terminated, the cluster heartbeat detects the missing node and flags it as disconnected.&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;In this case, neither a UI delete nor a toolkit/API delete will remove the node; both return an error:&lt;/P&gt;&lt;LI-CODE lang="markup"&gt;java.net.SocketTimeoutException: timeout&lt;/LI-CODE&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;Looking through the logs, when the delete call comes in, it appears the coordinator tries to announce the delete to all nodes, including the zombie node, and that is what causes the timeout.&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;This could be the cluster thinking the node still has running job data and &lt;EM&gt;trying&lt;/EM&gt;&amp;nbsp;to do the right thing by protecting us from ourselves. However, two things:&lt;/P&gt;&lt;OL&gt;&lt;LI&gt;We don't even have running flows in this scenario. We've stood up the cluster, created some users, hooked it to the registry, and added an empty, non-running process group.&lt;/LI&gt;&lt;LI&gt;&lt;SPAN&gt;Unfortunately, even if that node did have lingering jobs/data, those are lost. The node is gone, there is no way to offload it, and that node simply needs to be deleted from the cluster.&lt;/SPAN&gt;&lt;/LI&gt;&lt;/OL&gt;&lt;P&gt;An additional observation: if the UI is left running (even idle) in this state, it will often raise the same timeout message.&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;Occasionally, if we bounce the cluster coordinator, we can then disconnect the zombie node. But this isn't always the case; more often the whole cluster ends up needing a bounce to rectify the state, which we would like to avoid.&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;&lt;STRONG&gt;Best-case solution:&lt;/STRONG&gt; we are simply doing something wonky, someone can call us out on it with a correction, and we can stop getting into the zombie/jammed state.&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;&lt;STRONG&gt;Worst-case solution:&lt;/STRONG&gt; as stated in 2 above, the node is toast and gone; there would be no data recovery. If there is a way to force-remove a node in a 'last hope' scenario, that would be better than the state the cluster is left in with the zombie node.&lt;/P&gt;</description>
    <pubDate>Fri, 23 Apr 2021 12:27:39 GMT</pubDate>
    <dc:creator>cgmckeever</dc:creator>
    <dc:date>2021-04-23T12:27:39Z</dc:date>
  </channel>
</rss>