NiFi balancing causes loss of data

Hi,

I was changing the configuration of a connection from single node to round robin load balancing when two nodes went down.

Then I tried to offload the nodes from the NiFi UI, but it seemed to have no effect. After that I tried to decommission a node, but it gave me this error after 30 iterations:

Iteration: 27
Successfully executed get-node command
Node 898f9 still not offloaded
Retry after 10 sec
Iteration: 28
Successfully executed get-node command
Node 898f9 still not offloaded
Retry after 10 sec
Iteration: 29
Successfully executed get-node command
Node 898f9 still not offloaded
Retry after 10 sec
ERROR: Nifi node offload with id 898f9, failed!

Does anybody know how to offload a node in this case?

1 REPLY

@pandav 
You cannot offload a NiFi node that is down. Can you clarify what you mean by "down"? Was the NiFi service not running on the nodes you attempted to offload?

The offload option from the cluster UI sends a request to the disconnected (not down) node to offload its queued FlowFiles to nodes still connected to the cluster.

If your nodes are down, you'll need to start the service on those nodes again. On startup (assuming no issues), these nodes will rejoin your cluster. If you plan to decommission a node later, you can use the NiFi cluster UI to manually disconnect the node and then offload that node's FlowFiles. Once the FlowFiles have been successfully offloaded, the node can be deleted from the cluster using the NiFi cluster UI.
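The order of operations above can be summarized in a small sketch. This is only an illustration, not a NiFi API: the function name is hypothetical, and the status strings are assumed to match what the cluster UI reports for a node.

```python
def next_decommission_step(service_running: bool, status: str) -> str:
    """Return the next manual step for a node, following the sequence described above.

    Hypothetical helper: status strings are assumed to match those shown
    in the NiFi cluster UI (CONNECTED, DISCONNECTED, OFFLOADING, OFFLOADED).
    """
    # A node whose NiFi service is stopped cannot act on an offload request.
    if not service_running:
        return "start NiFi on the node and let it rejoin the cluster"

    steps = {
        "CONNECTED": "disconnect the node",
        "DISCONNECTED": "offload the node",
        "OFFLOADING": "wait for the offload to finish",
        "OFFLOADED": "delete the node from the cluster",
    }
    # Any other status needs a closer look before continuing.
    return steps.get(status.upper(), "check the node's status in the cluster UI")
```

For the situation in the question (service stopped, node shown as disconnected), `next_decommission_step(False, "DISCONNECTED")` points back to starting the service first, which is why the offload request never completed.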

Note: restarting a node that has been dropped/deleted from the cluster will cause that node to start heartbeating to the cluster again and thus reconnect, unless you edit the node's configuration so that it no longer uses the same ZooKeeper znode as the current cluster (the nifi.zookeeper.root.node property in the nifi.properties file). See https://nifi.apache.org/docs/nifi-docs/html/administration-guide.html#basic-cluster-setup

 

As for your nodes going down on a configuration change, you'll want to inspect the NiFi logs for any exceptions or timeouts that may have occurred. Network issues, long Garbage Collection (GC) pauses, and resource congestion/exhaustion can lead to nodes not responding to or not receiving the replicated change request. As a result, a node can get disconnected. In scenarios like this, if you are using the latest Apache NiFi release, those nodes should automatically reconnect. Upon reconnect, if the node's flow does not match the cluster flow, the node will automatically take the cluster's flow and join. In older releases, a flow mismatch between the connecting node and the cluster flow would require manual intervention (copying the flow.xml.gz from a node still in the cluster to the node not connecting).
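As a starting point for that log inspection, a minimal sketch along these lines may help narrow things down. The keyword list and the function name are assumptions for illustration, so adjust them to what your logs actually contain.

```python
import re

# Hypothetical keyword list: matches lines mentioning exceptions or timeouts.
SUSPECT = re.compile(r"exception|timed out|timeout", re.IGNORECASE)


def suspect_lines(lines):
    """Yield (line_number, text) for log lines that mention an exception or timeout."""
    for number, line in enumerate(lines, start=1):
        if SUSPECT.search(line):
            yield number, line.rstrip("\n")
```

You could pass it an open file handle for the application log (for example `logs/nifi-app.log`, assuming the default log location) and review the matches around the time of the configuration change.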

 

If you found that this response helped with your query, please take a moment to log in and click "Accept as Solution" below this post.

Thank you,

Matt