I am having an issue where occasionally a node will drop out of the cluster and will not rejoin after restarting.
I am running NiFi 1.1.0 on VMs: a 3-node HDF cluster, with 2 of the nodes running NiFi and the third running Ambari/Ranger/LogSearch.
On the running node it will display 1/2 nodes connected. On the failed node the UI stays at the "voting on a flow" screen.
The only way I can get the node to reconnect to the cluster is by removing the flowfile_repository directory.
Unfortunately I can't post logs without significant cleansing, but if there is something in particular you need, I can post it.
Can you please share the exception stack trace from the failed node, where it says it failed to connect the node to the cluster? The log might explain the reason. You might have mismatched flows on the two nodes, and since you only have two nodes, NiFi is not able to come up with a majority of votes.
What reason is NiFi giving in the nifi-app.log for the node disconnections?
Rather than restarting the node that disconnects, did you try just clicking the reconnect icon in the cluster UI?
Verify that your nodes do not have trouble communicating with each other. Make sure there are no firewalls between the nodes affecting communications to the HTTP/HTTPS ports:
nifi.web.http.host=nifi-ambari-08.openstacklocal
nifi.web.http.port=8090
nifi.web.https.host=nifi-ambari-08.openstacklocal
nifi.web.https.port=9091
or node communication port:
nifi.cluster.node.address=nifi-ambari-08.openstacklocal
nifi.cluster.node.protocol.port=9088
Make sure both your nodes are properly configured to talk to ZK and that neither has issues communicating with it:
nifi.zookeeper.connect.string=nifi-ambari-09.openstacklocal:2181,nifi-ambari-07.openstacklocal:2181,nifi-ambari-08.openstacklocal:2181
nifi.zookeeper.connect.timeout=3 secs
nifi.zookeeper.root.node=/nifi
nifi.zookeeper.session.timeout=3 sec
All of the above settings are in the nifi.properties file.
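If it helps, here is a rough way to test the connectivity part from each node. It is only a sketch: it opens a plain TCP connection to each host and port, using the values from the properties quoted above as examples (substitute your own). The names can_connect, check_all, main and ENDPOINTS are made up for this example and are not part of NiFi. A successful connection only shows that the port is reachable through any firewall; it does not prove that NiFi or ZooKeeper is healthy.

```python
import socket

def can_connect(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within timeout seconds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and DNS lookup failures.
        return False

# Example endpoints copied from the nifi.properties values above (assumed; adjust per node).
ENDPOINTS = [
    ("nifi-ambari-08.openstacklocal", 8090),  # nifi.web.http.port
    ("nifi-ambari-08.openstacklocal", 9091),  # nifi.web.https.port
    ("nifi-ambari-08.openstacklocal", 9088),  # nifi.cluster.node.protocol.port
    ("nifi-ambari-07.openstacklocal", 2181),  # ZooKeeper
    ("nifi-ambari-08.openstacklocal", 2181),  # ZooKeeper
    ("nifi-ambari-09.openstacklocal", 2181),  # ZooKeeper
]

def check_all(endpoints, timeout=3.0):
    """Map each (host, port) pair to True/False depending on reachability."""
    return {(host, port): can_connect(host, port, timeout) for host, port in endpoints}

def main():
    """Print one line per endpoint."""
    for (host, port), ok in check_all(ENDPOINTS).items():
        print(f"{host}:{port} {'reachable' if ok else 'UNREACHABLE'}")
```

Call main() on both NiFi nodes and compare the output. Any port that is unreachable from one node but not from the other is a good place to start looking.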