I have a 3-node NiFi cluster in the production environment. All of the nodes have become disconnected from the NiFi cluster, even though every node is still in a running state.
The following is logged:
018-07-05 00:30:19,650 ERROR [Site-to-Site Worker Thread-90476] o.a.nifi.remote.SocketRemoteSiteListener Unable to communicate with remote instance Peer[url=<>] (SocketFlowFileServerProtocol[CommsID=9f090b3e-f5cd-4dae-b893-ef3ba983afef]) due to org.apache.nifi.cluster.exception.NoClusterCoordinatorException: No node has yet been elected Cluster Coordinator. Cannot establish connection to cluster yet.; closing connection
The NiFi Cluster Coordinator relies on ZooKeeper for its election. NiFi employs a Zero-Master Clustering paradigm. Each node in the cluster performs the same tasks on the data, but each operates on a different set of data. One of the nodes is automatically elected (via Apache ZooKeeper) as the Cluster Coordinator.
All nodes in the cluster will then send heartbeat/status information to this node, and this node is responsible for disconnecting nodes that do not report any heartbeat status for some amount of time. Additionally, when a new node elects to join the cluster, the new node must first connect to the currently-elected Cluster Coordinator in order to obtain the most up-to-date flow. If the Cluster Coordinator determines that the node is allowed to join (based on its configured Firewall file), the current flow is provided to that node, and that node is able to join the cluster, assuming that the node’s copy of the flow matches the copy provided by the Cluster Coordinator. If the node’s version of the flow configuration differs from that of the Cluster Coordinator’s, the node will not join the cluster.
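Ensure you have a ZooKeeper ensemble of 3 servers managing your NiFi cluster, and that all of them are up and running. ZooKeeper needs a majority of the ensemble (2 of 3) to be available; without that, no Cluster Coordinator can be elected.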
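As a quick way to verify that, here is a minimal Python sketch that sends ZooKeeper's "ruok" four-letter command to each server and reports whether a majority answered "imok". The function names and the default port 2181 are my own assumptions for illustration and should be adjusted to your setup. On ZooKeeper 3.5 and later, "ruok" may also need to be enabled through the four-letter-word whitelist before the server will answer it.

```python
import socket


def zk_is_ok(host, port=2181, timeout=3.0):
    """Send the 'ruok' four-letter command; return True if the reply is 'imok'."""
    try:
        with socket.create_connection((host, port), timeout=timeout) as sock:
            sock.sendall(b"ruok")
            reply = b""
            # The server answers with 4 bytes and then closes the connection.
            while len(reply) < 4:
                chunk = sock.recv(16)
                if not chunk:
                    break
                reply += chunk
            return reply == b"imok"
    except OSError:
        # Covers refused connections, timeouts and name-resolution failures.
        return False


def has_quorum(servers, check=zk_is_ok):
    """Return (quorum_ok, healthy_servers) for a list of (host, port) pairs."""
    healthy = [s for s in servers if check(*s)]
    # ZooKeeper needs a strict majority of the ensemble to be available.
    return len(healthy) > len(servers) // 2, healthy
```

For example, calling has_quorum with a list such as [("zk1.example.com", 2181), ("zk2.example.com", 2181), ("zk3.example.com", 2181)] (placeholder host names) returns a flag plus the servers that responded. A healthy "ruok" reply only shows that the server process is running, so treat this as a first check rather than proof that the ensemble has elected a leader.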
I should have said that this was using the embedded option for the coordinator. I managed to find the solution by way of an old mailing list thread (from 2016) that suggested starting the nodes at the same time, rather than one after the other. The reason is that if you start up only one node, it goes into a connection frenzy, repeatedly failing to contact the other nodes and quickly hitting the connection limit (leaving lots of dangling connections in the "close wait" state).
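In case it helps anyone confirm the same symptom, below is a small Python sketch that tallies TCP connection states from netstat output so a pile-up of CLOSE_WAIT entries is easy to spot. It assumes the Linux "netstat -ant" column layout, where the state is the sixth column. The function name is just something I made up for illustration.

```python
from collections import Counter


def count_tcp_states(netstat_output):
    """Count TCP connections per state from Linux 'netstat -ant' style text."""
    states = Counter()
    for line in netstat_output.splitlines():
        parts = line.split()
        # Data rows look like: proto recv-q send-q local foreign state
        if len(parts) >= 6 and parts[0].startswith("tcp"):
            states[parts[5]] += 1
    return states
```

You could feed it the text captured from running netstat -ant on a node and then look at the CLOSE_WAIT entry of the result. A count that keeps growing while only one node is up would match the behaviour described above.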