
Site-to-site protocol load balancing

Hi,

I'd like to understand how the receiving end of the site-to-site protocol works. The sending side drops a remote process group on the canvas and is mostly done. The receiving side is simple in the case of a single NiFi node. In a cluster, though, we still need to specify an FQDN to connect to. What is the best practice there? If we put a load balancer in front, would it break batch affinity (when site-to-site batches up sends for you)?

3 REPLIES

Expert Contributor

That's a great question! If you are connecting to a cluster, you simply use the URL of the Cluster Manager, i.e., the URL that you would use in your browser to go to that cluster. NiFi will reach out, recognize that it is indeed a cluster, and be given a list of the nodes in the cluster as well as their relative busyness (currently based on how many FlowFiles each node has queued up in its flow). The sending side then uses this information to load balance across the entire cluster.

So, for example, if we are sending to a cluster with two nodes, Node A and Node B, and Node A has 10 FlowFiles queued while Node B has 500, the sending side will send more data to Node A than to Node B. It may send 5 FlowFiles to Node A, then 1 to Node B, then 5 to Node A, and so on. After 60 seconds, it will poll the Cluster Manager again to determine how busy each node is and automatically adjust how much data it pushes to each of them.
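
To make the proportional idea concrete, here is a rough Java sketch of queue-aware peer selection. It is illustrative only, not NiFi's actual code: the class name, the refresh method, and the weighting formula are all made up; the point is simply that nodes with fewer FlowFiles queued get more slots in the send rotation.

    import java.util.ArrayList;
    import java.util.List;
    import java.util.Map;

    // Illustrative only: favors nodes with fewer queued FlowFiles.
    public class WeightedPeerSelector {

        private List<String> sendPlan = new ArrayList<>();
        private int next = 0;

        // queuedCounts: hostname -> FlowFiles currently queued on that node,
        // as reported by the cluster status poll (refreshed roughly every 60 seconds).
        public synchronized void refresh(Map<String, Integer> queuedCounts) {
            long total = queuedCounts.values().stream().mapToLong(Integer::longValue).sum();
            List<String> plan = new ArrayList<>();
            for (Map.Entry<String, Integer> e : queuedCounts.entrySet()) {
                // Each node gets at least one slot, and more slots the less it has queued.
                long share = (total == 0)
                        ? 1
                        : Math.max(1, Math.round(10.0 * (total - e.getValue()) / total));
                for (long i = 0; i < share; i++) {
                    plan.add(e.getKey());
                }
            }
            this.sendPlan = plan;
            this.next = 0;
        }

        // Returns the node the next batch of FlowFiles should be sent to.
        public synchronized String nextPeer() {
            if (sendPlan.isEmpty()) {
                throw new IllegalStateException("No peers known yet");
            }
            String peer = sendPlan.get(next % sendPlan.size());
            next++;
            return peer;
        }
    }

With the example above (Node A: 10 queued, Node B: 500 queued), the plan ends up heavily weighted toward Node A, which mirrors the 5-to-1 style pattern described, even though NiFi's exact ratios are computed differently.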

If the Cluster Manager is not available, the sending side will continue to push data to all of the nodes that it knows about, but it will assume that all nodes should receive equal weighting, i.e., it will send 10 FlowFiles to Node A, then 10 to Node B, then 10 to Node A, and so on.
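
The fallback is just an even rotation over the last-known node list. A minimal sketch, again with made-up names rather than NiFi's real classes:

    import java.util.List;

    // Illustrative fallback: with no fresh status from the Cluster Manager,
    // rotate evenly over the last-known node list, sending a fixed-size batch
    // (e.g. 10 FlowFiles) to each node in turn.
    public class EqualWeightFallback {

        private final List<String> lastKnownNodes;
        private int position = 0;

        public EqualWeightFallback(List<String> lastKnownNodes) {
            this.lastKnownNodes = lastKnownNodes;
        }

        // Node that the next batch of FlowFiles should be pushed to.
        public String nextNode() {
            String node = lastKnownNodes.get(position % lastKnownNodes.size());
            position++;
            return node;
        }
    }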

Thanks Mark, that was a great and exhaustive answer (I'm thinking of how to express it in a slide for an advanced-level deck).

I guess control-plane HA for the NCM itself is the next bastion; it will probably require some changes on the client side (e.g. specifying a list of failover NCM nodes to cycle through) as well as some UI updates to support it. Is my understanding correct?
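
The client-side change being proposed here would amount to something like the following sketch: try each configured NCM URL in order until one answers. This is purely hypothetical, the class and the notion of a configured failover list are not existing NiFi features or APIs.

    import java.io.IOException;
    import java.net.HttpURLConnection;
    import java.net.URL;
    import java.util.List;

    // Sketch of the failover idea from the question; all names are hypothetical.
    public class FailoverNcmClient {

        private final List<String> ncmUrls; // e.g. ["https://ncm1:8080/nifi", "https://ncm2:8080/nifi"]

        public FailoverNcmClient(List<String> ncmUrls) {
            this.ncmUrls = ncmUrls;
        }

        // Returns the first NCM URL that answers an HTTP request, or throws if none do.
        public String findReachableNcm() throws IOException {
            IOException lastFailure = null;
            for (String url : ncmUrls) {
                try {
                    HttpURLConnection conn = (HttpURLConnection) new URL(url).openConnection();
                    conn.setConnectTimeout(5000);
                    conn.setRequestMethod("GET");
                    conn.getResponseCode(); // any response means this NCM is reachable
                    return url;
                } catch (IOException e) {
                    lastFailure = e; // try the next URL in the failover list
                }
            }
            throw new IOException("No configured NCM could be reached", lastFailure);
        }
    }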

Expert Contributor

Andrew, to put this in a few bullets for a slide, I'd go with:

  • Automatically Load Balanced
  • Fault Tolerant
  • Supports Back Pressure

When we set up the HA NCM, I don't think there will necessarily be any need to change anything on the client side. Because NiFi provides interactive command-and-control with immediate feedback, we can supply just a single URL. If NiFi cannot contact that URL, it could ask for another. However, if it is able to contact the URL (which it should be the vast majority of the time), it can auto-discover the other nodes from that one.
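
Loosely sketched, that single-URL behavior looks like the following. The class and method names are hypothetical stand-ins, not the real client API: contact the one configured URL, learn the current peer list from it, and keep the last list as a fallback if that URL is briefly unreachable.

    import java.util.Collections;
    import java.util.List;

    // Sketch of single-URL auto-discovery: the client is given one entry-point URL
    // and learns the rest of the cluster from it. Names are hypothetical.
    public abstract class PeerDiscovery {

        private final String configuredUrl;                         // the one URL the user supplies
        private List<String> cachedPeers = Collections.emptyList(); // last peer list we saw

        protected PeerDiscovery(String configuredUrl) {
            this.configuredUrl = configuredUrl;
        }

        // Stand-in for a real "list the cluster members" request to the given node.
        protected abstract List<String> fetchPeersFrom(String url) throws Exception;

        public List<String> currentPeers() {
            try {
                // Normal case: auto-discover the full node list from the single configured URL.
                cachedPeers = fetchPeersFrom(configuredUrl);
            } catch (Exception e) {
                // Entry point unreachable right now: keep using the peers we already know about.
            }
            return cachedPeers;
        }
    }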