Created 05-07-2018 03:41 PM
What would be the recommendation for a 40 Gbps NiFi cluster?
And
How would load balancing be handled with clustering? Would one IP be given for the cluster, let's say with one HandleHTTP processor, and the network load then be distributed to the other nodes?
Thank you!
Created 05-07-2018 03:57 PM
How many incoming connections are you expecting at any given time?
Created 05-07-2018 04:07 PM
I would say at least 1000 incoming connections at any given time.
Created 05-07-2018 04:26 PM
Ok, what version of NiFi are you running?
Created 05-07-2018 04:35 PM
NiFi is a very difficult thing to make a one-size-fits-all sizing recommendation for. NiFi does not typically scale linearly, which is why you see hardware specs increase sharply as throughput increases. In most cases, typical NiFi deployments grow in size and complexity as the volume of throughput increases: more and more workflows get added.
-
Different NiFi processors in different workflows contribute to different server resource usage, and that usage varies based on processor configuration and FlowFile volume. So even two workflows using the same processors may have different sizing needs.
-
How well a NiFi cluster is going to perform has a lot to do with the workflow the user has built. After all, it is this user-designed workflow that will be using the majority of the resources on each node.
-
The best answer, to be honest, is to build your workflows and stress test them in a kind of modeling and simulation setup. Learn the boundaries your workflows put on your hardware: at what data volume does CPU utilization, network bandwidth, memory load, or disk I/O become the bottleneck for my specific workflow(s)? Tweak your workflows and component configurations accordingly. Then scale out by adding more nodes, allowing some headroom, since it is very unlikely every node will be processing the exact same number of FlowFiles all the time.
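-
As a minimal, hypothetical sketch of that kind of stress test (all names here are assumptions: a ListenHTTP processor on port 8081 with the default contentListener base path; adjust host, payload size, and concurrency to your own flow):
```python
import time
from concurrent.futures import ThreadPoolExecutor

import requests

# Hypothetical endpoint -- adjust to your ListenHTTP host/port/base path.
URL = "http://nifi-node-1:8081/contentListener"
PAYLOAD = b"x" * 64 * 1024          # 64 KB test payload
REQUESTS_PER_WORKER = 500
WORKERS = 32

def worker() -> int:
    """POST a fixed payload repeatedly; return the count of 200 responses."""
    ok = 0
    with requests.Session() as s:
        for _ in range(REQUESTS_PER_WORKER):
            r = s.post(URL, data=PAYLOAD)
            ok += r.status_code == 200
    return ok

start = time.time()
with ThreadPoolExecutor(max_workers=WORKERS) as pool:
    results = list(pool.map(lambda _: worker(), range(WORKERS)))
elapsed = time.time() - start

total = sum(results)
mb = total * len(PAYLOAD) / 1e6
print(f"{total} requests, {mb:.1f} MB in {elapsed:.1f}s ({mb / elapsed:.1f} MB/s)")
```
Ramp WORKERS and the payload size while watching CPU, network, memory, and disk I/O on the NiFi nodes to find which resource saturates first.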
-
There are numerous ways to handle load balancing. It really depends on your dataflow design choices and how you intend to get data into your NiFi. Keep in mind that each NiFi node in a cluster runs its own copy of the dataflows you build, has its own set of repositories, and thus works on its own set of FlowFiles.
-
While NiFi's listener-type processors benefit from an external load balancer to direct incoming data across all nodes, processors like ConsumeKafka can run on all nodes consuming from the same topic (assuming a balanced number of Kafka partitions).
-
Other protocols like SFTP are not cluster friendly, so in dataflows like that you can only have something like the ListSFTP processor running on one node at any given time. To achieve load balancing there, a flow typically looks like:
ListSFTP (configured to run on the primary node only) --> Remote Process Group (used to redistribute/load-balance the 0-byte FlowFiles to the rest of the nodes) --> Input Port --> FetchSFTP (pulls the content for each FlowFile).
-
One thing you do not want to do in most cases is load-balance the NiFi UI. You can do this, but you need to make sure you use sticky sessions in your load balancer. The tokens issued for user authentication (LDAP or Kerberos) are only good for the node that issued them, so subsequent requests from that user must go to the same node.
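-
A small sketch of why sticky sessions matter, using NiFi's standard /access/token and /flow/current-user REST endpoints (the node hostnames, credentials, and CA bundle are hypothetical placeholders):
```python
import requests

# Hypothetical node addresses for a two-node, LDAP-secured cluster.
NODE_1 = "https://nifi-node-1:9443/nifi-api"
NODE_2 = "https://nifi-node-2:9443/nifi-api"
CA = "ca.pem"  # hypothetical CA bundle trusted for the cluster certs

# Request a token from node 1 (NiFi returns the JWT as the response body).
token = requests.post(
    f"{NODE_1}/access/token",
    data={"username": "jdoe", "password": "secret"},
    verify=CA,
).text
headers = {"Authorization": f"Bearer {token}"}

# Works: the node that issued the token can validate it.
print(requests.get(f"{NODE_1}/flow/current-user",
                   headers=headers, verify=CA).status_code)  # expect 200

# Fails: node 2 did not issue this token, so a load balancer without
# sticky sessions would bounce the user between 200s and 401s.
print(requests.get(f"{NODE_2}/flow/current-user",
                   headers=headers, verify=CA).status_code)  # expect 401
```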
-
Hope this gives you some direction.
-
Thanks,
Matt
Created 05-07-2018 04:53 PM
Matt,
Thank you for your thorough feedback!
How does NiFi handle load balancing on its own if there was one ListenHTTP processor on a cluster?
Would we give the communicating customer one IP to reach us, and then NiFi would broker that connection to one of the nodes in the cluster? Or would the data come through one node and then be distributed to the other nodes?
Basically I'm trying to use NiFi clustering as the load balancer 🙂 Any suggestions? If that doesn't work, I was going to try out HAproxy in front of the cluster IP.
John
Created 05-07-2018 05:02 PM
The ListenHTTP processor works just like any of our other Listen-based processors. It should be configured to run on every node so that every node can receive data. The Listen-based processors are configured to listen on a specific port, so the endpoint for a ListenHTTP would be something like:
-
http(s)://<hostname>:<listenerport>/<base path>
-
You could have an external load balancer that is configured to receive all your inbound traffic and load-balance it across all the nodes in the NiFi cluster.
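-
For illustration, here is what that distribution amounts to from the client's perspective: a hypothetical round-robin sender posting to each node's ListenHTTP endpoint in turn (node hostnames and the default contentListener base path are assumptions; a real load balancer does this transparently behind a single address):
```python
import itertools

import requests

# Hypothetical node endpoints; a load balancer would hide these
# behind one virtual IP or hostname.
NODES = [
    "http://nifi-node-1:8081/contentListener",
    "http://nifi-node-2:8081/contentListener",
    "http://nifi-node-3:8081/contentListener",
]
ring = itertools.cycle(NODES)

def send(payload: bytes) -> int:
    """Round-robin one payload to the next node's ListenHTTP endpoint."""
    return requests.post(next(ring), data=payload).status_code

for i in range(9):
    print(send(f"record-{i}".encode()))
```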
-
You could also install NiFi or MiNiFi at each of your data sources and use NiFi's Site-To-Site (S2S) protocol to load-balance the delivery of FlowFiles to this target cluster.
-
Listen-based processors are not ideal for the Listen (primary node) --> RPG (S2S) --> Input Port (all nodes) --> rest-of-dataflow model. That is because a Listen-based processor receives the entire payload, which means your primary node has to handle a lot of writes to its content repository (all the data) before sending that data across the network to the other nodes for redistribution. That can be an expensive waste of resources, which is why load balancing with this type of processor is better done out in front of NiFi.
-
Thanks,
Matt
Created 05-07-2018 06:14 PM
I have recently built out an HDF environment for a Fortune 1 retail company to handle 1-2k connections per node and move an average of 1-1.5 TB a day. We utilized the HandleHTTP processors, as MiNiFi was not an option at project conception. If you are using the HandleHTTPRequest/Response processors, note that there is a bug which causes objects to not be released correctly, causing heap utilization to climb linearly. Our workaround was to use the API to stop/start the HandleHTTPRequest processor when the heap reached 70%.
This bug was corrected in the 1.6 release of NiFi but had not been rolled up into an HDF release as of my last check. So, handling that kind of volume will cause the same scenario in your situation. If you can use ListenHTTP (or MiNiFi, as Matt suggested), you should be fine.
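A rough sketch of that watchdog workaround using NiFi's REST API (the system-diagnostics endpoint for heap utilization and the processor run-status endpoint to stop/start; the host, processor ID, token, and sleep intervals are all hypothetical and need tuning):
```python
import time

import requests

API = "https://nifi-node-1:9443/nifi-api"       # hypothetical node
PROCESSOR_ID = "<handlehttprequest-uuid>"        # hypothetical processor ID
HEADERS = {"Authorization": "Bearer <token>"}    # obtained via /access/token
HEAP_LIMIT = 70.0                                # percent

def heap_pct() -> float:
    diag = requests.get(f"{API}/system-diagnostics", headers=HEADERS).json()
    # heapUtilization comes back as a string like "70.0%"
    snap = diag["systemDiagnostics"]["aggregateSnapshot"]
    return float(snap["heapUtilization"].rstrip("%"))

def set_state(state: str) -> None:
    # The run-status endpoint requires the processor's current revision.
    proc = requests.get(f"{API}/processors/{PROCESSOR_ID}",
                        headers=HEADERS).json()
    requests.put(
        f"{API}/processors/{PROCESSOR_ID}/run-status",
        headers=HEADERS,
        json={"revision": proc["revision"], "state": state},
    )

while True:
    if heap_pct() >= HEAP_LIMIT:
        set_state("STOPPED")   # pause intake so held objects can be released
        time.sleep(30)         # crude settle window; tune as needed
        set_state("RUNNING")
    time.sleep(60)
```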
We were utilizing external load balancers since we were running three clusters in separate data centers. The plan for the next phase is to start utilizing MiNiFi in the edge environments and point the different systems feeding data into HDF at those MiNiFi HTTP listeners. If you are running a single cluster, as Matt mentioned, that would load balance for you.