Created 01-23-2017 01:13 AM
Hello,
We are about to move from a standalone NiFi instance to a NiFi cluster. We did our dataflow development on the standalone instance, and looking at the NiFi Admin documentation, it appears that some of the processors in our dataflow need to be reconfigured to work in a cluster environment.
From the admin guide...
"...In a NiFi cluster, the same dataflow runs on all the nodes. As a result, every component in the flow runs on every node. However, there may be cases when the DFM would not want every processor to run on every node.
...the GetSFTP processor pulls from a remote directory, and if the GetSFTP on every node in the cluster tries simultaneously to pull from the same remote directory, there could be race conditions. Therefore, the DFM could configure the GetSFTP on the Primary Node to run in isolation, meaning that it only runs on that node. It could pull in data and -with the proper dataflow configuration- load-balance it across the rest of the nodes in the cluster...."
We have GetSFTP and ListenTCP processors that fall into the case mentioned above. The phrase "with the proper dataflow configuration" in the admin guide suggests that the data from these processors can be load-balanced across the nodes, but the guide doesn't say what configuration is needed. Any suggestions or guidance would be greatly appreciated.
I'm thinking that the PutHDFS processor does not need any special configuration for a cluster environment, since on each node it can write to HDFS whatever data is sent to it (from the dataflow on that node), and no coordination is needed between the PutHDFS processors running on the various nodes. Is my assumption correct?
I did find some guidance on how to configure PutEmail in a cluster environment.
Thanks in advance.
Created 01-23-2017 09:49 AM
Hi @Raj B,
Regarding the processors you mention:
- GetSFTP: if you want to load balance access to the SFTP server, I'd recommend using ListSFTP running on the primary node and FetchSFTP on all the nodes. This way the actual download of the files will be load-balanced across your cluster without concurrent access to the same files (see the sketch at the end of this post).
- Regarding ListenTCP, it depends whether you want all the nodes of your cluster listening on a given port for TCP connections (with a load balancer in front of NiFi, for example) or only one node listening on that port. Keep in mind, however, that there is no way to ensure that a given node will be the primary node, so having a client connect to a specific node is not recommended from an HA point of view.
- PutHDFS is fine as long as two nodes are not writing to the same path in HDFS.
Finally, there is one important point to remember about load balancing data after a processor configured to run on the primary node only: if you do nothing special, all the flow files will remain on the primary node from the beginning to the end of the flow. To load-balance the data, you need to connect your input processor to a Remote Process Group that points back to the cluster itself. This way the data will actually be load-balanced across the cluster. You will find an example at the end of this post: https://pierrevillard.com/2016/08/13/apache-nifi-1-0-0-cluster-setup
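To make the layout concrete, here is a rough sketch of that pattern (the input port name "From Listing" is just a placeholder):

ListSFTP (Scheduling > Execution: Primary node)
   |
   v
Remote Process Group (pointing to this same cluster's own URL)
   |   (site-to-site redistributes the flow files across all nodes)
   v
Input Port "From Listing" (on the root canvas, receives data on every node)
   |
   v
FetchSFTP (runs on every node; each node downloads its share of the files)
   |
   v
rest of the flow (PutHDFS, etc.)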
Hope this helps.
Created 01-23-2017 06:23 PM
Thanks @Pierre Villard, I'll look at the example you provided.
Just want to make sure I understood you correctly regarding ListenTCP. You're saying it's better for each node to listen on a port for incoming data in an HA setup, so if the sending system is sending data to one IP address, we need a load balancer system/server that ingests that data and distributes it (in round-robin or some other fashion) to the various NiFi nodes. Is that right?
Created 01-23-2017 08:31 PM
Let's say that if you don't configure your ListenTCP to run on the primary node only, it will run on every node. You can of course give one of the IP addresses of your cluster to the client opening TCP connections, but then the load is not balanced, and if this node dies, the client is not able to open connections anymore. However, if you set up a load balancer in front of your cluster, you will have a VIP (virtual IP) to give to your client, and connections will be opened in a round-robin manner across every node of the cluster (then you have both HA and load balancing). It is generally up to the load balancer to ensure that every node of the cluster is alive, with something like a heartbeat.
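As an illustration only (nothing NiFi-specific), a minimal TCP load-balancing configuration for HAProxy could look like the sketch below; the port 9999 and the node hostnames are placeholder assumptions, and each node's ListenTCP would be listening on that same port:

frontend nifi_listen_tcp
    bind *:9999                  # the VIP/port given to the sending system
    mode tcp
    default_backend nifi_nodes

backend nifi_nodes
    mode tcp
    balance roundrobin           # spread TCP connections across the nodes
    server nifi1 nifi-node1.example.com:9999 check   # "check" enables the health probe
    server nifi2 nifi-node2.example.com:9999 check
    server nifi3 nifi-node3.example.com:9999 check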
As a general comment, each time you need a ListenX processor, you will certainly be in a situation where you need a load balancer and a virtual IP.
Let me know if you have other questions.
Created 02-01-2017 03:00 PM
Thanks a lot @Pierre Villard
Created 02-07-2017 07:56 PM
Regarding what you said about needing a load balancer any time a ListenX processor is used, and giving a virtual IP to the sending system: is Kafka, which is now part of HDF, able to act as a load balancer? Or does it have to be something external, like HAProxy?
Thanks.
Created 02-07-2017 08:02 PM
You can ask your sending system to send data to Kafka and then have NiFi pull the data from Kafka, but the problem will probably be the same between your sender and Kafka (you will probably need to give the list of your Kafka brokers to the sender, and that is not ideal when you scale up/down). Regarding NiFi and Kafka, I recommend the following article:
http://bryanbende.com/development/2016/09/15/apache-nifi-and-apache-kafka
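To illustrate the coupling I mean, here is a minimal sketch of a sender written against the plain Kafka producer API; the broker addresses and the topic name "ingest-topic" are placeholder assumptions:

import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

public class SenderToKafka {
    public static void main(String[] args) {
        Properties props = new Properties();
        // The sender has to know the broker list up front -- the coupling
        // mentioned above; these addresses are placeholders.
        props.put("bootstrap.servers", "broker1:9092,broker2:9092");
        props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            // "ingest-topic" is a hypothetical topic that a ConsumeKafka
            // processor on the NiFi side would then read from
            producer.send(new ProducerRecord<>("ingest-topic", "key-1", "hello from the sender"));
        }
    }
}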
Created 02-07-2017 08:25 PM
Thanks @Pierre Villard. I read the article; it talks about NiFi-Kafka integration (within the HDF cluster), which is quite helpful. But my interest is still the interface between the HDF cluster and the sending system.
Is it possible to set up a virtual IP on the HDF cluster so that the sending system can send messages to that IP? If so, would that be done by a system-level application (at the OS level), or can HDF do it? Sorry for my ignorance, but I'm not very knowledgeable about networking, IPs, etc.
Created 02-07-2017 08:38 PM
AFAIK HDF does not provide a solution for load balancing (unless you implement something yourself through ZooKeeper). You need specific hardware equipment or a software solution like HAProxy (in a Docker container, for example).
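For example, a sketch of running the official HAProxy image with a configuration like the one earlier in this thread mounted in (the local path is a placeholder):

docker run -d --name nifi-haproxy \
    -p 9999:9999 \
    -v /path/to/haproxy.cfg:/usr/local/etc/haproxy/haproxy.cfg:ro \
    haproxy:1.7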
Created 02-07-2017 09:37 PM
Thank you @Pierre Villard