Created 05-09-2018 04:36 PM
I just noticed that ListSFTP can use a distributed cache controller. This is confusing to me because I thought we were supposed to only run ListSFTP on the primary node, and rebalance filenames via S2S RPG.
In addition to the distributed cache, it also seems to store state. This is confusing to me because if we use a distributed cache controller, why would the ListSFTP need to store state?
What is the current best practice for resilient, parallelized SFTP? If I use a distributed cache, does that mean I can just schedule my ListSFTP to run on all nodes? Can someone help me understand what is going on here? Thanks!
Created 05-09-2018 04:45 PM
The DistributedMapCache controller service is used by the ListSFTP processor to store state information and has nothing to do with distribution of FlowFiles to all nodes in a NiFi cluster.
-
The ListSFTP processor should still be scheduled to run on the primary node only. Its output should then be fed to an RPG to redistribute the listed FlowFiles to all nodes before a FetchSFTP processor retrieves the content.
-
ZooKeeper is used to elect a primary node in your NiFi cluster, and from time to time a new primary node may be elected. The purpose of the DistributedMapCache is to store the last known state of the running ListSFTP processor somewhere all nodes can access. That way, when a primary node change occurs, the ListSFTP processor that starts running on a different node does not list files the previous primary node's ListSFTP processor already listed.
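-
For reference, here is a minimal sketch of pinning ListSFTP to the primary node via the NiFi REST API (the host and processor id below are placeholders; the same setting is available in the UI on the processor's Scheduling tab as "Execution: Primary node"):

```python
import requests

NIFI_API = "http://nifi-host:8080/nifi-api"   # placeholder host
PROCESSOR_ID = "<listsftp-processor-uuid>"    # placeholder id

# Fetch the current processor entity so the update carries the latest revision.
entity = requests.get(f"{NIFI_API}/processors/{PROCESSOR_ID}").json()

# Schedule the processor on the primary node only, equivalent to choosing
# "Execution: Primary node" on the Scheduling tab.
update = {
    "revision": entity["revision"],
    "component": {
        "id": PROCESSOR_ID,
        "config": {"executionNode": "PRIMARY"},
    },
}
requests.put(f"{NIFI_API}/processors/{PROCESSOR_ID}", json=update)
```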
-
Thanks,
Matt
Created 05-09-2018 04:59 PM
Thanks @Matt Clarke. But if processor state alone cannot be used to handle primary node changes, how do processors like GenerateTableFetch work without a DistributedMapCache service? Both ListSFTP and GenerateTableFetch mention in their docs that they store cluster-scoped state, but only ListSFTP can also make use of a cache service. What am I missing here?
Created 05-09-2018 05:26 PM
Sorry for leaving out some of the details.
The ListSFTP and FetchSFTP processors were originally developed and added to NiFi back in Apache NiFi 0.4.0. The 0.x versions of NiFi did not use ZooKeeper for clustering or state. ZooKeeper (ZK) was introduced with the major redesign work that went into Apache NiFi 1.x.
-
To avoid issues for users upgrading from 0.x to 1.x+ versions of NiFi, the DistributedMapCache properties remained on all the state-keeping processors created prior to the 1.x release. NiFi 1.x and newer versions automatically store state in ZK. Having the DistributedMapCache configured allows state previously stored in a cache server to be read, and NiFi then writes that state to ZK moving forward.
-
So in newer versions (NiFi 1.x+), there is no need to use the DistributedMapCache property.
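-
If you want to sanity check where that state lives now, here is a minimal sketch against the standard REST API (host and processor id are placeholders); in a cluster, the cluster-scoped entries are the ones kept in ZooKeeper:

```python
import requests

NIFI_API = "http://nifi-host:8080/nifi-api"   # placeholder host
PROCESSOR_ID = "<listsftp-processor-uuid>"    # placeholder id

# Ask NiFi for the processor's stored state entries.
state = requests.get(
    f"{NIFI_API}/processors/{PROCESSOR_ID}/state"
).json()["componentState"]

# In a cluster, the CLUSTER-scoped state is the part backed by ZooKeeper.
cluster_state = state.get("clusterState") or {}
for entry in cluster_state.get("state", []):
    print(entry["key"], "=", entry["value"])
```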
-
Thanks,
Matt
Created 05-09-2018 05:31 PM
Thanks @Matt Clarke, what would we do without you!?
Created 05-29-2018 09:51 AM
Thanks @Matt Clarke. To load balance the output from ListSFTP (and similar single-node processors), could you make a do-nothing RPG on the root canvas (same cluster as the caller) that simply returns FlowFiles for continued load-balanced processing by the caller of the RPG? That is, will a round-trip through an RPG make the (ListSFTP) FlowFiles load balanced after that? It could look like: ListSFTP => round-trip over RPG load balancer (no action) => continued balanced flow of FlowFiles from ListSFTP.
The reason for a generic (no-action) load balancer RPG is that I would like to keep all flow logic within the hierarchical PG(s) where the project is set up, and not suddenly break off to the root canvas. I have multiple different credentials and settings in use for FetchSFTP across projects, and I don't see how to elegantly move all of those out to the root canvas for load balancing.
Could this be solved more natively / behind the scenes by NiFi in future upgrades, so users don't have to take care of this load balancing so explicitly after a single-node processor?
/Henrik
Created 05-29-2018 01:22 PM
We completely understand that load-balanced redistribution of FlowFiles via an RPG is not the most elegant solution. There was consideration of separating input/output ports into two different components (local and remote). In addition to the complexity of this, we also have to consider backward compatibility: what impact would this have on NiFi users upgrading with flows developed in previous versions of NiFi?
-
Another option being looked at is adding the ability to enable load balancing within the cluster directly on any connection. This would be a new configuration option on existing connections, with the default behaving as connections do now. By enabling the load-balancing option, FlowFiles would be load-balanced across all connected nodes behind the scenes automatically. There are still technical hurdles here and no timetable for this effort as of now.
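-
Purely as a speculative sketch of that idea (the field name below is an assumption about how such a feature might be exposed, not an existing option): if a per-connection setting ships, enabling it through the REST API could look something like this:

```python
import requests

NIFI_API = "http://nifi-host:8080/nifi-api"   # placeholder host
CONNECTION_ID = "<connection-uuid>"           # placeholder id

# Hypothetical: switch a connection to round-robin load balancing.
# "loadBalanceStrategy" is an assumed field name for a feature that is
# still under consideration; it does not exist in the version discussed here.
entity = requests.get(f"{NIFI_API}/connections/{CONNECTION_ID}").json()
update = {
    "revision": entity["revision"],
    "component": {
        "id": CONNECTION_ID,
        "loadBalanceStrategy": "ROUND_ROBIN",
    },
}
requests.put(f"{NIFI_API}/connections/{CONNECTION_ID}", json=update)
```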
-
Thank you,
Matt