
Note: This article was written as of HDF 2.1.2 release which is based off Apache NiFi 1.1.0.

NiFi has several processors that can be used to retrieve files from an SFTP server. It is important to understand the different capabilities each provides so you know when to use one versus another.

Let's start with the oldest of the available processors: GetSFTP

The GetSFTP processor is the original processor introduced for retrieving files from a remote SFTP server. The processor has the following configurable properties:

[Screenshot: GetSFTP processor configurable properties]

Things to know about this processor:

1. Disadvantage: This processor does not retain any state. This means it does not keep track of which files it has previously retrieved. So if the "Delete Original" property is set to "false", this processor will continue to retrieve the same files over and over again.

2. This processor is not cluster friendly, meaning in a NiFi cluster it should be set to run on "primary node only" so that every node in the cluster is not competing to pull the same data.

3. In a NiFi cluster, the data retrieved by the GetSFTP processor should be redistributed across all nodes before further processing is done. This spreads the workload across all nodes so the "primary node" is not doing all the work.

[Screenshot: sample flow using a Remote Process Group to redistribute GetSFTP data]

The above sample shows a "Remote Process Group" being used to redistribute the data from the GetSFTP processor to all nodes within the cluster.

*** As of the Apache NiFi 1.8 release, a new capability has been added to NiFi to facilitate easy redistribution of FlowFiles without needing to use an RPG. It can be done by simply configuring any connection to perform the redistribution. This blog explains that new capability very well: https://blogs.apache.org/nifi/entry/load-balancing-across-the-cluster ***


Disadvantage: All data content is pulled into the primary node before being distributed across the cluster.

Disadvantage: Since the GetSFTP processor needs to delete the source file in order to prevent continuous consumption, the data is unavailable to other users/servers.
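
To make the "no state" problem concrete, here is a minimal Python sketch. It is not NiFi code: the dict standing in for the SFTP directory and the function names `stateless_get` and `run_polls` are hypothetical, chosen only to illustrate a poll that remembers nothing between runs.

```python
# Toy model of a stateless "get" poll. The dict stands in for an SFTP
# directory; nothing here is a NiFi API.
def stateless_get(remote_files, delete_original):
    """Return (name, content) for every remote file, remembering nothing between calls."""
    fetched = list(remote_files.items())
    if delete_original:
        # Removing the source is the only protection against duplicates.
        remote_files.clear()
    return fetched


def run_polls(delete_original, polls=3):
    """Poll the same toy server several times and collect every file name retrieved."""
    remote = {"a.csv": b"1,2,3", "b.csv": b"4,5,6"}
    retrieved = []
    for _ in range(polls):
        retrieved.extend(name for name, _ in stateless_get(remote, delete_original))
    return retrieved, remote


kept, remote_after_keep = run_polls(delete_original=False)
deleted, remote_after_delete = run_polls(delete_original=True)
print(kept)     # every file shows up once per poll
print(deleted)  # every file shows up once, but the toy server is now empty
```

With deletion turned off, each poll returns the same files again. With deletion turned on, duplicates stop, but nothing is left on the server for anyone else to read.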

Note: This processor has been deprecated in favor of the newer ListSFTP and FetchSFTP processors. It still exists to maintain backwards compatibility for NiFi users.

-----------------------------------------------------

Now let's talk about the ListSFTP and FetchSFTP processors and which of the disadvantages above they solve.

The ListSFTP processor is designed to connect to an SFTP server just like GetSFTP did; however, it does not actually retrieve the data. Instead it creates a 0 byte FlowFile for every file it lists from the SFTP server.

The FetchSFTP processor takes these 0 byte FlowFiles as input and actually retrieves the associated data and inserts it into the FlowFile content at that time.
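
The list/fetch split described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `FlowFile` class here is a toy stand-in (not NiFi's), the dict plays the role of the SFTP server, and carrying a single `filename` attribute is a simplification.

```python
from dataclasses import dataclass


@dataclass
class FlowFile:
    """Toy stand-in for a FlowFile: attributes plus (possibly empty) content."""
    attributes: dict
    content: bytes = b""


def list_remote(remote_files):
    """Listing step: one zero-byte FlowFile per remote file, no content transferred."""
    return [FlowFile(attributes={"filename": name}) for name in sorted(remote_files)]


def fetch_remote(flowfile, remote_files):
    """Fetch step: use the attributes to pull the content for one listed file."""
    name = flowfile.attributes["filename"]
    return FlowFile(attributes=dict(flowfile.attributes), content=remote_files[name])


remote = {"a.csv": b"1,2,3", "b.csv": b"4,5,6"}
listed = list_remote(remote)
fetched = [fetch_remote(ff, remote) for ff in listed]

print([len(ff.content) for ff in listed])   # sizes after listing
print([len(ff.content) for ff in fetched])  # sizes after fetching
```

The point of the sketch is that the listing step only moves names around, so it is cheap, and the expensive content transfer can happen later and somewhere else.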

I know it sounds like we just replicated what GetSFTP processor does but split it between two processors, but there are key advantages to doing it this way.

1. The ListSFTP processor does maintain state across a NiFi cluster. So if you do not delete the source data, this processor will not pick up the same data multiple times like the GetSFTP processor will.

2. While the ListSFTP processor is still not cluster friendly, meaning it should be run on Primary Node only, the FetchSFTP processor is cluster friendly. The ListSFTP processor should be used to create the 0 byte FlowFiles, and a Remote Process Group then distributes these FlowFiles across your cluster. The FetchSFTP processor then runs on every node to retrieve the actual FlowFile content from the SFTP server.

[Screenshot: sample flow using ListSFTP, a Remote Process Group, and FetchSFTP]

*** As of the Apache NiFi 1.8 release, a new capability has been added to NiFi to facilitate easy redistribution of FlowFiles without needing to use an RPG. It can be done by simply configuring any connection to perform the redistribution. This blog explains that new capability very well: https://blogs.apache.org/nifi/entry/load-balancing-across-the-cluster ***


Advantage: The Primary node is no longer using excess resources writing all content to its content repository before redistributing the FlowFiles to all nodes in the cluster.

Advantage: Cluster-wide state allows the Primary node to switch within your NiFi cluster and the ListSFTP processor will still not list the same files twice.

Advantage: Being able to leave files on the SFTP server allows that data to be consumed by other end users/systems.

Disadvantage: Using an RPG to redistribute the ListSFTP-generated FlowFiles can be annoying since the remote input port the RPG sends to must exist at the root canvas level. So if your flow is nested down in a sub-process group, you must build a flow that feeds the load-balanced FlowFiles back down into that sub-process group.
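
The two ideas behind the advantages above, shared listing state and spreading the 0 byte FlowFiles across nodes, can be sketched as follows. Everything here is an assumption made for illustration: the shared dict `cluster_state`, the single last-seen modification timestamp, the helper names, and the round-robin assignment are simplifications and not necessarily how ListSFTP, an RPG, or a load-balanced connection behave internally.

```python
def list_new_files(remote_files, state):
    """List only files newer than the timestamp remembered in the shared state.

    remote_files maps file name -> modification timestamp (toy values).
    state is a dict shared by all nodes (hypothetical cluster-wide state).
    """
    last_seen = state.get("last_listed_timestamp", -1)
    new = sorted(name for name, mtime in remote_files.items() if mtime > last_seen)
    if new:
        state["last_listed_timestamp"] = max(remote_files[name] for name in new)
    return new


def distribute_round_robin(items, nodes):
    """Assign listed items to nodes in turn (one possible distribution strategy)."""
    assignment = {node: [] for node in nodes}
    for index, item in enumerate(items):
        assignment[nodes[index % len(nodes)]].append(item)
    return assignment


cluster_state = {}                      # survives a change of primary node
remote = {"a.csv": 100, "b.csv": 200}   # files are never deleted from the server

first = list_new_files(remote, cluster_state)    # listing on the original primary node
second = list_new_files(remote, cluster_state)   # listing after a new primary is elected
remote["c.csv"] = 300
third = list_new_files(remote, cluster_state)    # only the newly arrived file

assignment = distribute_round_robin(first + third, ["node1", "node2", "node3"])
print(first, second, third)
print(assignment)
```

Because the state lives outside any one node, the second listing returns nothing even though it runs "on a different primary node", and the files stay on the server for other consumers.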

---------------------------------------

Within NiFi you will find several other examples of processors that have been deprecated in favor of newer list/fetch-based processors.

Thank you,

Matt

Comments

Hey @Matt Clarke, if there is a better way to do this w/o RPG as you suggested in your answer over in https://community.hortonworks.com/questions/245373/nifi-cluster-listensmtp.html, would you have time to update this article to account for that? I point folks to this link all the time. Thanks!


@Lester Martin

Thank you for bringing this to my attention. I have updated the above article to include a link to a great blog written by a friend of mine about the new load-balanced connection capability introduced in Apache NiFi 1.8.

thanks
