
Using NiFi for transfers from multiple SFTP servers

Explorer

Hi guys,

I am currently evaluating whether NiFi can replace one or two of our tools (Talend OS/VisualCron).
Therefore I need to find out how NiFi can be used to transfer (mainly "move") files from one SFTP server to another. We currently have around 500 jobs in VisualCron doing this. There are 40-50 servers in use, and the jobs differ by file pattern and path.

The closest solution I have found so far would be a custom processor mentioned in this post. But it seems outdated and needs some fixes before it can be built successfully again.
Either I could store all server and job configurations in an external file (YAML/XML) to create flowfiles or, preferably, store this data directly in NiFi - but I am not sure how.
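
For illustration, the external-file idea could look something like this (the YAML layout, field names, and hosts are invented; in NiFi, each job record would then become one flowfile carrying these values as attributes):

import yaml  # PyYAML

jobs_yaml = """
jobs:
  - name: customer_a_daily
    source_host: sftp-a.example.com
    source_path: /out
    source_pattern: "report_.*[.]csv"
    dest_host: sftp-b.example.com
    dest_path: /in
"""

for job in yaml.safe_load(jobs_yaml)["jobs"]:
    # one flowfile per job; its attributes drive the generic transfer flow
    print(job["name"], job["source_host"], "->", job["dest_host"])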

The benefit I see in letting NiFi do the job instead of VisualCron is its queue system. This would make our transfers a lot more stable, which they are not at the moment.

Do you think NiFi would be the right choice for this, and can you help me find a suitable solution that does not involve creating 500 flows on one canvas? I want to have only one, since all the jobs are almost identical apart from some parametrization.

8 Replies

Community Manager

@mirkom Welcome to the Cloudera Community!

To help you get the best possible solution, I have tagged our NiFi experts @MattWho and @joseomjr, who may be able to assist you further.

Please keep us updated on your post, and we hope you find a satisfactory solution to your query.


Regards,

Diana Torres,
Community Moderator



Super Collaborator

If I get time, I might see how the current SFTP processors work. From the sounds of it, you'd want a processor that can handle several file patterns/paths per SFTP host to avoid having hundreds of flows.

Super Collaborator

Looks like a custom processor might not be needed, since the file and path properties allow regex, which means you could define multiple patterns:

[screenshot: ListSFTP processor configuration - joseomjr_0-1721686487337.png]
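
For example, a single regex in the file filter property could cover several file-name families at once (the pattern below is purely illustrative):

import re

# One file-filter-style regex matching two file-name families at once
file_filter = re.compile(r"invoice_\d{8}\.csv|export_.*\.xlsx")

for name in ("invoice_20240701.csv", "export_q2.xlsx", "readme.txt"):
    print(name, bool(file_filter.fullmatch(name)))
# invoice_20240701.csv True, export_q2.xlsx True, readme.txt False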


Explorer

It is not only different paths. We also have multiple hosts/credentials. Each customer usually has different environments on separate hosts, where we connect via SFTP to transfer some files (CSV/Excel). These files come from multiple source SFTP accounts (same server, but different credentials per customer) and sometimes also from third parties (other servers).

So what we also need is to set the host name and credentials dynamically, based on a job-specific configuration like:

SourceHost: SourceHost_A
SourceCreds: SourceCreds_A
SourcePath: SourcePath_A
SourceFilePattern: SourceFilePattern_A
DestHost: DestHost_A
DestCreds: DestCreds_A
DestPath: DestPath_A

These are the minimum configuration parameters that must be set dynamically; together they describe a file transfer from A to B. We have hundreds of such transfers running, involving ~300 different SFTP accounts.

Most of them are basic, without any need for changes to the data. Sometimes file names have to be changed, and of course more complex transformations are occasionally needed. But in the majority of cases it is just basic moving of files from A to B.

Master Mentor

@mirkom 

NiFi is a flow-based programming application. Processor configuration properties can get their values from parameter contexts, which might be useful for you here. Some processors can also get values through the NiFi Expression Language.
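
To illustrate the two reference styles as they would appear in a property value (the parameter and attribute names below are made up):

# A property value referencing a parameter from the parameter context
# assigned to the enclosing process group:
hostname_from_parameter = "#{SourceHost}"

# The same property using NiFi Expression Language to read a flowfile
# attribute; this only works on properties that support EL and on
# processors that actually receive flowfiles (not source processors):
hostname_from_attribute = "${sftp.source.host}"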

NiFi is designed as an "always on" application, with its dataflows running on the available scheduling strategies. Source processors (those with no inbound connections) need to have a valid configuration in order to start, meaning the properties required to execute must at a minimum be available to the processor. So for a source processor, the only way to have those values is if they are set directly on the processor or pulled from a parameter context.

In NiFi you can create a Process Group (PG) and then build a reusable dataflow within it (from your description, it sounds like you only need a few different flow designs to meet your use cases).

For your reusable dataflow, you should use ListSFTP connected to FetchSFTP to ingest data. On a process group you can configure/assign a "parameter context". A parameter context holds the unique configuration values for each of your source and destination hosts, so you would have 500 different parameter contexts. You can then copy your PG many times and simply assign a different parameter context to each copy, which makes dataflow development a bit easier.
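
Creating 500 parameter contexts by hand would be tedious, but they can also be scripted against NiFi's REST API. A minimal sketch, assuming an unsecured NiFi at localhost:8080 (job names, hosts, and parameter values are placeholders; add auth/TLS for a secured cluster):

import requests

NIFI_API = "http://localhost:8080/nifi-api"

jobs = [
    {"name": "job_customer_a",
     "params": {"SourceHost": "sftp-a.example.com",
                "SourcePath": "/out",
                "DestHost": "sftp-b.example.com",
                "DestPath": "/in"}},
    # ...one entry per transfer job...
]

for job in jobs:
    entity = {
        "revision": {"version": 0},
        "component": {
            "name": job["name"],
            "parameters": [
                {"parameter": {"name": k, "sensitive": False, "value": v}}
                for k, v in job["params"].items()
            ],
        },
    }
    # POST /parameter-contexts creates one parameter context per job
    requests.post(f"{NIFI_API}/parameter-contexts", json=entity).raise_for_status()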

So building out in this way makes expansion easier, but it still requires some work.

Also keep in mind that you have many use cases where you are simply moving content from SFTP server A to SFTP server B. When you use NiFi for this, you are ingesting all that content into your NiFi cluster and then writing it back out to another SFTP server. This adds some overhead in read and write operations versus a direct transfer between A and B without the local write that NiFi performs.

What NiFi lets you do is manage all these dataflows through the NiFi UI. NiFi also allows you to scale out by adding more nodes to the cluster as workload and volume increase, without needing to modify your dataflows (assuming they are built well, with data distribution implemented in the designs).


Thank you,
Matt

Explorer

Copying a PG 500 times just to apply different parameter contexts does not sound like an ideal solution. Also, having all those workflow-identical PGs on a single canvas would make it hard to stay on top of things.
But as far as I understand, that is the only solution in NiFi.

Master Mentor

@mirkom 

As for the post you referred to in your original question: it is not accurate. GetSFTP does NOT accept an inbound connection. The only SFTP ingest processor that accepts an inbound connection is FetchSFTP (which is the processor that the other query was actually referring to). I also can't speak to the customized version of the ListSFTP processor built in that other thread.

Thanks,
Matt

Explorer

I referred to the other article because of the custom ListSFTP processor. It would make it possible to use only one PG with the workflow "move file from A to B": read the job configuration with any other processor, put the config values on a flowfile, and pass it to this custom ListSFTP processor.
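
For what it's worth, the fetch side can already be driven this way: FetchSFTP's Hostname, Username, and Remote File properties support Expression Language (and, if I recall correctly, the Password property does as well), so one set of property values can serve every job; only the listing side would need the custom processor. A sketch of the mapping (attribute names are invented):

# Attributes a config-reading processor would put on each flowfile:
job_attributes = {
    "src.host": "sftp-a.example.com",
    "src.user": "transfer",
    "src.file": "/out/report_20240701.csv",
}

# FetchSFTP property values, set once and reused for every job:
fetchsftp_properties = {
    "Hostname":    "${src.host}",
    "Username":    "${src.user}",
    "Remote File": "${src.file}",
}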