
ListSFTP and FetchFTP duplicate files generated


Hi,

We have a process that generates text files on an FTP server. The process writes each file in multiple passes, and generating a file can take anywhere from about a minute to several minutes.

 

In this case, NiFi's ListSFTP lists the same file multiple times because we are using the Tracking Timestamps listing strategy: the file's timestamp keeps getting updated while the file is being generated.

 

We don't have control over the existing process that generates the files, and it has no notification mechanism that can tell us when file generation is complete.

 

 

What would be the best practice for handling this scenario in a List+FetchSFTP flow, so that we avoid generating duplicate files in the target system?

 

Please advise.

 

Thank you


4 REPLIES

Super Mentor

@littlesea374 

 

The ListSFTP processor's only mechanism for ignoring files in the listed directory is to skip files marked as hidden (names starting with a dot on Linux-based systems). The dot-rename pattern for file transfers is pretty common with SFTP.

Now, if the files are being streamed into the SFTP server by another process that does not use some form of dot-rename or filename change, you would need the new feature added to ListSFTP as part of https://issues.apache.org/jira/browse/NIFI-5977

 

This new feature is part of Apache NiFi 1.10, which adds a couple of new configuration properties to the ListSFTP processor. "Minimum File Age" is the one you would use: only files whose last-modified time is older than this configured value will be listed.
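For example, a configuration along these lines (the values are illustrative, not prescriptive; tune the age to your longest observed write time) would hold back files that are still being written:

    ListSFTP (NiFi 1.10+)
      Listing Strategy : Tracking Timestamps
      Minimum File Age : 5 min   (only files whose last-modified time is
                                  at least 5 minutes old get listed)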

Hope this helps,

Matt


Thank you @MattWho. In case we are not able to upgrade to 1.10 or backport the feature to our current version, would the Wait processor be an acceptable direction to pursue?

Super Mentor

@littlesea374 

 

The Wait processor requires a release signal that is typically created by the Notify processor. Since your file-generating process has no way to produce such a signal, that really will not help here.

Perhaps you could try setting a penalty on each FlowFile. Penalized FlowFiles are not processed by the follow-on processor until the penalty duration has elapsed. This can be done using an ExecuteScript processor after ListSFTP:

[Screenshot: ListSFTP connected to an ExecuteScript processor]

 

You then set the "Penalty Duration" via the processor's Settings tab. Set the penalty high enough to ensure the file writes have completed. Of course, this does introduce some latency.
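The script body itself can be minimal. A sketch in Python (using ExecuteScript's Jython engine), relying on the session and REL_SUCCESS objects the processor binds for you:

    # Penalize each incoming FlowFile and pass it along unchanged
    flowFile = session.get()
    if flowFile is not None:
        # session.penalize() returns the FlowFile with the penalty applied;
        # the duration used is the Penalty Duration from the Settings tab
        flowFile = session.penalize(flowFile)
        session.transfer(flowFile, REL_SUCCESS)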

What this will not help with is ListSFTP still listing the same files multiple times. As data is written, the source file's timestamp updates, which means the file gets listed again as if it were new.

But the delay allows the full data to be written, and then perhaps you can use a DetectDuplicate processor to remove duplicates based on filename before you actually fetch the content.
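A rough sketch of that DetectDuplicate configuration (values here are illustrative; size the Age Off Duration to cover how long duplicate listings can keep arriving):

    DetectDuplicate
      Cache Entry Identifier    : ${filename}
      Age Off Duration          : 10 min
      Distributed Cache Service : a DistributedMapCacheClientService
                                  (backed by a DistributedMapCacheServer)

Keying the cache on ${filename} means any later listing of the same file gets routed to the duplicate relationship instead of being fetched again.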

Just some thoughts here, but that Jira is probably the best path.

 

Matt


Hi @MattWho 

We are pursuing Cloudera for a backport of this feature into our current release, as we are unable to upgrade HDF to a version that has it.

 

Meanwhile, I am trying to implement the same thing using ExecuteScript (to penalize the FlowFiles) and then eliminate duplicates.

 

When trying to eliminate the duplicates, DetectDuplicate does not work as expected when two different filenames are introduced into the flow. Is this expected behaviour? How do we handle this scenario to get a unique list of multiple files over a period of time?

[Screenshot: test flow with two GenerateFlowFile processors feeding DetectDuplicate]

 

The first GenerateFlowFile sets the filename to file1.txt and the second GenerateFlowFile sets the filename to file2.txt.

 

When I disable one of the GenerateFlowFile processors, it filters the duplicates as expected. But when both GenerateFlowFile processors are running, we get duplicates.

 

[Screenshot: the resulting duplicates]

 

Thank you