Created 10-19-2021 07:56 AM
I am trying to use a ListSFTP processor to delete files from an SFTP site that are older than the most recent "History" file. To do this, I need to continually (the process would run daily or weekly) query all the files on the SFTP server to see whether a new "History" file was added, and if so, remove all files from the site older than that newest "History" file. The problem is that ListSFTP only lists newly added files, so I am unable to do this. I can use Clear State on the processor to make it work, but I would like to avoid this being a manual process. So how can I have ListSFTP list all the files every single time it runs?
Note: I've seen some posts about automatically clearing the state using a curl request, but this only works on a stopped processor, which defeats the purpose of this being an automated process (https://community.cloudera.com/t5/Support-Questions/Nifi-clear-state/td-p/242347).
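For reference, this is the kind of REST call those posts describe. A minimal sketch using Python's requests library; the NiFi URL and processor ID are placeholders, and as noted above the call only succeeds while the processor is stopped:

```python
# Sketch of the state-clearing REST call mentioned above. The NiFi URL and
# processor ID below are hypothetical placeholders; the call only works
# while the target processor is stopped.
import requests

NIFI = "http://localhost:8080/nifi-api"  # hypothetical NiFi endpoint
PROCESSOR_ID = "your-listsftp-processor-id"  # hypothetical processor id

# POST /processors/{id}/state/clear-requests clears the processor's state.
resp = requests.post(f"{NIFI}/processors/{PROCESSOR_ID}/state/clear-requests")
resp.raise_for_status()
```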
Created 10-19-2021 01:55 PM
It is not clear to me how you have designed your dataflow to remove all files from the source SFTP server except the newest file. Assuming state were not an issue (since you said your flow works if you manually clear state), how is your flow built?
There is a GetSFTP processor that does not maintain state. So you could build a flow that uses ListSFTP and FetchSFTP to always fetch the newest "History" file and record that latest "History" file's last-modified timestamp in something like a DistributedMapCache server. Then have a GetSFTP processor run once a day using the "CRON driven" scheduling strategy to retrieve all files in that directory (with Delete Original = false), which would also pick up the latest "History" file. Look up the currently stored last-modified time from the map cache, and via RouteOnAttribute route any FlowFiles whose last-modified timestamp is older than the stored value to a processor that removes them from the source SFTP server.
While the above would work in an "ideal" world, you could run into issues when an interruption in the running dataflow causes multiple new files to be listed by the ListSFTP processor, because you would not know which one ended up having its last-modified timestamp stored in the DistributedMapCache. Even then, the worst case is a few files lingering until the next run, when only one "History" file is listed and things return to expected behavior.
Otherwise, there are script-based processors you could use to build your own scripted handling here. To be honest, it seems like wasted I/O to have NiFi pull these files into NiFi just to auto-terminate them, when you could use an ExecuteStreamCommand processor to invoke a script that connects to your SFTP server and simply removes what you do not want, without pulling anything across the network or writing file content to NiFi that you don't need (see the sketch below).
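As a rough illustration only, here is a minimal Python sketch of such a cleanup script, assuming the paramiko library, password authentication, and that "History" files can be identified by a filename prefix (all assumptions on my part, since I don't know your naming convention). ExecuteStreamCommand would invoke it on your daily or weekly schedule:

```python
#!/usr/bin/env python3
# Minimal sketch of an SFTP cleanup script that ExecuteStreamCommand could
# invoke. Assumptions: paramiko is installed, password auth is used, and
# "History" files share a hypothetical filename prefix.
import stat

import paramiko

HOST = "sftp.example.com"   # hypothetical connection details
USER = "nifi"
PASSWORD = "secret"
REMOTE_DIR = "/data/incoming"
HISTORY_PREFIX = "History"  # hypothetical naming convention

ssh = paramiko.SSHClient()
ssh.set_missing_host_key_policy(paramiko.AutoAddPolicy())
ssh.connect(HOST, username=USER, password=PASSWORD)
sftp = ssh.open_sftp()
try:
    # Only consider regular files, not subdirectories.
    entries = [e for e in sftp.listdir_attr(REMOTE_DIR)
               if stat.S_ISREG(e.st_mode)]
    history = [e for e in entries if e.filename.startswith(HISTORY_PREFIX)]
    if history:
        # The newest "History" file's last-modified time is the cutoff.
        cutoff = max(e.st_mtime for e in history)
        for e in entries:
            # Remove every file strictly older than the newest History file.
            if e.st_mtime < cutoff:
                sftp.remove(f"{REMOTE_DIR}/{e.filename}")
finally:
    sftp.close()
    ssh.close()
```

Since the script compares everything against the newest "History" file's timestamp on every run, it is idempotent and needs no stored state at all, which sidesteps the ListSFTP state problem entirely.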
Hopefully this gives you some options to think about.
If you found this response helpful, please take a moment to log in and click "Accept as Solution" below this post.
Thank you,
Matt