Created 10-24-2017 05:16 PM
We have been having a problem where a ListSFTP processor in NiFi isn't producing a flowfile for all new files as expected. The ListSFTP processor is configured to run every 10 minutes and is pointed to an external SFTP server where new files are dropped daily. Everything has been working as expected until recently when a few of the file types that we regularly download were not picked up by the ListSFTP. I have a couple questions to help me understand what may be going on here:
1. Does NiFi only look at the "Last Modified" timestamp on the remote file and compare it to the timestamp stored in the processor's state (as shown under "View State") to determine if a file is "new"? (In other words, it doesn't have anything to do with whether the filename has been seen before.)
2. Could this situation be caused by a difference between the "Last Modified" date on the new files and when they actually show up in the SFTP listing? I believe there are cases where a file doesn't show up until a few minutes after its Last Modified date. For example, the Last Modified date is 4:18 but the file doesn't show up in the listing until 4:20.
3. If this is actually what is happening, could it be fixed by changing settings on the SFTP processor?
Created 10-24-2017 05:57 PM
Yes, the ListSFTP processor only looks for new files created after the timestamp that the processor has stored in its state.
The state value is the maximum timestamp of the files created in that directory.
Example:
Let's assume the ListSFTP processor has listed all the files in the directory up to 4:10. The processor is scheduled to run every 10 minutes, so the next run is at 4:20.
If new files (test1.txt, test2.txt) are created at 4:11, these files will only be listed in the 4:20 run (because the processor runs every 10 minutes), and the processor then updates its state with the 4:11 timestamp. (You can view the state by right-clicking the processor and clicking View State.)
Although the files were created at 4:11, they are still only listed in the 4:20 run, because in that run the processor checks for new files created after the state value.
If you configure the processor to run more frequently, i.e. at an interval shorter than 10 minutes, it will look for newly created files more often.
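To make the example above concrete, here is a rough sketch of a timestamp-based listing in Python. It is only a simplified model of the behavior described in this post, not NiFi's actual code; the function name list_new_files, the helper at, and the file old.txt are made up for illustration.

```python
from datetime import datetime


def list_new_files(remote_files, state):
    """Simplified model of a timestamp-based listing (not NiFi's real code).

    remote_files: dict mapping file name -> timestamp seen on the server.
    state: newest timestamp recorded by the previous run, or None on the first run.
    Returns (names listed by this run, updated state).
    """
    # A file counts as "new" only if its timestamp is later than the stored state.
    listed = sorted(
        name for name, ts in remote_files.items()
        if state is None or ts > state
    )
    # The state moves forward to the newest timestamp seen so far.
    timestamps = list(remote_files.values())
    if state is not None:
        timestamps.append(state)
    new_state = max(timestamps) if timestamps else None
    return listed, new_state


def at(hour, minute):
    """Hypothetical helper: a time on the day used in this thread."""
    return datetime(2017, 10, 24, hour, minute)


# Everything up to 4:10 has already been listed; two files arrive at 4:11.
remote = {"old.txt": at(4, 10), "test1.txt": at(4, 11), "test2.txt": at(4, 11)}

listed_0420, state = list_new_files(remote, at(4, 10))   # the 4:20 run
listed_0430, state_after = list_new_files(remote, state)  # the 4:30 run

print(listed_0420, state)
print(listed_0430)
```

In this model the 4:20 run lists test1.txt and test2.txt and moves the state to 4:11, and the 4:30 run lists nothing because no file has a timestamp later than 4:11.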
Created 10-24-2017 08:06 PM
Thank you, can you say a little more about what "got created at 4:11" means for the ListSFTP processor? If someone put the files on the FTP server at 4:11, but the last modified date of the files is earlier than that (say 4:00), would ListSFTP never create flowfiles for them?
Created 10-24-2017 08:34 PM
@Karl Fredrickson, what I mean by "4:11" is the file's creation timestamp in the directory.
For example:
bash# hdfs dfs -ls /user/yashu/test_fac/
Found 1 items
-rwxr-xr-x 3 hdfs hdfs 8 2017-10-24 04:11 /user/yashu/test_fac/000000_0
In this example, the file 000000_0 was created at 2017-10-24 04:11 (its timestamp).
But the processor runs at 4:20, which means the 000000_0 file is going to be listed in the 4:20 run.
What if the last modified date is earlier (say 4:00) but someone put the files on the server at 4:11?
Then ListSFTP won't create flowfiles for them (assuming the stored state value is already later than that last modified date), because it only pulls new files whose timestamp is later than the state value.
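If you want to check whether this is what happened to your missing files, a comparison along these lines may help. This is only a sketch under the assumption that the processor compares each file's timestamp with the stored state value as described above; the function find_skipped, the helper at, and the sample file names are hypothetical, and you would have to fill in the real remote listing and the state value from View State yourself.

```python
from datetime import datetime


def find_skipped(remote_files, already_listed, state):
    """Return files a timestamp-only listing would never pick up.

    remote_files: dict mapping file name -> last modified time on the server.
    already_listed: set of file names that already produced flowfiles.
    state: the timestamp currently stored in the processor's state.
    """
    skipped = []
    for name, ts in sorted(remote_files.items()):
        # Not listed yet, and not newer than the state: it will not be seen as "new".
        if name not in already_listed and ts <= state:
            skipped.append(name)
    return skipped


def at(hour, minute):
    """Hypothetical helper: a time on the day used in this thread."""
    return datetime(2017, 10, 24, hour, minute)


state = at(4, 19)  # value read from View State (assumed for this example)
remote = {
    "a.txt": at(4, 19),    # already listed in an earlier run
    "late.txt": at(4, 0),  # put on the server at 4:11 with an older last modified time
    "new.txt": at(4, 25),  # newer than the state, will be listed by the next run
}
already_listed = {"a.txt"}

skipped = find_skipped(remote, already_listed, state)
print(skipped)
```

Here only late.txt is reported, since its last modified time is older than the state even though it arrived on the server later.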
Created 10-24-2017 08:33 PM
The normal behavior of the ListSFTP processor is that the first listing, when it has no state yet, returns all current files in the remote directory. Subsequent listings return all new files written since the last timestamp recorded in the processor's state, except for the latest one or two files. Those one or two files are picked up in the next listing, along with any additional new files based on the updated state, again minus the latest one or two files, and so on as the processor runs.
The ListSFTP processor doesn't use the file name in any way.
If you want files to be listed closer to the time they are being written to the directory, then set the processor to run more often than every 10 minutes.