Created on 11-01-2017 09:44 AM - edited 08-18-2019 01:06 AM
Trying to get files from an FTP server with ListFTP/FetchFTP, but these specific processors are so confusing to me.
I have scheduled ListFTP to fire both with cron and timer-driven every 10 secs or so, but even though it shows a task is executed in the ListFTP processor, no flowfiles come out of it! GetFTP works fine, but I want to implement it with List/Fetch to get only the new files in the dir.
Initially I scheduled the flow with cron at 03:00 last night, when I knew a new file would become available on the FTP server around 1. However, this morning I saw that nothing was transferred to HDFS. So I changed the scheduling to every 60 secs to test it right away and guess what, I got the new file! So I thought cron was the problem, deleted the test file from HDFS and scheduled the flow to run at 10:00 this morning. As expected a task was created, but no flowfiles were passed to FetchFTP.
I then switched back to timer-driven scheduling to get the file I deleted back into HDFS from the FTP server, but this time no flowfiles are created even with this scheduling option. What is going on here guys?
Created 11-01-2017 10:18 AM
Hi @balalaika
What's your NiFi version? This is a known issue resolved in NiFi 1.2: https://issues.apache.org/jira/browse/NIFI-3213
Even after this fix, there are several corner cases that the initial design of the List* processors tries to avoid. For instance, a file that is still being written when List fires should not be listed. Also, files created just a microsecond after List runs can be missed if the source system only supports timestamps with seconds granularity. So it's a tradeoff between missing some files and delaying ingestion. The List processor keeps only the timestamp of the last file ingested and not the list of files (for scalability reasons).
To deal with these situations, the last few files are not listed and are kept for the next run. The design has been improved in NiFi 1.4, check it out: https://issues.apache.org/jira/browse/NIFI-4069 and https://issues.apache.org/jira/browse/NIFI-3332
Try to investigate based on this information and see where your problems come from. What you can also do is use a cron that runs at 3:00 and 3:05 to get files missed the first time (assuming your data comes every 24 hours). If you have seconds precision in your listing, then try the new processors in NiFi 1.4, which add a property "Target System Timestamp Precision".
Keep in mind that another corner case is not handled today: a file written at time T with a timestamp T2, where T2 < T: https://issues.apache.org/jira/browse/NIFI-2383
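To make the hold-back behaviour described above a bit more concrete, here is a toy Python model. It is only a sketch of the idea and not NiFi's actual code: the names `RemoteFile`, `ToyLister` and `run` are made up for illustration, timestamps are assumed to be whole seconds, and the rule "emit the newest files only once a later run sees the same newest timestamp" is a simplified reading of the behaviour, not the exact algorithm.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RemoteFile:
    name: str
    mtime: int  # last-modified time in whole seconds (assumed granularity)


class ToyLister:
    """Hypothetical, simplified model of a timestamp-based List processor."""

    def __init__(self):
        # Only timestamps are stored, not the list of file names.
        self.last_emitted_mtime = -1
        self.held_mtime = None  # newest timestamp held back on a previous run

    def run(self, files):
        # Only files strictly newer than the last emitted timestamp qualify.
        new = [f for f in files if f.mtime > self.last_emitted_mtime]
        if not new:
            return []
        newest = max(f.mtime for f in new)
        if newest == self.held_mtime:
            # Nothing newer appeared since the previous run: treat as settled.
            emit = new
        else:
            # Hold back the newest files, they might still be written to.
            emit = [f for f in new if f.mtime < newest]
            self.held_mtime = newest
        if emit:
            self.last_emitted_mtime = max(f.mtime for f in emit)
        return sorted(f.name for f in emit)


lister = ToyLister()
files = [RemoteFile("a.csv", 100)]
print(lister.run(files))  # [] : the only new file is held back
print(lister.run(files))  # ['a.csv'] : emitted on the second run
files.append(RemoteFile("b.csv", 100))  # written in the same second as a.csv
print(lister.run(files))  # [] : b.csv is never listed in this model
```

In this model a single new file needs two runs before it comes out, which would be consistent with a once-a-day cron appearing to do nothing while a 60-second timer picks the file up. The last call shows the seconds-granularity miss: a file that shares its timestamp with an already emitted one is skipped because only the timestamp is remembered.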
Created on 11-01-2017 01:48 PM - edited 08-18-2019 01:06 AM
Hi @Abdelkrim Hadjidj, thanks for the reply. My NiFi version is 1.1.0, so what you say makes sense.
My flows are not so time-sensitive, meaning I can delay the ingestion for a couple of hours, but I want to understand the operations a bit better:
This is the timestamp in the FTP server of the last file transferred by this NiFi flow (via Data Provenance)
Now, if I schedule the ListFTP processor to fire e.g. today at 15:00, I would expect the file to be picked up with no problem. Does this bug mean that the file would never be picked up as long as it is the last modified file in this location? In other words, does ListFTP/HDFS/whatever perform a listing only if it sees files in the directory with a more recent timestamp than the last one transferred? Also, you mention scheduling cron for 3:00 and a bit later; is there an option to have 2 scheduling plans for one processor?
As far as I know, with cron you can only say something like: run this every 5 mins of that hour or so. Thanks in advance!
Created 11-01-2017 02:07 PM
Hi @balalaika
Yes, you can use a single cron expression to run it twice per day. The following example will run at 3:00 AM and 3:05 AM each day.
0 5,0 3 * * ?
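If the field order is unclear: the expression reads seconds, minutes, hours, day of month, month, day of week. As a rough illustration, this small Python sketch expands the first three fields into the daily fire times. The helper `daily_fire_times` is hypothetical and only understands plain numbers and comma-separated lists, not the full cron syntax (no ranges, steps or wildcards in those three fields).

```python
def daily_fire_times(expr):
    """Expand the seconds/minutes/hours fields of a cron expression.

    Simplified sketch: only plain numbers and comma lists are supported,
    and the day/month fields are ignored (assumed to mean "every day").
    """
    seconds, minutes, hours = expr.split()[:3]
    times = sorted(
        (int(h), int(m), int(s))
        for h in hours.split(",")
        for m in minutes.split(",")
        for s in seconds.split(",")
    )
    return [f"{h:02d}:{m:02d}:{s:02d}" for h, m, s in times]


print(daily_fire_times("0 5,0 3 * * ?"))  # ['03:00:00', '03:05:00']
```

So the order of `5,0` in the minutes field does not matter, both 3:00 and 3:05 are covered.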
Regarding the logic of the processor in previous versions, my understanding is that the processor will get the data if it is fired again, since it holds back the most recent files. However, as you are on an old version, I am not sure what the design was in that version.
Please try the cron expression and see if it helps. If you can, upgrade to at least NiFi 1.2; several bugs have been fixed.
If you found that this answer addressed your question, please take a moment to click "Accept" below.
Created 11-01-2017 03:14 PM
Hi @Abdelkrim Hadjidj, for now I will implement it with GetFTP. NiFi is in the service provider's network and I cannot upgrade at will 😞
Do you maybe know a way to tell GetFTP not to download files that have already been downloaded in the past, to avoid unnecessary buffers?