Created 05-25-2017 02:26 PM
I am running a ListHDFS processor pointing to a directory on HDFS, on a timer-driven schedule set to execute once per hour. After making sure the state on the processor is clear, I run it and see that it creates a flowfile for all but one file in the directory: there are 5 files in the directory, but only 4 flowfiles are created. If I add more files, clear the state, and run again, the pattern repeats; one flowfile is always missing, so one file is skipped. It is not the same file that is missed on each run.
Why is the processor missing 1 file each time? Is this by design?
This is in HDF 2.1.0.1 and Apache NiFi - Version 1.1.0.2.1.0.1-1
Created 05-25-2017 02:55 PM
Does the one file that is being left behind have the most recent timestamp of all the files in the directory?
NiFi records state based on the timestamp of the most recently listed file. The problem that can occur is that if multiple files are being written into the target location at the same time, they may not all make it into the listing being performed. If NiFi recorded that latest timestamp in state anyway, the files it missed would never be listed on a later run and would never get fetched. So the processor lists all files except those sharing the most recent timestamp; in most cases that holds back only 1 or 2 files, and they are picked up on the next execution. This ensures that all files eventually get listed, even when the clocks on your NiFi server and the target HDFS servers differ.
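For illustration, the selection logic described above works out to roughly the following minimal sketch (this is not the actual ListHDFS source; the FileEntry class and method names are hypothetical and exist only to show the idea):

    import java.util.ArrayList;
    import java.util.List;

    public class ListingSketch {

        // Hypothetical stand-in for an HDFS file listing entry.
        record FileEntry(String path, long modifiedTime) {}

        // Returns the entries that would be emitted as flowfiles on this run.
        static List<FileEntry> selectForListing(List<FileEntry> directory, long lastListedTimestamp) {
            // The most recent modification time seen in this listing.
            long maxTimestamp = directory.stream()
                    .mapToLong(FileEntry::modifiedTime)
                    .max()
                    .orElse(Long.MIN_VALUE);

            List<FileEntry> selected = new ArrayList<>();
            for (FileEntry entry : directory) {
                boolean newerThanState = entry.modifiedTime() > lastListedTimestamp;
                boolean hasNewestTimestamp = entry.modifiedTime() == maxTimestamp;
                // Files sharing the newest timestamp are held back until the next run,
                // in case more files with that same timestamp are still being written.
                if (newerThanState && !hasNewestTimestamp) {
                    selected.add(entry);
                }
            }
            return selected;
        }
    }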
Please let us know if you are seeing different behavior.
Thanks,
Matt
Created 05-25-2017 02:50 PM
Is there any pattern to the file that is missed? Is it always the one with the latest modification time of all the files in the directory?
You can turn on DEBUG logging for org.apache.nifi.processors.hadoop.ListHDFS by editing logback.xml, which should give you some additional information that might be helpful.
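For example, the following logger entry in NiFi's conf/logback.xml (logger name as mentioned above; standard logback syntax) enables that output, which should then appear in logs/nifi-app.log:

    <logger name="org.apache.nifi.processors.hadoop.ListHDFS" level="DEBUG"/>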
Created 05-25-2017 03:24 PM
As mentioned in Matt's comment above, yes, the file left behind always has the latest timestamp.
Created 05-25-2017 03:23 PM
Yes, the one that is left behind is the most recently generated file, and it gets picked up on the second run. My use case was to get a listing of all the files in an HDFS directory at a given moment. GetHDFS provides that functionality, but with the inefficient overhead of bringing the actual files into NiFi; I was hoping to get just the list of files with ListHDFS. I'm thinking I might look into ExecuteStreamCommand to generate the list with hdfs dfs -ls and parse that output.
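For anyone trying that workaround, an ExecuteStreamCommand configuration along these lines might work (the property names are the processor's standard ones; the hdfs binary path and target directory are placeholders to adjust for your environment):

    Command Path:        /usr/bin/hdfs
    Command Arguments:   dfs;-ls;/path/to/directory
    Argument Delimiter:  ;

The command's stdout becomes the content of the flowfile routed to the output stream relationship, which can then be split and parsed downstream.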
Created 05-25-2017 04:10 PM
There was a recent change to ListFile that addresses this exact behavior.
https://issues.apache.org/jira/browse/NIFI-3213
An Apache Jira could be opened asking that the same change be applied to ListHDFS as well.
Thanks,
Matt
Created 05-25-2017 05:35 PM
Created this: