Nifi ListHDFS missing 1 file per poll.


I am running a ListHDFS processor pointed at a directory on HDFS, on a timer-driven schedule set to execute once per hour. After making sure the state is clear on the processor, I run it and see that it creates a flowfile for all but 1 file in the directory: there are 5 files in the directory, and only 4 flowfiles are created. If I add more files, clear the state, and run again, the pattern repeats: one fewer flowfile is always created, so one file is missed. It is not the same file that is missed on each run.

Why is the processor missing 1 file each time? Is this by design?

This is in HDF 2.1.0.1 and Apache NiFi - Version 1.1.0.2.1.0.1-1

1 ACCEPTED SOLUTION

Super Mentor
@Max Evers

Does the 1 file that is being left behind have the most recent timestamp of all files consumed?

NiFi records state based on the timestamp of the most recent file listed. The problem is that if multiple files are being written to the target location at the same time, they may not all make it into the listing being performed. If NiFi recorded that timestamp in state, those other files would not be listed on the next run and would never get fetched. So the idea is to list all files except those with the latest timestamp; in most cases, only 1 or 2 files are held back. What ends up being listed is everything except the files that share the most recent timestamp. This ensures that all files get listed on the next processor execution, even when the time differs between your NiFi server and the target HDFS servers.
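
To make the hold-back behavior concrete, here is a minimal Python sketch of the idea described above. It is a simplified model, not NiFi's actual implementation: the names FileEntry, ListingState, and list_once are hypothetical, and the rule that a held-back file is released once no newer file has appeared is an assumption based on the behavior reported in this thread.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass(frozen=True)
class FileEntry:
    path: str
    mtime: int  # modification time, e.g. epoch milliseconds


@dataclass(frozen=True)
class ListingState:
    # Newest modification time that has already been emitted as a flowfile.
    last_emitted: int = -1
    # Newest modification time seen in the previous listing (hypothetical field).
    last_seen_latest: Optional[int] = None


def list_once(entries: List[FileEntry],
              state: ListingState) -> Tuple[List[FileEntry], ListingState]:
    """Run one listing pass and return (emitted files, new state)."""
    # Only files newer than what was already emitted are candidates.
    candidates = [e for e in entries if e.mtime > state.last_emitted]
    if not candidates:
        return [], state

    latest = max(e.mtime for e in candidates)
    if latest == state.last_seen_latest:
        # Assumption: nothing newer arrived since the last pass, so the
        # files that were held back are now considered safe to emit.
        emitted = candidates
    else:
        # Hold back every file that shares the newest timestamp.
        emitted = [e for e in candidates if e.mtime < latest]

    new_last = max((e.mtime for e in emitted), default=state.last_emitted)
    emitted = sorted(emitted, key=lambda e: e.mtime)
    return emitted, ListingState(last_emitted=new_last, last_seen_latest=latest)


# Five files with distinct timestamps, as in the question.
files = [FileEntry(f"/data/f{i}", t) for i, t in enumerate([100, 200, 300, 400, 500])]
first, state1 = list_once(files, ListingState())
second, state2 = list_once(files, state1)
print(len(first), [e.path for e in second])
```

With five files and cleared state, the first pass in this model emits four files and holds back the newest one; the second pass emits the remaining file.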

Please let us know if you are seeing different behavior.

Thanks,

Matt


6 REPLIES

Master Guru

Is there any pattern to the file that is missed? Does it always have the latest modification time of all the files in the directory?

You can turn on DEBUG logging for org.apache.nifi.processors.hadoop.ListHDFS by editing logback.xml and you should see some more information that might be helpful.
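
As a rough illustration, a logger entry along these lines could be added inside the existing configuration element of conf/logback.xml. This is a sketch using standard Logback syntax; the exact file location and whether a restart is needed depend on your installation.

```xml
<!-- Enable DEBUG output for the ListHDFS processor only -->
<logger name="org.apache.nifi.processors.hadoop.ListHDFS" level="DEBUG"/>
```

The extra messages should then appear in the NiFi application log.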


As I mentioned in reply to Matt's comment, yes, the one left behind always has the latest timestamp.


Yes, the one that is left behind is the latest generated file. The last file gets picked up on the second run. My use case was getting a listing of all the files in an HDFS directory at a given moment. GetHDFS provides that functionality, but with the inefficient overhead of bringing the actual files into NiFi. I was hoping to just get the list of files with ListHDFS. I'm thinking I might look into ExecuteStreamCommand to generate the list with an hdfs dfs -ls command and parse the output.
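
For the parsing step, a small Python sketch could look like the following. It assumes the usual eight-column hdfs dfs -ls output (permissions, replication, owner, group, size, date, time, path) preceded by a "Found N items" line; the function name parse_hdfs_ls and the sample listing are made up for illustration, and the script only parses text rather than invoking the command itself.

```python
from typing import List


def parse_hdfs_ls(output: str) -> List[str]:
    """Return the paths of regular files from `hdfs dfs -ls` text output."""
    paths = []
    for line in output.splitlines():
        # Split into at most 8 fields so that paths containing spaces stay intact.
        fields = line.split(None, 7)
        if len(fields) < 8:
            continue  # skips the "Found N items" header and blank lines
        if fields[0].startswith("d"):
            continue  # skip directories
        paths.append(fields[7])
    return paths


# Hypothetical sample output; real listings will differ.
SAMPLE = """Found 3 items
-rw-r--r--   3 nifi hdfs       1234 2017-01-10 12:34 /data/in/a.csv
drwxr-xr-x   - nifi hdfs          0 2017-01-10 12:00 /data/in/archive
-rw-r--r--   3 nifi hdfs         99 2017-01-10 12:35 /data/in/b with space.csv
"""

for path in parse_hdfs_ls(SAMPLE):
    print(path)
```

Because the listing comes from a single command invocation, nothing is held back, so this returns every file present at that moment, including files that may still be in the middle of being written.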

Super Mentor

@Max Evers

ListFile was recently changed to address this exact behavior.

https://issues.apache.org/jira/browse/NIFI-3213

An Apache Jira could be opened asking that the same change be applied to ListHDFS as well.

Thanks,

Matt
