Created 10-06-2021 12:50 PM
Hi,
I am looking to fetch only the new files added to a directory, exactly one time each; once a file has been picked up, it should not be picked up again by Apache NiFi. I want to schedule this process to run every 3 hours. Please provide a solution with screenshots of the properties you used, or tell me which processors you are using. I am a bit confused between ListFile, GetFile, and FetchFile, and which properties to use.
Any help with this issue will be greatly appreciated.
Thank You!
Created 10-07-2021 10:34 AM
Once it brings a file in, it won't bring it in again, because it saves the file's timestamp and then uses that to get only newer files as they are added, and so on.
Created 10-07-2021 02:40 PM
@CodeLa @SAMSAL
I want to point out that tracking timestamps will not always guarantee NiFi will consume all files from the input file directory depending on how they are being placed in that directory.
The ListFile processor looks at the last modified timestamp of each file. It then lists all files modified since the last recorded timestamp, which is stored in the NiFi state manager from the previous processor execution. On the first run there will be no state, and thus everything currently in the directory is listed.
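To make the timestamp idea concrete, here is a minimal Python sketch of that kind of listing. It is not NiFi's actual code; the function name `list_new_files` and the single stored timestamp are my own simplifications for illustration.

```python
import os
import tempfile

def list_new_files(directory, last_seen_mtime):
    """Return (new_paths, updated_state).

    last_seen_mtime is None on the first run, so every file is listed.
    Afterwards only files with a strictly newer mtime are listed.
    """
    entries = []
    for name in sorted(os.listdir(directory)):
        path = os.path.join(directory, name)
        if os.path.isfile(path):
            entries.append((os.path.getmtime(path), path))
    new = [(m, p) for m, p in entries
           if last_seen_mtime is None or m > last_seen_mtime]
    # Remember the newest timestamp we listed; keep the old state if nothing was new.
    newest = max((m for m, _ in new), default=last_seen_mtime)
    return [p for _, p in new], newest

with tempfile.TemporaryDirectory() as d:
    for name, mtime in [("a.txt", 1000), ("b.txt", 2000)]:
        path = os.path.join(d, name)
        open(path, "w").close()
        os.utime(path, (mtime, mtime))  # force a known last modified time
    first, state = list_new_files(d, None)
    second, state = list_new_files(d, state)
    print([os.path.basename(p) for p in first])   # ['a.txt', 'b.txt']
    print(second)                                 # []
```

The second call returns nothing because no file is newer than the stored timestamp, which is the "pick up each file only once" behavior.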
Now consider the scenarios below, which can prevent the above from listing all files:
If you are in such a scenario, ListFile offers a different "Listing Strategy" called "Tracking Entities", which also tracks filenames in a cache service. This allows it to still list files that may have an older timestamp.
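As a rough illustration of why tracking by name helps, here is a small Python sketch. It assumes an entity is keyed by filename and counted as changed when its timestamp or size differs; that rule, the function name `list_untracked`, and the plain dict standing in for the cache service are my assumptions, not NiFi's implementation.

```python
def list_untracked(entries, tracked):
    """entries: iterable of (name, mtime, size); tracked: dict name -> (mtime, size).

    Lists every entry that is unknown or changed and records it in `tracked`.
    """
    listed = []
    for name, mtime, size in entries:
        if tracked.get(name) != (mtime, size):
            listed.append(name)
            tracked[name] = (mtime, size)
    return listed

cache = {}  # stands in for the cache service
print(list_untracked([("new.csv", 2000, 10)], cache))
# ['new.csv']
print(list_untracked([("new.csv", 2000, 10), ("late.csv", 1500, 99)], cache))
# ['late.csv']  (listed even though its timestamp is older than new.csv)
```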
Another thing to consider is that ListFile may list the same file more than once. Consider this scenario:
If you are in such a scenario, you will want to make use of the "Minimum File Age" property. This property tells ListFile to ignore any file whose last modified timestamp, compared to the current time, is not at least the configured amount of time old (meaning the last modified timestamp has not changed for the configured amount of time). The configured time is arbitrary: use whatever length you need to be confident that the file write was complete.
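The age check described above can be sketched in a few lines of Python. This is only an illustration of the comparison; the helper name `old_enough` and the 60-second value are hypothetical, and NiFi's own check may differ in detail.

```python
import time

def old_enough(mtime, min_age_seconds, now=None):
    """True when the file has not been modified for at least min_age_seconds."""
    now = time.time() if now is None else now
    return (now - mtime) >= min_age_seconds

now = 10_000
print(old_enough(9_990, 60, now))  # False: modified 10 s ago, may still be written
print(old_enough(9_900, 60, now))  # True: untouched for 100 s
```

A file that is still being written keeps getting a fresh last modified time, so it keeps failing this check until the writer stops.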
Something else you need to consider applies if both of the following are true:
1. You are using a multi node NiFi cluster
2. The configured directory you are listing from is mounted to every node.
Since every node in a NiFi cluster executes the same dataflow, you want to prevent every node from listing the same files. In this scenario you would change the "Execution" configuration from "All nodes" to "Primary node" on ListFile and change "Input Directory Location" from "Local" to "Remote". Then you will want to set the "Load Balance Strategy" to "Round robin" on the connection between ListFile and FetchFile.
NOTE: Never set the Execution to "Primary node" on any processor that has an inbound connection. Only processors with no inbound connection should be considered for this execution configuration.
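To picture what the round robin setting does with a listing produced on a single node, here is a small Python sketch. The node names and the `round_robin` helper are hypothetical, and NiFi's real load balancing involves more than this simple rotation.

```python
from itertools import cycle

def round_robin(items, nodes):
    """Assign items to nodes in rotating order."""
    assignment = {node: [] for node in nodes}
    for item, node in zip(items, cycle(nodes)):
        assignment[node].append(item)
    return assignment

listing = ["f1", "f2", "f3", "f4", "f5"]      # listed once, on the primary node
print(round_robin(listing, ["node1", "node2"]))
# {'node1': ['f1', 'f3', 'f5'], 'node2': ['f2', 'f4']}
```

Each file is listed once but the fetch work is spread across the cluster, which is why the directory has to be mounted on every node.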
I know this is a lot to digest, but it is very important to be aware of to ensure success.
If you found this response assisted with your query, please take a moment to log in and click on "Accept as Solution" below this post.
Thank you,
Matt
Created 07-29-2024 09:13 AM
@varungupta
This is a ~3-year-old post with an already accepted answer. You are likely to get more responsive answers if you start a new thread. NiFi has also evolved considerably over the past 3 years.
Yes, Tracking Entities does not rely on timestamps to ensure listing of new FlowFiles, and it will help you here. NiFi grabbing only 1-2 of 20 is about more than just timestamps; I suspect that how the files are being moved into the consumption directory is also impacting you.
Tracking Timestamps is the default setup. It is the easiest to use and consumes the fewest resources, but it does not work for all use cases.
It is based on each file's last modified timestamp. When a listing is performed, it lists all files from the timestamp stored in processor state by the last execution up to the most recent file's last modified timestamp. A problem can occur if the last modified timestamp is not updated.
For example, some system writes to directory A on your local machine and, after the write completes, moves the file to directory B. With that atomic move, the file's timestamp is not updated. If the move does not happen fast enough, the file may get missed in the current listing. It is also possible that a moved file has an older last modified timestamp than another, smaller file that was moved more quickly to directory B. The timestamp stored in state would then be newer, resulting in the file with the older timestamp being ignored.
Tracking Entities was added as a solution to these types of problems.
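Here is a small worked example of the move scenario, as a Python sketch with made-up file names and timestamps. The `timestamp_listing` function is a simplified stand-in for timestamp-based listing, not NiFi's code.

```python
def timestamp_listing(files, state):
    """files: dict name -> last modified time. Lists files newer than `state`."""
    listed = sorted(n for n, m in files.items() if state is None or m > state)
    newest = max((files[n] for n in listed), default=state)
    return listed, newest

dir_b = {}
state = None

# small.csv finished writing at t=105 and was moved into directory B right away.
dir_b["small.csv"] = 105
listed, state = timestamp_listing(dir_b, state)
print(listed, state)   # ['small.csv'] 105

# big.csv finished writing at t=100, but its move only landed now.
# The move does not update the last modified time, so it is still 100.
dir_b["big.csv"] = 100
listed, state = timestamp_listing(dir_b, state)
print(listed, state)   # [] 105
```

big.csv is never listed because its timestamp (100) is older than the one already stored in state (105).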
Please help our community grow. If you found that any of the suggestions/solutions provided helped you with solving your issue or answering your question, please take a moment to log in and click "Accept as Solution" on one or more of them that helped.
Thank you,
Matt
Created 07-31-2024 07:41 AM
Thanks a lot Matt for the answer.