Created 10-06-2021 12:50 PM
Hi,
I want to fetch only the new files added to a directory, each exactly once; once a file has been picked up, it should not be picked up again in Apache NiFi. I want to schedule this process to run every 3 hours. Please provide a solution with screenshots of the properties you used, or tell me which processors you are using. I am a bit confused between ListFile, GetFile and FetchFile, and about which properties to use.
Any help with this issue will be greatly appreciated.
Thank You!
Created on 10-06-2021 01:27 PM - edited 10-06-2021 01:28 PM
Take a look at the NiFi ListFile and FetchFile processors. They work together. ListFile reads file metadata and keeps state based on the last modified date of the files it has already listed, so that only newly added files are listed. FetchFile takes the filename attribute from ListFile and fetches the content.
Hope that helps
Created 10-06-2021 09:39 PM
Hi samsal,
Thanks for the reply. Can you please share screenshots? I'm a bit confused about which properties to use in ListFile and FetchFile.
Created on 10-07-2021 09:46 AM - edited 10-07-2021 09:47 AM
You really don't need a screenshot because you are not changing many properties:
1- Create a ListFile processor and set the "Input Directory" to whatever directory you want to track.
2- Create a FetchFile processor and connect ListFile to it via the "success" relationship. Under the processor properties, keep the "File to Fetch" property set to "${absolute.path}/${filename}", since ListFile sets the path and the file name in those attributes. That is it.
After that, the content of the file is passed along via the success relationship and you can do whatever you want with it, just as if you were using GetFile. The difference is that ListFile keeps state of the latest file timestamp it grabbed, uses that to grab any new files added to the folder, updates the state to the new timestamp, and so on.
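To make the "File to Fetch" value a bit more concrete, here is a minimal Python sketch of how an expression like "${absolute.path}/${filename}" turns into a path. It is only an illustration, not NiFi code: the resolve helper and the attribute values are made up, and it handles plain attribute references only, not the functions the real Expression Language supports.

```python
import re


def resolve(template, attributes):
    """Replace each ${name} with attributes[name]; plain references only."""
    return re.sub(r"\$\{([^}]+)\}", lambda m: attributes[m.group(1)], template)


# Hypothetical attribute values of the kind ListFile writes on each FlowFile.
attributes = {"absolute.path": "/data/incoming", "filename": "report.csv"}

path = resolve("${absolute.path}/${filename}", attributes)
print(path)  # /data/incoming/report.csv
```

FetchFile evaluates the expression per FlowFile, so every listed file resolves to its own path.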
Created 10-07-2021 10:30 AM
Hi samsal,
Thanks for your help. I have used ListFile and then FetchFile. There is only one file in my directory, and I've set the Listing Strategy in ListFile to "Tracking Timestamps". When I executed the job, it brought the file in only once. I am confused: will it bring the same file only once, or every time I execute the job?
Created 10-07-2021 10:34 AM
Once it brings the file in, it won't bring it in again, because it saves the file's timestamp and then uses that to pick up only newer files, and so on.
Created 10-07-2021 11:06 AM
Got it. Thank you
Created 10-07-2021 02:40 PM
@CodeLa @SAMSAL
I want to point out that tracking timestamps will not always guarantee that NiFi consumes every file from the input directory; it depends on how the files are being placed in that directory.
The ListFile processor looks at the last modified timestamp on each file. It then lists all files modified since the last timestamp recorded in the NiFi state manager by the previous processor execution. On the first run there will be no state, and thus everything currently in the directory is listed.
Now consider the scenarios below, which can prevent the above from listing all files:
If you are in such a scenario, ListFile offers a different "Listing Strategy" called "Tracking Entities", which also tracks filenames in a cache service. This allows it to still list files that may have an older timestamp.
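A toy model may help show the difference between the two strategies. This is a simplified Python sketch, not how NiFi is implemented: the class names are hypothetical, timestamps are plain integers, and the real "Tracking Entities" strategy has more to it (a cache service and a tracking window, for example).

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class FileEntry:
    name: str
    mtime: int  # last modified time, arbitrary units


@dataclass
class TimestampLister:
    """Lists only files whose mtime is newer than the newest one seen so far."""
    last_seen: int = -1

    def list_new(self, directory):
        new = [f for f in directory if f.mtime > self.last_seen]
        if new:
            self.last_seen = max(f.mtime for f in new)
        return sorted(f.name for f in new)


@dataclass
class EntityLister:
    """Remembers (name, mtime) pairs, so a file with an older mtime is still listed once."""
    seen: set = field(default_factory=set)

    def list_new(self, directory):
        new = [f for f in directory if (f.name, f.mtime) not in self.seen]
        self.seen.update((f.name, f.mtime) for f in new)
        return sorted(f.name for f in new)


directory = [FileEntry("a.csv", 100), FileEntry("b.csv", 200)]
by_time, by_entity = TimestampLister(), EntityLister()
first_time = by_time.list_new(directory)      # both files listed
first_entity = by_entity.list_new(directory)  # both files listed

# A file arrives later but carries an older last modified time.
directory.append(FileEntry("late.csv", 150))
second_time = by_time.list_new(directory)      # [] : 150 is not newer than 200
second_entity = by_entity.list_new(directory)  # ['late.csv']
print(second_time, second_entity)
```

In this model the timestamp-based lister never sees late.csv, while the entity-based lister picks it up exactly once.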
Another thing to consider is that ListFile may list the same file more than once. Consider this scenario:
If you are in such a scenario, you would want to make use of the "Minimum File Age" property. This property tells ListFile to ignore any file whose last modified timestamp, compared to the current time, is not at least the configured amount of time old (that is, the last modified timestamp has not changed for the configured amount of time). The configured time is arbitrary: use whatever length you need to be confident that the file write was complete.
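The age check described above boils down to a simple comparison. Here is a minimal Python sketch of the idea, assuming times in seconds; list_ready and the sample file names are hypothetical and this is not NiFi's actual code.

```python
def list_ready(files, now, min_age):
    """Return names of files whose last modified time is at least min_age old."""
    return sorted(name for name, mtime in files.items() if now - mtime >= min_age)


# Hypothetical directory snapshot: name -> last modified time in seconds.
files = {"done.csv": 1000, "still_writing.csv": 1095}

early = list_ready(files, now=1100, min_age=30)  # ['done.csv']
later = list_ready(files, now=1130, min_age=30)  # ['done.csv', 'still_writing.csv']
print(early, later)
```

The file that was still being written is skipped on the first pass and only becomes eligible once its timestamp has stopped changing for the configured time.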
Something else you need to consider depends on if both the following are true:
1. You are using a multi node NiFi cluster
2. The configured directory you are listing from is mounted to every node.
Since every node in a NiFi cluster executes the same dataflow, you want to prevent every node from listing the same files. In this scenario you would change the "Execution" configuration on ListFile from "All nodes" to "Primary node" and change "Input Directory Location" from "Local" to "Remote". Then you will want to set the "Load Balance Strategy" to "Round robin" on the connection between ListFile and FetchFile.
NOTE: Never set the Execution to "Primary node" on any processor that has an inbound connection. ONLY processors with no inbound connection should be considered for this execution configuration.
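To picture what round robin does with the listing produced on the primary node, here is a small Python sketch. It is only an illustration of the idea: the round_robin function and the node names are made up, and the real load-balanced connection also has to deal with things like nodes being unavailable.

```python
from itertools import cycle


def round_robin(flowfiles, nodes):
    """Hand out FlowFiles to nodes one at a time, in rotating order."""
    assignment = {node: [] for node in nodes}
    for flowfile, node in zip(flowfiles, cycle(nodes)):
        assignment[node].append(flowfile)
    return assignment


# Hypothetical listing from the primary node and a three-node cluster.
listed = ["f1", "f2", "f3", "f4", "f5"]
assignment = round_robin(listed, ["node1", "node2", "node3"])
print(assignment)
# {'node1': ['f1', 'f4'], 'node2': ['f2', 'f5'], 'node3': ['f3']}
```

Each file is listed once, on one node, and the fetch work is then spread across the cluster.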
I know this is a lot to digest, but it is very important to be aware of all of it to ensure success.
If you found this response assisted with your query, please take a moment to login and click on "Accept as Solution" below this post.
Thank you,
Matt
Created 10-08-2021 03:51 AM
Hi,
Matt, thanks for the explanation.
Created 07-28-2024 01:32 AM
Hi
I am facing an issue here. If I add multiple files with the same timestamp, ListFile picks up only 1 or 2 out of 20 files. Will the ListFile "Tracking Entities" strategy resolve this problem?