How to schedule a process to fetch only new files from a directory in Apache NiFi?

Explorer

Hi,

I want to fetch only the new files added to a directory, exactly once: after a file has been picked up, it should not be picked up again in Apache NiFi. I want to schedule this process to run every 3 hours. Please provide a solution with screenshots of the properties you used, or tell me which processors you are using. I am a bit confused between ListFile, GetFile, and FetchFile, and about which properties to use.

Any help with this issue will be greatly appreciated.

Thank you!

2 ACCEPTED SOLUTIONS

Super Guru

Once it picks up a file, it won't pick it up again, because it saves the file's timestamp and then uses that to get only newer files added later, and so on.

Master Mentor

@CodeLa @SAMSAL 

I want to point out that tracking timestamps will not always guarantee NiFi will consume all files from the input file directory depending on how they are being placed in that directory.

The ListFile processor looks at the last modified timestamp on each file. It then lists all files modified since the last recorded timestamp, which is stored in the NiFi state manager from the previous processor execution. On the first run there will be no state, and thus everything currently in the directory is listed.
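To make the timestamp tracking described above concrete, here is a minimal Python sketch. It is a simplified model, not NiFi's actual implementation: the `FileInfo` type, the `list_new_files` function, and the file names are all hypothetical, and the stored state is reduced to a single number.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass(frozen=True)
class FileInfo:
    name: str
    last_modified: float  # hypothetical: seconds since epoch


def list_new_files(
    files: List[FileInfo], last_listed: Optional[float]
) -> Tuple[List[FileInfo], Optional[float]]:
    """Simplified model of timestamp-based listing (an assumption, not NiFi code).

    Returns the files newer than the stored timestamp, plus the new state.
    """
    if last_listed is None:
        # First run: no state yet, so everything currently present is listed.
        new = list(files)
    else:
        new = [f for f in files if f.last_modified > last_listed]
    new.sort(key=lambda f: f.last_modified)
    # The state advances to the newest timestamp seen, or stays unchanged.
    new_state = max((f.last_modified for f in new), default=last_listed)
    return new, new_state


directory = [FileInfo("a.csv", 100.0), FileInfo("b.csv", 200.0)]
first_run, state = list_new_files(directory, None)

directory.append(FileInfo("c.csv", 300.0))
second_run, state = list_new_files(directory, state)

print([f.name for f in first_run])
print([f.name for f in second_run])
```

In this model the second run returns only the file added after the first run, which is the "picked exactly once" behavior the question asks about.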

Now consider the scenarios below, which can prevent the above from listing all files:

  • The mechanism that is writing the files to that input directory does not update the last modified timestamp on the file once it is done writing to it. Let's say we have file 1 that starts being written at 12:00:01.000 and file 2 that starts being written at 12:00:01.300. File 2 completes first and is consumed by ListFile, and the stored state is updated to reflect 12:00:01.300. Now file 1 completes, but it is never consumed by ListFile since its last modified timestamp is older than that of file 2.

If you are in such a scenario, ListFile offers a different "Listing Strategy" called "Tracking Entities", which also tracks filenames in a cache service. This allows it to still list files that may have an older timestamp.
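The file 1 / file 2 scenario can be replayed in a small Python sketch that contrasts the two strategies. This is an illustrative model only: `list_by_timestamp` and `list_by_entity` are hypothetical functions, timestamps are written as milliseconds after 12:00:00, and the "cache" is a plain dictionary rather than a NiFi cache service. It assumes file 1 only becomes visible to the listing after file 2 has already been listed.

```python
from typing import Dict, List, Optional, Tuple


def list_by_timestamp(
    visible: Dict[str, int], last_ts: Optional[int]
) -> Tuple[List[str], Optional[int]]:
    """Hypothetical model: list only files newer than the stored timestamp."""
    new = {n: ts for n, ts in visible.items() if last_ts is None or ts > last_ts}
    return sorted(new), max(new.values(), default=last_ts)


def list_by_entity(visible: Dict[str, int], seen: Dict[str, int]) -> List[str]:
    """Hypothetical model: list any file not yet recorded with this timestamp."""
    new = {n: ts for n, ts in visible.items() if seen.get(n) != ts}
    seen.update(new)  # remember each listed file by name
    return sorted(new)


# Last modified times in ms after 12:00:00 (file 1 = 1000, file 2 = 1300).
listing_1 = {"file2": 1300}                 # file 2 completes first
listing_2 = {"file1": 1000, "file2": 1300}  # file 1 shows up later, older mtime

ts_run_1, ts_state = list_by_timestamp(listing_1, None)
ts_run_2, ts_state = list_by_timestamp(listing_2, ts_state)

seen: Dict[str, int] = {}
entity_run_1 = list_by_entity(listing_1, seen)
entity_run_2 = list_by_entity(listing_2, seen)

print(ts_run_1, ts_run_2)          # timestamp tracking never lists file1
print(entity_run_1, entity_run_2)  # entity tracking picks file1 up later
```

The trade-off in this model is that entity tracking has to remember every file it has listed, while timestamp tracking stores a single value.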

Another thing to consider is that ListFile may list the same file more than once. Consider this scenario:

  • You tell NiFi ListFile to list files from directory /nifi/myfiles/. The mechanism writing these files to the target directory does update the last modified timestamp as the file is being written, but does not use a ".<filename>" (dot rename) approach to writing these files. (With dot rename, the file is initially a hidden file until the write completes, and is then renamed and made unhidden. The default ListFile config ignores hidden files.) So when ListFile runs, it sees that file with a newer last modified timestamp and lists it. Then on the next execution it sees the same file again, because its last modified timestamp has been updated while the file is still being written to.

If you are in such a scenario, you would want to make use of the "Minimum File Age" property. This property tells ListFile to ignore any file whose last modified timestamp, compared to the current time, is not at least the configured amount of time old (meaning the last modified timestamp has not changed for the configured amount of time). The configured time is arbitrary: use whatever length you need to be confident the file write is complete.
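The age check described above amounts to a simple comparison, sketched here in Python. The function name `filter_by_min_age`, the file names, and the 30-second threshold are assumptions made for illustration, not NiFi code or recommended values.

```python
from typing import Dict, List


def filter_by_min_age(
    files: Dict[str, float], now: float, minimum_file_age: float
) -> List[str]:
    """Hypothetical model of a minimum-age filter.

    Keeps only files whose last modified time is at least
    `minimum_file_age` seconds before `now`.
    """
    return sorted(
        name for name, mtime in files.items() if now - mtime >= minimum_file_age
    )


# Last modified times in seconds; "still_writing.csv" was touched 5 s ago.
files = {"done.csv": 900.0, "still_writing.csv": 995.0}

eligible_now = filter_by_min_age(files, now=1000.0, minimum_file_age=30.0)
# 30 s later, with no further writes, the second file is old enough too.
eligible_later = filter_by_min_age(files, now=1030.0, minimum_file_age=30.0)

print(eligible_now)
print(eligible_later)
```

A file that is still being written keeps getting a fresh last modified time, so it stays below the threshold until writes stop.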

There is something else to consider if both of the following are true:

1. You are using a multi-node NiFi cluster.
2. The configured directory you are listing from is mounted on every node.

Since every node in a NiFi cluster is executing the same dataflow, you want to avoid every node listing the same files. In this scenario you would change the "Execution" configuration from "All nodes" to "Primary node" on ListFile and change "Input Directory Location" from "Local" to "Remote". Then you will want to set "Load Balance Strategy" to "Round robin" on the connection between ListFile and FetchFile.

NOTE: Never set the Execution to "Primary node" on any processor that has an inbound connection. ONLY processors with no inbound connection should be considered for this execution configuration.
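For the clustered setup described above, the idea is that a single node produces the listing and the listed entries are then spread across all nodes for fetching. The Python sketch below only illustrates round-robin distribution in general; the node names and the `round_robin` function are hypothetical, and NiFi's load-balanced connections may behave differently in detail.

```python
from itertools import cycle
from typing import Dict, List


def round_robin(listed: List[str], nodes: List[str]) -> Dict[str, List[str]]:
    """Hypothetical model: hand out listed files to nodes in turn."""
    assignment: Dict[str, List[str]] = {node: [] for node in nodes}
    for name, node in zip(listed, cycle(nodes)):
        assignment[node].append(name)
    return assignment


# Listing produced once, on the primary node only.
listed_files = ["f1", "f2", "f3", "f4", "f5"]
assignment = round_robin(listed_files, ["node1", "node2", "node3"])

for node, names in assignment.items():
    print(node, names)
```

Because the listing happens on one node only, each file appears once, and every node still shares the fetching work.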

I know this is a lot to digest, but very important to be aware of to ensure success.

If you found that this response assisted with your query, please take a moment to log in and click on "Accept as Solution" below this post.

Thank you,

Matt 

11 REPLIES

Master Mentor

@varungupta 

This is a ~3 year old post with an already accepted answer. You are likely to get more responsive answers if you start a new thread. NiFi has also evolved considerably over the past 3 years.

Yes, Tracking Entities does not rely solely on timestamps to ensure listing of new files and will help you here. NiFi grabbing only 1-2 of 20 files is about more than just timestamps; I suspect that how the files are being moved into the consumption directory is also impacting you.

Tracking Timestamps is the easiest default setup and consumes the fewest resources, but it does not work for all use cases.
Tracking is based on each file's last modified timestamp. When a listing is performed, it lists all files with a last modified timestamp between the timestamp stored in processor state and the most recent file's last modified timestamp. Problems can happen if the last modified timestamp is not updated.

For example, some system writes to directory A on your local machine and, after the write completes, moves the file to directory B. With that atomic move, the file's timestamp is not updated. If the move does not happen fast enough, the file may get missed in the current listing. It is also possible that a moved file has an older last modified timestamp than another, smaller file that was moved to directory B more quickly. This results in a newer timestamp being stored in state, and the file with the older timestamp is then ignored.

Tracking Entities was added as a solution to these types of problems.

Please help our community grow. If you found that any of the suggestions/solutions provided helped you solve your issue or answer your question, please take a moment to log in and click "Accept as Solution" on one or more of them that helped.

Thank you,
Matt

Explorer

Thanks a lot Matt for the answer.