Created 06-25-2017 03:39 PM
Hello,
I have a NiFi (standalone instance, 1.0.1) GetHDFS processor with this cron schedule: 0 30 0 * * ?
I want the processor to start at 12:30 AM daily. With the above schedule, the processor started at the expected time this morning and read some files, but it hasn't finished reading all of them. The directory had quite a few files yesterday, and right now 1200+ files are still left in it. I have "Keep Source File" set to false, so the processor deletes files as it reads them; the files remaining in the directory therefore haven't been read yet.
My understanding is that, with the above schedule, once GetHDFS starts it should keep reading until all the files in the directory are exhausted, so I don't understand why some files are still left.
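For reference, NiFi CRON schedules use Quartz-style six-field expressions with seconds first (the trailing "?" means "no specific value" for day-of-week). A minimal sketch of how the fields of the schedule above line up:

```python
# Quartz-style cron uses six fields, seconds first (unlike Unix cron).
fields = ["seconds", "minutes", "hours", "day-of-month", "month", "day-of-week"]
expr = "0 30 0 * * ?"  # fires once daily at 00:30:00

for name, value in zip(fields, expr.split()):
    print(f"{name:13s} {value}")
```

So the expression itself is correct for "once per day at 12:30 AM"; the scheduling wasn't the problem here.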
Please help, thank you.
Created 06-28-2017 05:20 PM
Thanks to @Bryan Bende: I needed to increase the Batch Size property in GetHDFS to read all the files in the directory.
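For anyone hitting the same thing: with a CRON schedule, GetHDFS triggers once per scheduled time and pulls at most "Batch Size" files per trigger, so a once-a-day schedule with a small batch size leaves a backlog. A hypothetical back-of-the-envelope sketch (the 1300-file / 100-batch numbers are assumed for illustration, not values from the thread or the NiFi docs):

```python
def files_remaining(total_files: int, batch_size: int, triggers: int) -> int:
    """Files left after `triggers` scheduled runs, each pulling up to batch_size files."""
    remaining = total_files
    for _ in range(triggers):
        remaining = max(0, remaining - batch_size)
    return remaining

# Assumed numbers: ~1300 files, batch size 100, one CRON trigger per day.
print(files_remaining(1300, 100, 1))  # -> 1200 files still in the directory
```

Raising Batch Size (or triggering more often) lets a single day's scheduling window cover the whole backlog.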
Created 06-26-2017 01:41 PM
@Shashank Chandhok the schedule change to "0,30 30 0 * * ?" helped read a few additional files, but many files still remain in the directory.
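That result makes sense if each trigger pulls at most one batch: "0,30 30 0 * * ?" fires twice (at 00:30:00 and 00:30:30), so it only adds one extra batch per day. A hedged sketch with the same assumed illustrative numbers (~1300 files, 100 files per batch):

```python
def files_remaining(total_files: int, batch_size: int, triggers: int) -> int:
    """Files left after `triggers` runs, each pulling up to batch_size files."""
    remaining = total_files
    for _ in range(triggers):
        remaining = max(0, remaining - batch_size)
    return remaining

# "0 30 0 * * ?"   -> 1 trigger/day;  "0,30 30 0 * * ?" -> 2 triggers/day.
print(files_remaining(1300, 100, 2))  # -> 1100: a few more files read, most remain
```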
Created 06-26-2017 02:36 PM
Please check the timestamps of the files remaining in the directory: are they being added while the processor is running, or are their timestamps older than the CRON runtime of the processor?
Created 06-26-2017 03:06 PM
@Shashank Chandhok actually, the files I'm trying to process are from the day before. In the directory path of the GetHDFS processor I'm using expression language to point to the directory that was created yesterday, and the files in that directory are from yesterday. So when the CRON schedule fires at 12:30 AM, all the files that need to be processed should already be in that directory.
Created 06-26-2017 03:09 PM
I'm not sure why I need to schedule the GetHDFS processor to run continuously (I set it to run every 15 seconds), but this schedule exhausts all files from the directory: 0/15 * * * * ?
In my case, since I'm loading files the next day (the GetHDFS directory path points to the previous day's directory), this resolves the issue I was facing.
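To put rough numbers on why the every-15-seconds schedule drains the backlog (again assuming ~1300 files and 100 files pulled per trigger; these are illustrative figures, not values from the NiFi docs):

```python
import math

# Assumed figures: backlog of 1300 files, up to 100 per trigger, trigger every 15 s.
backlog, batch_size, interval_s = 1300, 100, 15
triggers_needed = math.ceil(backlog / batch_size)
print(triggers_needed, triggers_needed * interval_s)  # -> 13 triggers, 195 seconds
```

So at one batch per 15-second trigger, the whole previous day's directory empties in a few minutes, which matches the behavior described above.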