
NiFi's GetHDFS processor with Cron schedule not reading all files in the directory

Rising Star

Hello,

I'm running a standalone NiFi instance (1.0.1) with a GetHDFS processor on this cron schedule: 0 30 0 * * ?
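For reference, NiFi's cron scheduling uses Quartz cron expressions, which have six required fields (seconds first). The schedule above breaks down as:

```
0 30 0 * * ?
│ │  │ │ │ └─ day of week (? = no specific value)
│ │  │ │ └─── month (every month)
│ │  │ └───── day of month (every day)
│ │  └─────── hour (0 = 12 AM)
│ └────────── minute (30)
└──────────── second (0)
```

So the processor is triggered once per day, at 00:30:00.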

I want the processor to start at 12:30 AM daily. With the above schedule, the processor started at the expected time this morning and read some files, but it didn't finish reading all of them. I had quite a few files in the directory yesterday, and right now there are still 1200+ files left. I have "Keep Source File" set to false, so the processor deletes each file as it reads it; the files remaining in the directory therefore haven't been read yet.

My understanding is that with the above schedule, once GetHDFS starts, it should keep reading until all the files in the directory are exhausted, so I don't understand why some files are left behind.

Please help, thank you.

1 ACCEPTED SOLUTION


Rising Star

Thanks to @Bryan Bende: I needed to increase the Batch Size property on GetHDFS to read all the files in the directory.

https://community.hortonworks.com/questions/108547/need-clarification-on-how-nifi-processors-run-wit...
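To illustrate the mechanic behind this fix, here is a minimal sketch in plain Python (the file count and batch size are hypothetical, not NiFi's actual values): a GetHDFS-like processor pulls at most Batch Size files each time the scheduler triggers it, so a once-a-day cron trigger with a backlog larger than the batch size leaves files behind until the next day.

```python
# Sketch: a processor that consumes at most `batch_size` files per
# scheduled trigger, deleting what it reads (Keep Source File = false).

def run_trigger(directory, batch_size):
    """Pull up to batch_size files from the directory and remove them."""
    batch = directory[:batch_size]
    del directory[:batch_size]
    return batch

files = [f"file_{i}" for i in range(1300)]   # hypothetical backlog
pulled = run_trigger(files, batch_size=100)  # hypothetical small batch

print(len(pulled))  # 100 files read on this trigger
print(len(files))   # 1200 files still waiting for tomorrow's trigger
```

Raising the batch size (or triggering the processor more often) lets the whole backlog drain in a single scheduled window.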


6 REPLIES

Contributor

@Raj B This looks similar to NIFI-4069

As a workaround, please try changing the cron schedule to 0,30 30 0 * * ? so that it runs twice in the same minute.

Let us know if that helps.

Rising Star

@Shashank Chandhok the schedule change to "0,30 30 0 * * ?" helped read a few additional files, but many files still remain in the directory.

Contributor

@Raj B

Please check the timestamps of the files remaining in the directory: are they being added while the processor is running, or are their timestamps older than the processor's cron run time?

Rising Star

@Shashank Chandhok actually, the files I'm trying to process are from the day before. In the GetHDFS processor's Directory property I'm using Expression Language to point to the directory that was created yesterday, and the files in that directory are from yesterday. So when the cron schedule fires at 12:30 AM, all the files that need to be processed should already be in that directory.
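For reference, a common NiFi Expression Language pattern for pointing the Directory property at the previous day's folder looks like the following (the base path /data/ and the date layout are hypothetical; adjust to your own directory structure):

```
/data/${now():toNumber():minus(86400000):format('yyyy/MM/dd')}
```

Here now() is converted to epoch milliseconds, 86400000 ms (one day) is subtracted, and the result is formatted back into a date path.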

Rising Star

Not sure why I need to schedule the GetHDFS processor to run continuously (I set it to run every 15 seconds), but this schedule exhausts all files from the directory: 0/15 * * * * ?

In my case, since I'm loading files the next day (the GetHDFS Directory path points to the previous day's directory), this resolves the issue I was facing.
