
Need clarification on how NiFi processors run with the CRON schedule

Rising Star

Hello,

I'm trying to understand how NiFi processor "runs" work with the CRON scheduler.

I understand that by default a processor runs all the time (the 0 sec "Timer driven" schedule that every processor has by default). When a processor is put on a CRON driven schedule, I understand that the schedule dictates when the processor is triggered to run. But once the processor is triggered, how long does it stay running? Does it stop after a certain amount of time? The CRON run schedule only specifies when and how often the processor should be triggered to start, so where do you specify how long it should run before it stops?

For example, let's say I set a Get* processor to run daily at 1 am. Once the current system time is 1 am, the processor starts running, but does it ever stop once the scheduler has started it, or does it stay running? If it stays running, then it doesn't need to be triggered again the next day at 1 am, because it would already be running, right? If it does stop after the scheduler triggers it, how long after starting does it stop, and where do you specify how long the processor should run?

Thank you.

1 ACCEPTED SOLUTION

When a processor is started, it shows green, which means it is scheduled to run according to its scheduling strategy. When it's stopped, it shows red, which means it is not scheduled to run. So even if a processor is on a CRON schedule for once a day, it will be green all the time because it's still scheduled; it will only be red if you specifically stop the processor.

When the processor is triggered to run on a CRON schedule, it doesn't run for a certain amount of time. It runs once (one call to the processor's onTrigger), so what happens depends on what that call to onTrigger does...

GetHDFS has a Batch Size property which specifies how many files to pull in one execution, so you would need your batch size to be greater than however many files are going to be in the directory so that it can grab them all in one execution.
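
To make the Batch Size point concrete, here is a toy model in Python. It is not NiFi code: on_trigger, the file names, and the counts are all made up for illustration, and the only behavior it assumes is what is described above, that one execution pulls at most Batch Size files.

```python
from collections import deque

def on_trigger(directory, batch_size):
    """Toy stand-in for one execution: take at most batch_size files."""
    pulled = []
    while directory and len(pulled) < batch_size:
        pulled.append(directory.popleft())
    return pulled

# 250 hypothetical files are waiting when the CRON schedule fires.
directory = deque(f"file_{i:03d}.csv" for i in range(250))

# Batch size below the file count: one execution leaves files behind.
first_run = on_trigger(directory, batch_size=100)
left_after_first = len(directory)
print(len(first_run), "pulled,", left_after_first, "left until the next trigger")

# Batch size above the file count: one execution drains the directory.
directory = deque(f"file_{i:03d}.csv" for i in range(250))
second_run = on_trigger(directory, batch_size=1000)
print(len(second_run), "pulled,", len(directory), "left")
```

In the first case the leftover files would sit there until the next scheduled trigger, which on a once-a-day CRON schedule means the next day.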

Alternatively, ListHDFS should list all files that are new since the last listing, so you could use ListHDFS + FetchHDFS.
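
The list-then-fetch idea can be sketched roughly like this in Python. This is only a simplified model, not how the processors are implemented: list_new_files, fetch, the paths, and the integer timestamps are hypothetical, and tracking a single "last seen" modification time is an assumption made to keep the example short.

```python
def list_new_files(files, last_seen):
    """Return paths modified after last_seen, plus the updated marker."""
    new = sorted((mtime, path) for path, mtime in files.items() if mtime > last_seen)
    newest = new[-1][0] if new else last_seen
    return [path for _, path in new], newest

def fetch(path):
    """Placeholder for reading the content of one listed file."""
    return f"contents of {path}"

# Hypothetical directory: path -> modification time.
files = {"/data/a.csv": 100, "/data/b.csv": 105}

# First listing sees everything; each listed path is fetched separately.
listed, marker = list_new_files(files, last_seen=0)
fetched = [fetch(path) for path in listed]

# A file that arrives later is the only one the next listing returns.
files["/data/c.csv"] = 110
listed_again, marker = list_new_files(files, last_seen=marker)
print(listed, listed_again)
```

The point of the split is that the listing step decides what is new, so the fetch step does not depend on a batch size being large enough.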


4 REPLIES

Using a CRON schedule means the framework will trigger the processor to run once at the specified time, meaning the onTrigger method of the processor will be executed once. The processor does not remain running. CRON is really intended for source processors, to schedule pulling data from somewhere at a specified time. Processors in the middle of the flow should typically be Timer driven with a run schedule of 0.
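
A small simulation of the difference, in Python. This is not the NiFi framework: CountingProcessor, run_cron_driven, and run_timer_driven are invented names, and time is reduced to one tick per hour. It only illustrates "one onTrigger call per CRON fire time" versus "called over and over with a run schedule of 0".

```python
class CountingProcessor:
    """Hypothetical processor that only counts its on_trigger calls."""

    def __init__(self):
        self.calls = 0

    def on_trigger(self):
        self.calls += 1

def run_cron_driven(processor, fire_times, ticks):
    """Call on_trigger once for each tick that matches a scheduled fire time."""
    for tick in ticks:
        if tick in fire_times:
            processor.on_trigger()

def run_timer_driven(processor, ticks):
    """Run schedule of 0: call on_trigger on every tick while started."""
    for _ in ticks:
        processor.on_trigger()

hours = range(48)  # two simulated days, one tick per hour

cron = CountingProcessor()
run_cron_driven(cron, fire_times={1, 25}, ticks=hours)  # 1 am on each day

timer = CountingProcessor()
run_timer_driven(timer, ticks=hours)

print(cron.calls, timer.calls)  # prints "2 48"
```

Both processors would show as started for the whole two days; only the number of onTrigger calls differs.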

Rising Star

@Bryan Bende thank you.

This is my use case: GetHDFS is on a CRON schedule to run daily at 12:30 am, to process files that were inserted into an HDFS directory; these files would have been created the previous day.
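
For reference, NiFi's CRON driven strategy takes Quartz-style expressions, where the first field is seconds. The Python helper below just labels the fields of such an expression; label_cron_fields is a made-up name, it does no real validation, and the two expressions are my reading of "daily at 12:30 am" and "daily at 1 am", so check them against your own schedule.

```python
FIELD_NAMES = ["seconds", "minutes", "hours", "day_of_month", "month", "day_of_week"]

def label_cron_fields(expression):
    """Split a Quartz-style expression into named fields (year is optional)."""
    parts = expression.split()
    if len(parts) not in (6, 7):
        raise ValueError("expected 6 fields, or 7 with the optional year")
    return dict(zip(FIELD_NAMES + ["year"], parts))

daily_0030 = label_cron_fields("0 30 0 * * ?")  # every day at 12:30 am
daily_0100 = label_cron_fields("0 0 1 * * ?")   # every day at 1 am

for name, value in daily_0030.items():
    print(f"{name:>12}: {value}")
```

Note that the expression only says when the trigger fires; it has no field for how long the processor runs.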

The GetHDFS processor does start at 12:30 am, as expected, but not all files from the directory are processed. So it seems the processor does not stay in the running state until all the files are processed. Is that the expected behavior, since you are saying "the processor does not remain running"? So: 1) at what time (how long after starting) does the processor stop? 2) how do you control how long the processor stays running after it is triggered (in my case, long enough to let all files be processed)? 3) when the processor is not running, does its icon in the NiFi UI change to the stopped state?

Thanks.

Rising Star

Thank you, it's the batch schedule that needed to be changed in my case.