Apache Nifi PutFile Processor

Explorer

How can we make the PutFile Nifi processor execute only at some specified date and time?

10 REPLIES

Explorer

This can be done with the Cron Driven scheduling strategy if that option is available; otherwise you can automate the process using Python and nipyapi.

Explorer

Thanks for the solution.

Master Mentor

@Ankit13 

 

NiFi processors support "Timer Driven" and "Cron Driven" Scheduling Strategies.


  • The Timer Driven strategy allows you to specify a time interval for scheduling (for example, "30 secs" means the processor will execute every 30 seconds).
  • The Cron Driven strategy uses a Quartz cron expression [1] to specify when the processor should execute.

There is a third option on some processors, Event Driven, which should not be used. It was created long ago and was considered experimental. It has since been deprecated due to improvements made in the Timer Driven strategy. It only remains in NiFi to avoid breaking the flows of those who still use it when they upgrade.
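For the Cron Driven strategy described above, here is a minimal Python sketch of how a Quartz cron expression for one specific date and time could be built. The function name and the target date are hypothetical. It assumes the Quartz field order (seconds, minutes, hours, day of month, month, day of week, optional year) and that your NiFi version accepts the optional year field, so please check the expression against your own instance before relying on it.

```python
from datetime import datetime


def quartz_cron_for(moment: datetime) -> str:
    """Build a Quartz cron expression that matches one specific date and time.

    Field order: seconds minutes hours day-of-month month day-of-week year.
    '?' means "no specific value" for day-of-week, because day-of-month is set.
    """
    return (
        f"{moment.second} {moment.minute} {moment.hour} "
        f"{moment.day} {moment.month} ? {moment.year}"
    )


# Hypothetical target: 15 October 2021 at 09:30:00
print(quartz_cron_for(datetime(2021, 10, 15, 9, 30, 0)))  # 0 30 9 15 10 ? 2021
```

The resulting string would go in the processor's Run Schedule field when the Scheduling Strategy is set to Cron Driven. The time is evaluated in the time zone of the NiFi host.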

 

Important things to understand about your ask:

  1. Let's assume you configure your PutFile to execute using the Cron Driven scheduling strategy and the inbound connection to the PutFile processor has multiple FlowFiles queued. When the processor executes, it will process only 1 of those FlowFiles from that inbound connection queue with default settings. The next queued FlowFile would not get processed until the next scheduled cron execution. While there is no way to make sure that every queued FlowFile is processed in a single cron execution, you can change the configured Run Duration. The Run Duration tells the processor to continue to use the same execution thread to execute against as many queued FlowFiles as possible within the configured run duration time. Let's say the Run Duration is set to 2s and it takes more than 2 secs to write the very first FlowFile to the target directory. In that case, only one FlowFile would be processed, so there would be no perceived difference between a run duration of 0ms and 2s.
  2. In a NiFi cluster, each node in the cluster executes the dataflow against the FlowFiles queued on that same node. So if FlowFiles were queued on the inbound connection to PutFile on all nodes, each node would execute PutFile once at each cron interval, processing through its own queued FlowFile(s) as described above.
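To illustrate the Run Duration behavior described above, here is a deliberately simplified Python model. It is not NiFi's actual scheduling code; the function name and the per-FlowFile write times are made up for illustration. It only captures the idea that one execution always handles the first FlowFile and then keeps going while the elapsed time is still under the run duration.

```python
def flowfiles_processed(write_times_ms, run_duration_ms):
    """Count the FlowFiles handled in one scheduled execution (simplified model).

    The first queued FlowFile is always handled. Further FlowFiles are taken
    only while the elapsed time is still below the configured run duration.
    """
    elapsed_ms = 0
    processed = 0
    for write_time_ms in write_times_ms:
        if processed > 0 and elapsed_ms >= run_duration_ms:
            break  # run duration used up; the rest wait for the next execution
        elapsed_ms += write_time_ms
        processed += 1
    return processed


# Ten queued FlowFiles that each take 500 ms to write (hypothetical numbers)
print(flowfiles_processed([500] * 10, 0))      # 1
print(flowfiles_processed([500] * 10, 2000))   # 4

# The first write takes longer than 2 s, so 0ms and 2s behave the same
print(flowfiles_processed([2500, 100, 100], 0))     # 1
print(flowfiles_processed([2500, 100, 100], 2000))  # 1
```

In this model a longer run duration helps only when individual writes are fast compared with the configured duration.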

 

[1] https://community.cloudera.com/t5/forums/replypage/board-id/Questions/message-id/229905

 

If you found this response assisted with your query, please take a moment to log in and click on "Accept as Solution" below this post.

Thank you,

Matt

Explorer

Thanks for the detailed solution. However, I have thousands of FlowFiles as input to the PutFile processor, and they must all be processed at a given future date and time. Could you kindly suggest how to handle this situation for the same requirement, if possible?

Explorer

This can be done with nipyapi, a Python library. Check the nipyapi documentation.

Master Mentor

@Ankit13 

 

Perhaps I don't understand your use case.

Are you saying you have a NiFi dataflow that slowly ingests data, producing FlowFiles that work their way through your dataflow to this PutFile processor?
Then you want these 1000s of FlowFiles to queue up so that they can all be put to the local file system directory at the same time?

So what is being suggested by @m_adeel is to use nipyapi to automate the starting and stopping of the PutFile processor at a given time. You could also do the same through NiFi REST API calls. You would still have the challenge of when to stop it.
Does the source of data ever stop coming in?
Would you be able to put all the FlowFiles from the inbound connection queue to disk before more source FlowFiles started flowing into the queue?
Why the need to do this at a specific date and time?
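As a rough sketch of the REST API route mentioned above, the Python below builds (but does not send) a request that asks NiFi to start or stop a processor, plus a helper for working out how long to wait until the target time. The base URL, processor id, and function names are assumptions for illustration. I am assuming the PUT /processors/{id}/run-status endpoint of the NiFi REST API, an unsecured instance, and that you first fetch the processor's current revision version with GET /processors/{id}; a secured NiFi would also need authentication.

```python
import json
import urllib.request
from datetime import datetime

# Assumed base URL of an unsecured NiFi instance; adjust for your environment.
NIFI_API = "http://localhost:8080/nifi-api"


def seconds_until(target: datetime, now: datetime) -> float:
    """Seconds to wait from 'now' until 'target'; never negative."""
    return max(0.0, (target - now).total_seconds())


def build_run_status_request(processor_id: str, version: int, state: str):
    """Build the PUT request that sets a processor to RUNNING or STOPPED.

    'version' must be the processor's current revision version, otherwise
    NiFi is expected to reject the update.
    """
    if state not in ("RUNNING", "STOPPED"):
        raise ValueError("state must be RUNNING or STOPPED")
    body = json.dumps({"revision": {"version": version}, "state": state})
    return urllib.request.Request(
        f"{NIFI_API}/processors/{processor_id}/run-status",
        data=body.encode("utf-8"),
        method="PUT",
        headers={"Content-Type": "application/json"},
    )
```

A small script could then call time.sleep(seconds_until(target, datetime.now())) and pass the built request to urllib.request.urlopen to start PutFile at the chosen time. Deciding when to send the matching STOPPED request remains the open question.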

Thanks,

Matt

Explorer

Yes @MattWho , you understood it correctly.

 

Let me tell you that the files in the source folder are there from the start, and no more files are put in the source folder after the NiFi flow processing starts.

The files from the source folder need to be processed by the PutFile processor at a given future specified date and time, as required by the client.

Master Mentor

@Ankit13 

How do you know no more files will be put there after the NiFi flow processing starts?

To me it sounds like the PutFile should execute at the default 0 secs (as fast as it can run) and you should instead control this dataflow at the beginning, where you consume the data.

For example:
In a 24 hour window, data is written to the source directory between 00:00:00 and 16:00:00. You then want to write that data to the target directory starting at 17:00. So you instead set up a cron on a ListFile processor to list the files at 17:00 and 17:01, and then have a FetchFile and PutFile running all the time so they immediately consume all the content for the listed files and write them to the target directory. Your ListFile then does not execute again until the same time the next day, or whatever your cron is. This way the files are all listed at the same time, and the PutFile can execute for as long as needed to write all those files to the target directory.
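To make the example above concrete, here is a small Python sketch. The cron expression is my assumption of how "17:00 and 17:01 every day" could be written in Quartz syntax for the ListFile processor, and the tiny matcher is hypothetical: it only understands '*', '?', plain numbers, and comma lists, so it is a way to sanity-check the expression rather than a Quartz implementation. FetchFile and PutFile would stay on Timer Driven with a Run Schedule of 0 sec.

```python
from datetime import datetime

# Assumed Quartz expression for ListFile: second 0 of 17:00 and 17:01, every day.
LIST_FILE_CRON = "0 0,1 17 * * ?"


def field_matches(field: str, value: int) -> bool:
    """Match one cron field; supports only '*', '?', numbers, and comma lists."""
    if field in ("*", "?"):
        return True
    return value in {int(part) for part in field.split(",")}


def cron_fires_at(expression: str, moment: datetime) -> bool:
    """Check seconds, minutes, hours, day-of-month, and month against a moment.

    Day-of-week is not evaluated, so it must be '?' or '*' in this sketch.
    """
    sec, minute, hour, day, month, day_of_week = expression.split()
    if day_of_week not in ("?", "*"):
        raise ValueError("this sketch does not evaluate day-of-week")
    fields = (sec, minute, hour, day, month)
    values = (moment.second, moment.minute, moment.hour, moment.day, moment.month)
    return all(field_matches(f, v) for f, v in zip(fields, values))


# Hypothetical date; only the time of day matters for this expression
print(cron_fires_at(LIST_FILE_CRON, datetime(2021, 10, 15, 17, 0, 0)))  # True
print(cron_fires_at(LIST_FILE_CRON, datetime(2021, 10, 15, 17, 1, 0)))  # True
print(cron_fires_at(LIST_FILE_CRON, datetime(2021, 10, 15, 17, 2, 0)))  # False
```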

Hope this helps,

Matt

Master Mentor

@Ankit13 

My recommendation would be to only automate the enabling/disabling and starting/stopping of the NiFi processor component that is ingesting the data into your NiFi dataflow, and leave all downstream processors always running, so that any data that is ingested into your dataflow has every opportunity to be processed through your dataflow to the end. When a "running" processor is scheduled to execute but has no FlowFiles queued in its inbound connection(s), it pauses instead of running immediately over and over again, which prevents excessive CPU usage. So it is safe to leave these downstream components running all the time.

Thank you,
Matt