There is a custom processor that reads data from SQL Server and creates a flow file per row. This causes creation of millions of flow-files and the destination queue often gets full. The objective is to prevent triggering of the processor if the destination queue is full.
For background, see the thread about back pressure and throttling.
In our dev environment, in spite of setting 'Back Pressure Object Threshold' and 'Back Pressure Data Size Threshold', the processor still executed on its 5-minute schedule even when the destination queue held far more flow files than the configured threshold. I am a bit confused by the following passage in the documentation:
Several factors exist that will contribute to when a Processor’s onTrigger method is invoked. First, the Processor will not be triggered unless a user has configured the Processor to run. If a Processor is scheduled to run, the Framework periodically (the period is configured by users in the User Interface) checks if there is work for the Processor to do, as described above. If so, the Framework will check downstream destinations of the Processor. If any of the Processor’s outbound Connections is full, by default, the Processor will not be scheduled to run.
Can any of the approaches below prevent the processor from being triggered the moment the destination queue is full:
TriggerWhenAnyDestinationAvailable. Its documentation says:
By default, NiFi will not schedule a Processor to run if any of its outbound queues is full. This allows back-pressure to be applied all the way a chain of Processors. However, some Processors may need to run even if one of the outbound queues is full. This annotations indicates that the Processor should run if any Relationship is "available." A Relationship is said to be "available" if none of the connections that use that Relationship is full. For example, the DistributeLoad Processor makes use of this annotation. If the "round robin" scheduling strategy is used, the Processor will not run if any outbound queue is full. However, if the "next available" scheduling strategy is used, the Processor will run if any Relationship at all is available and will route FlowFiles only to those relationships that are available.
Back-pressure will stop the processor from executing, but back-pressure is only detected if the queue is already full at the moment the framework checks, just before the processor would execute. Take this example: Processor A feeds Processor B, the queue between them has back-pressure set to 100 flow files, and there are currently 99 flow files in the queue. Processor A gets triggered because the 100 flow file limit hasn't been reached yet. If this single execution of Processor A produces 100 flow files, it is allowed to put all of them in the queue, so there will be 199. Processor A then won't execute anymore, because on the next check the queue will be over 100.
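The 99-plus-100 example above can be reproduced with a small toy model. This is only a sketch of the check-before-run behaviour described here, not NiFi code; the class name `BackPressureDemo` and its `simulate` method are made up for illustration, and the model assumes a queue counts as full once it holds at least the threshold number of flow files.

```java
// Toy model of "check the queue only before each run"; not NiFi code.
final class BackPressureDemo {

    /**
     * Simulates a number of scheduling attempts against one outbound queue.
     *
     * @param threshold back-pressure object threshold of the queue
     * @param queued    flow files already in the queue
     * @param perRun    flow files produced by a single execution
     * @param attempts  how many times the scheduler tries to run the processor
     * @return the queue size after all attempts
     */
    static int simulate(int threshold, int queued, int perRun, int attempts) {
        for (int i = 0; i < attempts; i++) {
            // The check happens once, before the run, never during it.
            if (queued >= threshold) {
                continue; // queue is full: this attempt is skipped
            }
            // The run is not interrupted, so everything it produces is enqueued.
            queued += perRun;
        }
        return queued;
    }

    public static void main(String[] args) {
        // Threshold 100, 99 queued, 100 produced per run, 3 scheduling attempts.
        System.out.println(simulate(100, 99, 100, 3)); // prints 199
    }
}
```

Only the first attempt runs; the later ones are skipped because the queue is already past the threshold, which matches the behaviour seen when a single execution emits far more flow files than the threshold allows.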
There is a method on the context that you could use to check if any outbound queues have space:
    /**
     * @return the set of all relationships for which space is available to
     * receive new objects
     */
    Set<Relationship> getAvailableRelationships();
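One way a processor might use this check (the Javadoc above belongs to `getAvailableRelationships()` on NiFi's `ProcessContext`) is to emit rows in batches and stop as soon as the target relationship is no longer available. The sketch below does not use the NiFi API: `Context` and `FakeQueueContext` are hypothetical stand-ins, relationships are modelled as plain strings, and incrementing the queue counter stands in for transferring and committing flow files. In a real processor the queue would presumably only reflect new flow files once the session is committed, so treat this as an illustration of the control flow only.

```java
import java.util.Collections;
import java.util.Set;

// Hypothetical stand-in for the one ProcessContext method used here.
interface Context {
    Set<String> getAvailableRelationships();
}

// Hypothetical fake: one outbound queue behind the "success" relationship.
final class FakeQueueContext implements Context {
    final int threshold;
    int queued;

    FakeQueueContext(int threshold, int queued) {
        this.threshold = threshold;
        this.queued = queued;
    }

    @Override
    public Set<String> getAvailableRelationships() {
        // "success" is available only while the queue is below its threshold.
        return queued < threshold
                ? Collections.singleton("success")
                : Collections.<String>emptySet();
    }
}

final class RowReaderSketch {
    /**
     * Emits up to maxBatches batches, re-checking availability before each one.
     *
     * @return the number of flow files emitted in this invocation
     */
    static int onTrigger(FakeQueueContext context, int batchSize, int maxBatches) {
        int emitted = 0;
        for (int i = 0; i < maxBatches; i++) {
            if (!context.getAvailableRelationships().contains("success")) {
                break; // downstream is full: stop producing and return
            }
            context.queued += batchSize; // stands in for transfer + commit
            emitted += batchSize;
        }
        return emitted;
    }
}
```

With a threshold of 100 and 99 flow files queued, a call such as `RowReaderSketch.onTrigger(new FakeQueueContext(100, 99), 10, 10)` emits one batch of 10 and then stops, so the overshoot is bounded by one batch rather than by everything a single execution could produce.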
You should also consider not producing millions of small flow files. You will get significantly better performance if you can write batches of thousands, or even tens of thousands, of rows to a single flow file. This will depend on what you are doing in the rest of the flow, but many of the "record" processors were introduced so that records could be passed around in batches and processed in place.
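As a rough illustration of the batching idea, the sketch below groups rows into newline-delimited chunks, where each chunk would become the content of one flow file instead of one flow file per row. The class `RowBatcher`, its `batch` method, and the newline-delimited format are assumptions made for this example; in a real flow the rows would come from the SQL Server result set and the content would more likely be written with one of the record writers.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper: groups rows so that one flow file carries many rows.
final class RowBatcher {

    /**
     * Joins rows into newline-delimited batches of at most batchSize rows.
     *
     * @return one string per batch, i.e. one per flow file to create
     */
    static List<String> batch(List<String> rows, int batchSize) {
        if (batchSize <= 0) {
            throw new IllegalArgumentException("batchSize must be positive");
        }
        List<String> batches = new ArrayList<>();
        for (int start = 0; start < rows.size(); start += batchSize) {
            int end = Math.min(start + batchSize, rows.size());
            // subList is a view; String.join copies it into the batch content.
            batches.add(String.join("\n", rows.subList(start, end)));
        }
        return batches;
    }
}
```

With a batch size of 10,000, one million rows would yield 100 flow files instead of one million, which also makes an object-count back-pressure threshold far less likely to be overshot by a single execution.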