Support Questions
Find answers, ask questions, and share your expertise

Custom processor and backpressuring, throttling



Expert Contributor

NiFi 1.2.0

There is a custom processor that reads data from a database and passes it on. In a recent stress test, the 'success' relationship queue clogged, as did the flow downstream of it, because the processor dumped hundreds of thousands of FlowFiles amounting to several GBs. Obviously, backpressure was not configured. I have also read an informative post about throttling and backpressure.

What I have figured out is that backpressure is something we configure on the connection (relationship queue), and that standard processors like ControlRate can help regulate the data flow.

Question :

Is additional coding required in the processor (e.g. some interface to be implemented) to enable it to sleep/stop consuming data under backpressure, or does the NiFi framework handle that once the 'success' relationship of the processor is configured for backpressure?


Re: Custom processor and backpressuring, throttling

Master Guru
@Kaliyug Antagonist

Generally speaking, the NiFi controller handles enforcing backpressure constraints on processors feeding a connection where backpressure thresholds have been configured.

First, it helps to understand how backpressure thresholds work. These configured thresholds are soft limits: the preceding processor will be allowed to trigger as long as none of its outgoing connections has reached or exceeded its configured backpressure threshold. The controller is responsible for providing these processors the threads to execute with. So let's assume a backpressure threshold of 10,000 objects has been set and the connection currently holds 9,900 objects (FlowFiles). Since the threshold has not been reached, the feeding processor will be given a thread by the controller when its next run schedule occurs. Let's assume this particular processor works on batches of 1,000 FlowFiles per execution. At the end of its execution the downstream connection will hold 10,900 queued FlowFiles. The controller will then stop providing a thread to the feeding processor when its run schedule occurs, until the downstream connection has dropped back below the configured backpressure threshold.
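The soft-limit behavior can be sketched as a tiny simulation (plain Python, not NiFi code; the constants and function name are illustrative, not part of the NiFi API):

```python
# Toy model of NiFi's soft backpressure limit: the controller grants a
# thread only while the downstream queue is below the threshold, but an
# execution that is already running may push the queue past it.

BACKPRESSURE_THRESHOLD = 10_000   # object-count threshold on the connection
BATCH_SIZE = 1_000                # FlowFiles the processor emits per execution

def run_schedule(queue_depth: int) -> int:
    """One controller scheduling pass for the feeding processor."""
    if queue_depth >= BACKPRESSURE_THRESHOLD:
        return queue_depth            # no thread granted; processor is skipped
    return queue_depth + BATCH_SIZE   # thread granted; the whole batch lands

queue = 9_900
queue = run_schedule(queue)   # 9,900 < 10,000, so the batch runs
print(queue)                  # 10900 -- queue now exceeds the soft limit
queue = run_schedule(queue)   # threshold exceeded, processor not scheduled
print(queue)                  # 10900 -- unchanged until the queue drains
```

The key point the model shows is that the check happens before the execution, not during it, which is why the queue can legitimately sit above the configured limit.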

So if a processor has a long-running thread that ingests 1 million FlowFiles before the thread ends, the configured threshold will not be enforced very well. Processors like GetTCP (with a large batch size) and ConnectWebSocket are a couple of examples. In cases like this you will see that a backpressure setting on the downstream connection of such a processor has little effect: the queue may climb well over the configured threshold before the brakes are applied to the feeding processor.
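Following the same toy model, the worst-case overshoot past the threshold is driven entirely by how much one execution can emit (again illustrative Python, not NiFi code):

```python
# Worst case: a thread is granted while the queue sits just below the
# threshold, then the execution emits its entire batch before ending.

BACKPRESSURE_THRESHOLD = 10_000

def overshoot(batch_size: int) -> int:
    """FlowFiles past the soft limit when backpressure finally kicks in."""
    worst_case_depth = (BACKPRESSURE_THRESHOLD - 1) + batch_size
    return worst_case_depth - BACKPRESSURE_THRESHOLD

print(overshoot(1_000))      # 999 -- a modest overshoot
print(overshoot(1_000_000))  # 999999 -- a long-running ingest blows past the limit
```

This is why per-execution batch size matters when you want backpressure to hold the queue close to its configured limit.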

Then you have processors like ListenTCP or ListenUDP. When started, these processors bind to a port and listen for incoming connections. They have an internal queue from which FlowFiles are created based on the configured run schedule. Backpressure on a downstream connection will result in the processor no longer reading messages off that internal queue. In the case of ListenTCP, this means the internal queue will eventually be unable to accept new data, pushing back on the source. For ListenUDP, it likely means data loss once the internal queue is full.
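The difference between the two Listen-style outcomes can be sketched with a bounded queue (a simplified model; the queue size and function name are made up for illustration):

```python
from collections import deque

# Toy model of a Listen-style processor's bounded internal queue. When
# downstream backpressure stops the processor from draining it, a TCP-like
# source is told to back off, while a UDP-like source simply loses datagrams.

INTERNAL_QUEUE_MAX = 3

def receive(internal_queue: deque, message: str, reliable: bool) -> str:
    if len(internal_queue) < INTERNAL_QUEUE_MAX:
        internal_queue.append(message)
        return "queued"
    # Queue full: TCP pushes back on the sender; UDP has no such mechanism.
    return "blocked" if reliable else "dropped"

q = deque()
for i in range(4):
    print(receive(q, f"msg-{i}", reliable=True))   # queued x3, then blocked
print(receive(q, "datagram", reliable=False))      # dropped
```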


