
NiFi stuck data in queue between processors

Explorer

Hey, community!
We have a NiFi 1.18.0 cluster (still, yes, sorry) and the following issue has been on my mind.

It's a simple flow where we read data from Kafka with a ConsumeKafka processor and then process it with EvaluateJsonPath.

From time to time the queue between the processors gets stuck with data, and nothing happens until EvaluateJsonPath or the whole canvas is restarted (right click -> stop -> start):

Screenshot 2025-08-11 at 12.06.53.png

Connection settings:

  • Back Pressure Threshold: 10,000
  • Size Threshold: 1 GB
  • LB Strategy: Round Robin

The queue threshold is exceeded only on the 3rd cluster node:

Screenshot 2025-08-11 at 16.33.52.png

 

Additional configuration data:

  • Maximum timer driven thread count: 400
  • 112 logical cores per node (2 CPUs, 28 physical cores each, 112 threads with hyperthreading)

As soon as I stop and start the canvas, everything works fine again. Why does this happen? How can I find the reason?

3 REPLIES

Master Mentor

@asand3r 

Here are my observations from what you have shared:

  1. You appear to be having a load balancing issue. The LB icon on the connection indicates it is actively trying to load balance FlowFiles in that connection: MattWho_0-1755000276724.png When load balancing is complete, the icon will look like this: MattWho_1-1755000344777.png. The shared queue counts show that one node has reached the queue threshold, which prevents it from receiving any more FlowFiles, including FlowFiles from the other nodes. So I assume the ~4,600 FlowFiles on the first 2 nodes are destined for that third node but cannot be sent because of the queue threshold.
  2. Considering the observation above, I would focus your attention on the node with the queue threshold exceeded. Maybe disconnect it from the cluster via the cluster UI and inspect the flow on that node directly. Check the logs on that third node for any reported ERROR or WARN messages.
  3. Perhaps the EvaluateJsonPath processor on node 3 alone is having issues? Maybe connectivity between the first two nodes and the third node is broken: node 3 can't distribute any FlowFiles to nodes 1 or 2, and nodes 1 and 2 can't distribute to node 3. Maybe a sync issue happened and node 3 for some reason has the processor stopped, which would explain why stopping and starting the canvas gets things moving again. If you disconnect only node 3 (the one with the queue threshold exceeded) and then reconnect it to the cluster, do FlowFiles start moving? At reconnection, node 3 will compare its local flow with the cluster flow. If you remove the LB connection configuration, do FlowFiles get processed?
  4. I am curious why this flow design has a LB connection after the ConsumeKafka processor. This processor creates a consumer group and should be configured according to the number of partitions on the target topic to maximize throughput and prevent rebalancing. Let's say your topic has 15 partitions, for example. Your ConsumeKafka processor would then be configured with 5 concurrent tasks: 3 nodes x 5 tasks = 15 consumers in the consumer group, with each consumer assigned to a partition. This spreads the consumption across all your nodes, removing the need for the load balanced connection configuration.
  5. You are using a rather old version of Apache NiFi (~4 years old). I'd encourage you to upgrade to take advantage of many bug fixes, improvements, and security fixes.
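The sizing rule in point 4 can be sketched as a quick back-of-envelope calculation. This is a minimal illustration of the arithmetic only; the function names are hypothetical and not part of any NiFi or Kafka API:

```python
# Sketch of the ConsumeKafka consumer-group sizing rule (illustrative only).

def consumers_in_group(nodes: int, concurrent_tasks: int) -> int:
    """Each cluster node runs one ConsumeKafka instance with the same
    concurrent-task count, so group size = nodes * concurrent tasks."""
    return nodes * concurrent_tasks

def tasks_for_full_coverage(partitions: int, nodes: int) -> int:
    """Concurrent tasks per node so that every partition gets its own
    consumer. Assumes the partition count is a multiple of the node count."""
    if partitions % nodes != 0:
        raise ValueError("pick a partition count that is a multiple of the node count")
    return partitions // nodes

# The example from point 4: a 15-partition topic on a 3-node cluster.
print(tasks_for_full_coverage(15, 3))  # 5 concurrent tasks per node
print(consumers_in_group(3, 5))        # 15 consumers, one per partition
```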

 

Please help our community grow. If you found any of the suggestions/solutions provided helped you with solving your issue or answering your question, please take a moment to login and click "Accept as Solution" on one or more of them that helped.

Thank you,
Matt

Explorer

Thanks, @MattWho, for your points about load balancing. The 3rd node really did have network connection issues at that time, so maybe that explains it. For now everything works fine, so I cannot perform the test steps you suggested.

But I don't fully get your point about LB after ConsumeKafka.
If load balancing is enabled on the queue between ConsumeKafka and EvaluateJsonPath, I can see in Data Provenance that data is distributed across all cluster nodes (see the screenshot below), but if I disable it, only one node appears there:

asand3r_0-1755523440422.png

Is my configuration with Round Robin LB wrong?

 

Master Mentor

@asand3r 

The question is: how many partitions does the target Kafka topic have?
If it only has 1 partition, then only one node in the ConsumeKafka consumer group is going to consume all the messages. Since you are saying that when LB is disabled on the connection the queue shows all FlowFiles on one node, that tells me you have just one partition.

For optimal throughput you would want the partition count to be some multiple of the number of nodes.

With 3 NiFi nodes, you would want 3, 6, 9, etc. partitions. With 3 partitions and 1 concurrent task set on your ConsumeKafka, you will have 3 consumers in the consumer group, and each node will consume from one of those partitions. If you have 6 partitions and 2 concurrent tasks set on the ConsumeKafka processor, you will have 6 consumers (3 nodes x 2 concurrent tasks) in your consumer group, so each node's ConsumeKafka can concurrently pull from 2 partitions.
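The effect described above can be sketched with a toy round-robin assignment. This is not NiFi or Kafka code; it is a hypothetical illustration of why a single-partition topic leaves one node holding all the data:

```python
# Toy model: spread partitions over the consumers in a group, round-robin,
# roughly mimicking how Kafka assigns partitions to consumers.

def assign_partitions(partitions: int, nodes: int, tasks_per_node: int) -> dict:
    """Return {consumer_name: [partition, ...]} for a cluster-wide group."""
    consumers = [f"node{n+1}-task{t+1}"
                 for n in range(nodes) for t in range(tasks_per_node)]
    assignment = {c: [] for c in consumers}
    for p in range(partitions):
        assignment[consumers[p % len(consumers)]].append(p)
    return assignment

# 1 partition, 3 nodes: only one consumer (one node) receives any data,
# matching the all-FlowFiles-on-one-node symptom when LB is disabled.
print(assign_partitions(1, nodes=3, tasks_per_node=1))

# 6 partitions, 3 nodes, 2 concurrent tasks: every consumer gets one partition.
print(assign_partitions(6, nodes=3, tasks_per_node=2))
```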

So while your LB connection allows FlowFiles consumed by one node to be redistributed, it is not the optimal setup here. A LB connection does not change how the processor functions; it only operates on the FlowFiles output by the feeding processor. LB connection settings are most commonly used on the downstream connection of processors that are typically scheduled on the primary node only (ListFile, ListSFTP, etc.).


Thank you,
Matt