
NiFi stuck data in queue between processors

Explorer

Hey, community!
We have a NiFi 1.18.0 cluster (still, yes, sorry) and the following issue has been on my mind.

It's a simple flow where we read data from Kafka with a ConsumeKafka processor and then process it with EvaluateJsonPath.

From time to time the queue between the processors gets stuck with data, and nothing happens until EvaluateJsonPath or the whole canvas is restarted (right click -> stop -> start):

Screenshot 2025-08-11 at 12.06.53.png

Connection settings:

  • Back Pressure Threshold: 10,000
  • Size Threshold: 1 GB
  • LB Strategy: Round Robin

The queue threshold is exceeded only on the 3rd cluster node:

Screenshot 2025-08-11 at 16.33.52.png

 

Additional configuration data:

  • Maximum timer driven thread count: 400
  • 112 logical cores per node (2 CPUs, 28 physical cores each, 112 threads with hyperthreading)

As soon as I stop and start the canvas, everything works fine again. Why does this happen? How can I find the reason?

3 REPLIES

Master Mentor

@asand3r 

Here are my observations from what you have shared:

  1. You appear to be having a load balancing issue. The LB icon on the connection indicates it is actively trying to load balance FlowFiles in that connection: MattWho_0-1755000276724.png When load balancing is complete, the icon will look like this: MattWho_1-1755000344777.png. The shared queue counts show that one node has reached the queue threshold, which prevents it from receiving any more FlowFiles, including FlowFiles from the other nodes. So I assume the ~4,600 FlowFiles on the first 2 nodes are destined for that third node but cannot be sent because of the queue threshold.
  2. Considering the observation above, I would focus your attention on the node with the queue threshold exceeded. Maybe disconnect it from the cluster via the cluster UI and inspect the flow on that node directly. Check the logs on that third node for any reported ERROR or WARN messages.
  3. Perhaps the EvaluateJsonPath processor on node 3 alone is having issues? Maybe connectivity between the first two nodes and the third node is broken: node 3 can't distribute any FlowFiles to nodes 1 or 2, and nodes 1 and 2 can't distribute to node 3. Maybe a sync issue happened and node 3 for some reason has the processor stopped, which would explain why stopping and starting the canvas gets things moving again. If you disconnect only node 3 (the one with the queue threshold exceeded) and then reconnect it to the cluster, do FlowFiles start moving? At reconnection, node 3 will compare its local flow with the cluster flow. If you remove the LB connection configuration, do FlowFiles get processed?
  4. I am curious why this flow design has a LB connection after the ConsumeKafka processor. This processor creates a consumer group and should be configured according to the number of partitions on the target topic to maximize throughput and prevent rebalancing. Let's say your topic has 15 partitions, for example. Your ConsumeKafka processor would then be configured with 5 concurrent tasks: 3 nodes x 5 tasks = 15 consumers in the consumer group, with each consumer assigned to a partition. This spreads the consumption across all your nodes, removing the need for the load balanced connection configuration.
  5. You are using a rather old version of Apache NiFi (~4 years old). I'd encourage you to upgrade to take advantage of many bug fixes, improvements, and security fixes.
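The sizing rule in point 4 can be sketched as a quick back-of-envelope calculation. This is a minimal illustration of the arithmetic only; the function names are hypothetical and not part of any NiFi or Kafka API:

```python
# Sketch of the ConsumeKafka consumer-group sizing rule (illustrative only).

def consumers_in_group(nodes: int, concurrent_tasks: int) -> int:
    """Each cluster node runs one ConsumeKafka instance with the same
    concurrent-task count, so group size = nodes * concurrent tasks."""
    return nodes * concurrent_tasks

def tasks_for_full_coverage(partitions: int, nodes: int) -> int:
    """Concurrent tasks per node so that every partition gets its own
    consumer. Assumes the partition count is a multiple of the node count."""
    if partitions % nodes != 0:
        raise ValueError("pick a partition count that is a multiple of the node count")
    return partitions // nodes

# The example from point 4: a 15-partition topic on a 3-node cluster.
print(tasks_for_full_coverage(15, 3))  # 5 concurrent tasks per node
print(consumers_in_group(3, 5))        # 15 consumers, one per partition
```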

 

Please help our community grow. If you found any of the suggestions/solutions provided helped you with solving your issue or answering your question, please take a moment to login and click "Accept as Solution" on one or more of them that helped.

Thank you,
Matt

Explorer

Thanks, @MattWho, for your points about load balancing. The 3rd node really did have network connection issues at that time, so maybe that explains it. For now everything works fine, so I cannot perform the test steps you suggested.

But I don't fully get your point about LB after ConsumeKafka.
If load balancing is enabled on the queue between ConsumeKafka and EvaluateJsonPath, I can see in Data Provenance that data is distributed across all cluster nodes (see the screenshot below), but if I disable it, only one node appears there:

asand3r_0-1755523440422.png

Is my configuration with Round Robin LB wrong?

 

Master Mentor

@asand3r 

The question is: how many partitions does the target Kafka topic have?
If it only has 1 partition, then only one node in the ConsumeKafka consumer group is going to consume all the messages. Since you are saying that when LB is disabled on the connection the queue shows all FlowFiles on one node, that tells me you have just one partition.

For optimal throughput you would want the partition count to be some multiple of the number of nodes.

With 3 NiFi nodes, you would want 3, 6, 9, etc. partitions. With 3 partitions and 1 concurrent task set on your ConsumeKafka, you will have 3 consumers in the consumer group, and each node will consume from one of those partitions. If you have 6 partitions and 2 concurrent tasks set on the ConsumeKafka processor, you will have 6 consumers (3 nodes x 2 concurrent tasks) in your consumer group, so each node's ConsumeKafka can concurrently pull from 2 partitions.
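The effect described above can be sketched with a toy round-robin assignment. This is not NiFi or Kafka code; it is a hypothetical illustration of why a single-partition topic leaves one node holding all the data:

```python
# Toy model: spread partitions over the consumers in a group, round-robin,
# roughly mimicking how Kafka assigns partitions to consumers.

def assign_partitions(partitions: int, nodes: int, tasks_per_node: int) -> dict:
    """Return {consumer_name: [partition, ...]} for a cluster-wide group."""
    consumers = [f"node{n+1}-task{t+1}"
                 for n in range(nodes) for t in range(tasks_per_node)]
    assignment = {c: [] for c in consumers}
    for p in range(partitions):
        assignment[consumers[p % len(consumers)]].append(p)
    return assignment

# 1 partition, 3 nodes: only one consumer (one node) receives any data,
# matching the all-FlowFiles-on-one-node symptom when LB is disabled.
print(assign_partitions(1, nodes=3, tasks_per_node=1))

# 6 partitions, 3 nodes, 2 concurrent tasks: every consumer gets one partition.
print(assign_partitions(6, nodes=3, tasks_per_node=2))
```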

So while your LB connection allows FlowFiles consumed by one node to be redistributed, it is not the optimal setup here. A LB connection does not change how the processor functions; it only operates on the FlowFiles output by the feeding processor. LB connection settings are most commonly used on the downstream connection of processors that are typically scheduled on the primary node only (ListFile, ListSFTP, etc.).


Thank you,
Matt