Created 08-12-2025 02:54 AM
Hey, community!
We have a NiFi 1.18.0 cluster (yes, still, sorry) and the following issue has been on my mind.
It's a simple flow: we read data from Kafka with a ConsumeKafka processor and then process it with EvaluateJsonPath.
From time to time the queue between the processors gets stuck with data, and nothing happens until EvaluateJsonPath or the whole canvas is restarted (right click -> stop -> start):
Connection settings:
Queue threshold exceeded only on 3rd cluster node:
Additional configuration data:
So, as soon as I stop and start the canvas, everything works fine again. Why does this happen? How can I find the cause?
Created 08-12-2025 05:29 AM
@asand3r
Here are my observations from what you have shared:
Please help our community grow. If you found any of the suggestions/solutions provided helped you with solving your issue or answering your question, please take a moment to login and click "Accept as Solution" on one or more of them that helped.
Thank you,
Matt
Created 08-18-2025 06:29 AM
Thanks, @MattWho, for your points about Load Balance. The 3rd node really did have network connection issues at that time, so that may be a factor. For now everything works fine, so I cannot perform the test steps you suggested.
But I don't fully get your point about LB after ConsumeKafka.
If load balancing is enabled on the queue between ConsumeKafka and EvaluateJsonPath, I can see in Data Provenance that data is distributed across all cluster nodes (see the screenshot below), but if I disable it, only one node appears there:
Is my configuration with Round Robin LB wrong?
Created 08-26-2025 05:45 AM
The question is how many partitions does the target Kafka topic have?
If it only has 1 partition, then only one node in the ConsumeKafka consumer group is going to consume all the messages. Since you are saying that when LB on the connection is disabled the queue shows all FlowFiles on one node, that tells me you have just one partition.
For optimal throughput you would want some multiple of the number of nodes as the partition count.
With 3 NiFi nodes, you would want 3, 6, 9, etc. partitions. With 3 partitions and 1 concurrent task set on your ConsumeKafka, you will have 3 consumers in the consumer group. Each node will consume from one of those partitions. If you have 6 partitions and 2 concurrent tasks set on the ConsumeKafka processor, you will have 6 consumers (3 nodes x 2 concurrent tasks) in your consumer group. So each node's ConsumeKafka will be able to concurrently pull from 2 partitions.
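The arithmetic above can be sketched in a few lines. This is a simplified model, assuming round-robin style assignment; the real assignment is performed by Kafka's group coordinator using the configured assignor, not by client code like this:

```python
# Sketch: round-robin partition assignment across a consumer group.
# Assumption: simplified model of Kafka's RoundRobinAssignor; the actual
# assignment is decided by the Kafka group coordinator.

def assign_partitions(num_partitions: int, consumers: list) -> dict:
    """Distribute partition numbers round-robin across consumer names."""
    assignment = {c: [] for c in consumers}
    for p in range(num_partitions):
        assignment[consumers[p % len(consumers)]].append(p)
    return assignment

# 3 NiFi nodes, 1 concurrent task each -> 3 consumers for 3 partitions:
three = assign_partitions(3, [f"node{n}-task1" for n in range(1, 4)])
print(three)

# 3 nodes, 2 concurrent tasks -> 6 consumers; with 6 partitions, one each:
six = assign_partitions(
    6, [f"node{n}-task{t}" for n in range(1, 4) for t in range(1, 3)]
)
print(six)
```

With fewer partitions than consumers, some consumers in this model would simply receive an empty list, which mirrors Kafka consumers sitting idle.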
So while your LB allows for redistribution of FlowFiles consumed by one node, it is not the optimal setup here. An LB connection will not change how the processor functions; it only operates on the FlowFiles output from the feeding processor. LB connection settings are most commonly used on the downstream connections of processors that are typically scheduled on the primary node only (ListFile, ListSFTP, etc.).
Thank you,
Matt
Created 09-14-2025 01:52 AM
Thank you for such a detailed answer. It's very helpful.
Created 09-15-2025 02:59 AM
We have 30 partitions for that topic and Concurrent tasks set to 5.
Created 09-15-2025 06:36 AM
@asand3r
With your ConsumeKafka processor configured with 5 concurrent tasks and a NiFi cluster of 3 nodes, you will have 15 (3 nodes x 5 concurrent tasks) consumers in your consumer group, so Kafka will assign two partitions to each consumer in that group. Now if there are network issues, Kafka may do a rebalance and assign more partitions to fewer consumers. (Of course, the number of consumers in the consumer group changes if you have additional ConsumeKafka processors pointing at the same topic and configured with the same consumer group id.)
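The effect of a rebalance on this thread's numbers (30 partitions, 3 nodes, 5 concurrent tasks) can be sketched with a simple even-split model. This is an assumption for illustration; the actual split depends on which assignor the consumer group uses:

```python
# Sketch: how many partitions each surviving consumer owns after a rebalance.
# Assumption: an even split, which approximates Kafka's default assignors
# when all consumers subscribe to the same single topic.

def partitions_per_consumer(num_partitions: int, live_consumers: int) -> list:
    """Even split of partitions across the consumers still in the group."""
    base, extra = divmod(num_partitions, live_consumers)
    # The first 'extra' consumers get one additional partition each.
    return [base + 1 if i < extra else base for i in range(live_consumers)]

# 30 partitions, 3 nodes x 5 concurrent tasks = 15 consumers -> 2 each:
print(partitions_per_consumer(30, 15))
# If one node's 5 consumers drop out (network issue, GC pause), 10 remain -> 3 each:
print(partitions_per_consumer(30, 10))
```

The second call shows why a dropped node concentrates traffic on the remaining nodes: the same 30 partitions are redistributed over fewer consumers.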
Matt
Created 09-18-2025 03:37 AM
@MattWho wrote: "Now if there are network issues, Kafka may do a rebalance and assign more partitions to fewer consumers."
We have an issue with exactly that: JVM stop-the-world pauses. We are still using Java 8 in this cluster, so the JVM periodically freezes for about 5-15 seconds to perform GC.
Could this be the cause of the Load Balance issue?
Created 09-18-2025 08:26 AM
JVM garbage collection is stop-the-world, which prevents the Kafka clients from communicating with Kafka for the duration of the GC event. If that pause is long enough, it could cause Kafka to do a rebalance. I can't say for certain that is what you are experiencing. Maybe set the ConsumeKafka processor class to INFO level logging and monitor the nifi-app.log for any indication of rebalances happening.
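In NiFi 1.x that logging change goes in conf/logback.xml. A sketch, with the caveat that the exact processor class name depends on which ConsumeKafka variant you use (ConsumeKafka_2_6 is shown here as an assumption; check the processor's documentation tab for the real class):

```xml
<!-- conf/logback.xml (sketch; verify logger names against your NiFi version) -->

<!-- The Kafka client's coordinator logs join/rebalance events: -->
<logger name="org.apache.kafka.clients.consumer.internals.ConsumerCoordinator" level="INFO"/>

<!-- Assumed class for the ConsumeKafka_2_6 processor in NiFi 1.x: -->
<logger name="org.apache.nifi.processors.kafka.pubsub.ConsumeKafka_2_6" level="INFO"/>
```

NiFi picks up logback.xml changes after a short delay without a restart, so you can enable this while reproducing the problem and then grep nifi-app.log for rebalance messages.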
When it comes to GC pauses, a common mistake I see is individuals setting the JVM heap in NiFi way too high simply because the server on which they have installed NiFi has a lot of memory. Since GC only happens once allocated JVM memory utilization reaches around 80%, large heaps can lead to long stop-the-world pauses if there is a lot of clean-up to do.
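The heap settings live in conf/bootstrap.conf. A sketch of measuring the pauses rather than guessing, assuming the default java.arg numbering and a hypothetical log path; check your own bootstrap.conf for which java.arg.N slots are already taken before adding new ones:

```properties
# conf/bootstrap.conf (sketch; argument numbers and log path are assumptions)

# Keep min and max heap equal and only as large as actually needed:
java.arg.2=-Xms4g
java.arg.3=-Xmx4g

# Java 8 GC logging, to see real pause durations in the log:
java.arg.20=-XX:+PrintGCDetails
java.arg.21=-XX:+PrintGCDateStamps
java.arg.22=-Xloggc:/var/log/nifi/gc.log
```

With the GC log in hand you can correlate pause timestamps against the rebalance messages in nifi-app.log and confirm or rule out the GC theory.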
Thank you,
Matt