Created on 05-12-2026 03:07 PM - edited 05-12-2026 10:31 PM
We have NiFi 2.5.0 and a problem whereby some files seem to get "stuck" after leaving a 3-node cluster, in a site-to-site connection queue after the cluster.
We ingest 60,000 small files out of a series of folders that are nested to a depth of 10 folders.
Most files seem to move through the cluster OK, but a few seem to enter the cluster fine and then have a problem leaving it. It's different files each time; they sit there for 30-60 minutes and eventually move on.
The network and compute resources are fine, and the memory that NiFi is using is about 47% of the JVM maximum. Disk space doesn't seem to be an issue either. I had a look through the nifi-app.log file and nothing shows up as an issue. Cluster heartbeat averages 10 milliseconds, and GC runs take about 50 milliseconds every 30 seconds.
Would welcome some suggestions.
Thank you in advance.
Created 05-13-2026 04:48 AM
@zzzz77
Sorry to hear you are having challenges with your dataflow.
Can you clarify "site-to-site connection queue after the cluster" as this is not very clear.
Thank you,
Matt
Created on 05-14-2026 04:32 PM - edited 05-14-2026 07:10 PM
Hi Matt
Sorry it's taken a few days to post back; we've been running tests and observing the behaviour of the clusters.
So we have NiFi 2.5.0 on a 3-node cluster on Ubuntu, and we run ZooKeeper as the cluster manager. We have a client Windows 11 machine with NiFi 2.5.0 on it that sends data files to the cluster. The cluster processes them and sends acknowledgement files (one per data file) back via NiFi to the Windows machine. It's at this point where the acknowledgement files get stuck between the cluster and the client machine.
We've done belt-and-braces baselining of ZooKeeper: looking at logs, making sure NTP had node times correct, checking memory, and looking for errors in logs. That came up clean.
The NiFi cluster is also 3-node, and what we did last night was delete the S2S output port, which seemed to fix it. But only temporarily: the files are getting stuck again this morning. This port connects to a downstream Windows machine with NiFi 2.5.0 on it.
We have the port opened right up to 100,000 files and 1TB for back pressure. Penalty is 30 seconds. Traffic seems to flow reasonably evenly between all 3 nifi nodes.
What we have observed, and I don't know if it's related, is that in our non-clustered lab setup we saw slow site-to-site between Docker containers (2 containers on the same Ubuntu machine), so we had to install NiFi onto Linux directly, which seemed to fix it. Then it slowed down again, so we deleted everything on the canvas and re-created it by importing the canvas backup JSON file we took before wiping everything. It then ran OK and seemed happy enough, with decent throughput at the levels we would expect.
We also observed files would "disappear" when we had 2 downstream sites connected to the same output port. We think there is a not-so-"round robin" happening, whereby some files move to one downstream site and others go to a second site, but not in any particular sequence, which makes them look like they have "disappeared" on one site.
We still have files being stuck again. I'm just keen to get to the bottom of it. 🙂
Created 05-15-2026 09:15 AM
@zzzz77
Still not clear on the entire workflow you have going on between your Windows NiFi and 3 node NiFi Cluster.
1. "where the acknowledgement files get stuck between the cluster and the client machine" - it is not clear exactly where this is. Are they queued in some connection on the NiFi canvas, and if so, between which two components? Can you share screenshots of your dataflow setup showing where they get stuck? Is back pressure being applied to any of the connections?
2. "We also observed files would "disappear" when we had 2 downstream sites connected to the same output port" - please elaborate here. It is common to have multiple RPGs connected to remote ports (if the RPG is on a 3-node NiFi cluster, then you have 3 RPGs attempting to pull data from the target Remote Output Port). This is no different than if you had three non-clustered standalone NiFi instances all connected to the same Remote Output Port (all are trying to connect constantly and get FlowFiles from the port). I have not observed FlowFiles "disappear" with an RPG. Did you use NiFi's built-in Provenance to search for a "disappeared" FlowFile using the lineage?
I would recommend not building your flow around "Remote Output Ports". There is no good load distribution happening with "Remote Output Ports". The "Remote Process Group" (RPG) is the client in all Site-To-Site transfers.
When you add and configure an RPG, it connects to the first target URL configured and fetches the Site-to-Site (S2S) details for that target cluster. While you can configure multiple URLs in a comma-separated list in the RPG, it only attempts the next configured URL if the first is not reachable. Those S2S details contain info from the target NiFi instance/cluster (including but not limited to: number of nodes in the cluster, supported protocol, S2S ports, individual node load, etc.). These details allow the RPG to create a distribution plan.
Let's assume the target is a 3-node cluster where node 1 has 10,000 queued FlowFiles, node 2 has 5,000, and node 3 has 5,000. Since node 1 has a higher load, the RPG would try to make sure nodes 2 and 3 get more FlowFiles sent to them. So distribution might result in transfers done in an order like this: (node 1, node 2, node 3, node 2, node 3, repeat from beginning). You will notice that with each iteration it sends twice to nodes 2 and 3 and only once to node 1. The RPG has no round-robin configuration. But even with the above, there is no guarantee of an even or close to even load distribution. Under continuous dataflow load, you will see pretty good distribution, but the fewer the FlowFiles, the less evenly distributed it can become.
The article below covers the settings that can help improve the distribution of FlowFiles across nodes via an RPG.
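As a rough illustration of the 10,000 / 5,000 / 5,000 example above, here is a minimal Python sketch. It is not NiFi's actual implementation; the function name `build_plan` and the weighting rule (heaviest node gets one slot, lighter nodes get proportionally more) are assumptions made only to reproduce the ordering described.

```python
def build_plan(queued):
    """Return one iteration of a send order where less loaded nodes appear more often.

    `queued` maps a node name to its queued FlowFile count.
    Illustrative only: the weighting rule is an assumption, not NiFi's algorithm.
    """
    max_load = max(queued.values())
    # The most loaded node gets a weight of 1; lighter nodes get more slots.
    weights = {
        node: max(1, round(max_load / max(count, 1)))
        for node, count in queued.items()
    }
    plan = []
    remaining = dict(weights)
    # Interleave nodes so the slots are spread out rather than sent in bursts.
    while any(remaining.values()):
        for node in queued:
            if remaining[node] > 0:
                plan.append(node)
                remaining[node] -= 1
    return plan


plan = build_plan({"node1": 10000, "node2": 5000, "node3": 5000})
print(plan)  # ['node1', 'node2', 'node3', 'node2', 'node3']
```

With equal queue counts on every node, this sketch degenerates to a plain one-slot-each rotation, which is why a balanced target tends to look evenly fed under continuous load.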
https://community.cloudera.com/t5/Community-Articles/How-to-achieve-better-load-balancing-using-NiFi...
From the above article you will learn what configurations exist when "sending" FlowFiles to a Remote Input Port. But Output Ports are different. The RPG (client) is still the one connecting to the "output port" of your target NiFi instance or cluster. Let's say the RPG is on your 3-node cluster; that means each node in the 3-node cluster has its own copy of the RPG executing, so each node polls the output port to fetch FlowFiles. There are no controls to limit how much data one node's RPG may pull. It simply connects and pulls whatever is currently available, based on the output port config settings on the RPG, but there is no distribution model since it is a pull. As soon as it finishes, it will attempt again. So you have less control over distribution when using Remote Output Ports.
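To make the pull behaviour above more concrete, here is a small Python simulation. It is a toy model under stated assumptions, not NiFi code: `simulate_pull` and `batch_size` are hypothetical names, `batch_size` stands in for whatever batch limit the RPG's output port settings impose, and a seeded random choice stands in for "whichever node's RPG happens to connect next".

```python
import random
from collections import Counter, deque


def simulate_pull(num_flowfiles, nodes, batch_size, seed=0):
    """Toy model: each connecting node drains up to `batch_size` FlowFiles.

    There is no distribution plan here; the node that connects simply
    takes what is available, then the next connection does the same.
    """
    rng = random.Random(seed)
    queue = deque(range(num_flowfiles))
    received = Counter({node: 0 for node in nodes})
    while queue:
        node = rng.choice(nodes)  # assumption: connection order is arbitrary
        take = min(batch_size, len(queue))
        for _ in range(take):
            queue.popleft()
        received[node] += take
    return received


nodes = ["node1", "node2", "node3"]
# Few FlowFiles relative to the batch size: one node takes everything.
print(simulate_pull(50, nodes, batch_size=100))
# Many FlowFiles: the spread is usually better, but still not guaranteed even.
print(simulate_pull(60000, nodes, batch_size=100))
```

In this model, whenever the queue holds fewer FlowFiles than one batch, a single node receives all of them, which can look like files "disappearing" from the point of view of the other pullers.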
Please help our community grow. If you found any of the suggestions/solutions provided helped you with solving your issue or answering your question, please take a moment to login and click "Accept as Solution" on one or more of them that helped.
Thank you,
Matt