Continually getting the below error:
Response time from <localhost> was slow for each of the last 3 requests made. To see more information about timing, enable DEBUG logging for org.apache.nifi.cluster.coordination.http.replication.ThreadPoolRequestReplicator
Operating in a five-node cluster, running batch jobs daily.
What can I do to stop getting this error?
Not sure which NiFi version you are running; however, there is a configurable property, nifi.cluster.node.protocol.threads, for setting the number of node protocol threads.
The default setting is only 10. As you increase the size of your cluster, there is also a rise in the number of communications that occur between those nodes and the currently elected Cluster Coordinator.
Do you still see the issue if you change the value of this property in the nifi.properties file on all your nodes to 50?
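For reference, a minimal nifi.properties fragment with that change might look like this (property name per the NiFi Administration Guide; 50 is a starting point, not a tuned value):

```properties
# nifi.properties (set on every node) -- default is 10
nifi.cluster.node.protocol.threads=50
```

Remember that each node needs a restart for nifi.properties changes to take effect.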
You may also consider bumping up the values for the following properties as well:
nifi.cluster.node.connection.timeout=5 sec
nifi.cluster.node.read.timeout=5 sec
This may help when network bandwidth is limited during times of heavy traffic/load.
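If you do raise them, a bumped-up fragment might look like the following (30 sec is purely illustrative; tune to your network):

```properties
# nifi.properties -- defaults are 5 sec each
nifi.cluster.node.connection.timeout=30 sec
nifi.cluster.node.read.timeout=30 sec
```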
These are our current settings:
Should they be bumped up any further?
What is unique about the node that the slowness is reported for? Is it the currently elected Cluster Coordinator or Primary Node? Depending on your dataflow, it may be doing more work than your other nodes, resulting in slower response times.
Do you see high memory or CPU usage on that node?
I see you are running a NiFi version new enough that nifi.cluster.node.protocol.max.threads=50 was added. You could try bumping this up, but if your system's CPUs are already heavily utilized, it may not make much of a difference.
Are you experiencing node disconnections?
Try bumping up the connection and read timeout properties mentioned above.
Are you seeing heap issues?
While Minor (Young) Garbage Collection (GC) is healthy and normal, a lot of Full (Old) GC is not. All GC is a stop-the-world event, and Full GC takes longer to complete. You can see whether any Full GC is occurring by going into the Summary UI and clicking the "system diagnostics" link in the lower right corner.
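If you prefer scripting over the UI, the same heap and GC numbers are exposed by NiFi's REST API at the system-diagnostics endpoint. A hedged sketch (the host, port, and plain-HTTP URL are assumptions; a secured cluster needs HTTPS and authentication, and the response field names below are as I recall the schema):

```python
# Fetch NiFi's system diagnostics (heap utilization, GC counts) over REST.
# The URL is an assumption -- adjust host/port/scheme for your cluster.
import json
import urllib.request

NIFI_URL = "http://localhost:8080/nifi-api/system-diagnostics"

def fetch_diagnostics(url: str = NIFI_URL, timeout: float = 5.0):
    """Return the parsed system-diagnostics JSON, or None if unreachable."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return json.load(resp)
    except OSError:
        return None

diag = fetch_diagnostics()
if diag is None:
    print("Could not reach", NIFI_URL)
else:
    snap = diag["systemDiagnostics"]["aggregateSnapshot"]
    print("Heap:", snap["heapUtilization"])
    for gc in snap["garbageCollection"]:
        print(gc["name"], "collections:", gc["collectionCount"])
```

A steadily climbing collection count on the Old-generation collector is the signal to watch for.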
If Full GC is occurring, you may be able to improve responsiveness by bumping up the heap size settings for your NiFi instance via the bootstrap.conf file.
# JVM memory settings
java.arg.2=-Xms512m
java.arg.3=-Xmx512m
The defaults are pretty low at only 512m. Depending on memory availability on your nodes, you could push this up as well. If you have already pushed this up, you will need to work through your dataflow design and reduce your heap footprint.
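As an illustration, a bumped-up bootstrap.conf fragment might look like this (8g is an arbitrary example, not a recommendation; size it to the RAM actually free on your nodes, and keep -Xms equal to -Xmx to avoid resize pauses):

```properties
# conf/bootstrap.conf -- JVM memory settings (example values only)
java.arg.2=-Xms8g
java.arg.3=-Xmx8g
```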
Things like MergeContent (when merging a very large number (20,000+) of FlowFiles per merge) and SplitText (producing more than 20,000 FlowFiles per single input FlowFile) are common processors that result in higher heap usage. Try using two MergeContent processors in series: merge smaller batches first, then merge those generated batches again. Similarly, try using two SplitText processors to perform a two-phase split that reaches the same end result. The FlowFile attributes for every FlowFile created in a single thread by these processors are held in heap until they are all committed to a connection queue at the same time. Only once they are on a connection will the NiFi controller take care of swapping them to disk to help with heap usage.
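To make the two-phase idea concrete, here is some back-of-the-envelope arithmetic (plain Python, illustrative numbers only -- this is not NiFi code):

```python
# Compare the peak number of FlowFiles a single SplitText thread must hold
# in heap before committing, for a one-phase vs. a two-phase split.

total_lines = 100_000          # lines in the incoming FlowFile (assumed)

# One-phase: split straight to 1 line per FlowFile.
one_phase_peak = total_lines   # all 100,000 FlowFiles held until commit

# Two-phase: first split into 1,000-line chunks, then each chunk to 1 line.
chunk_size = 1_000
phase1_peak = total_lines // chunk_size   # 100 FlowFiles held at once
phase2_peak = chunk_size                  # 1,000 FlowFiles held at once
two_phase_peak = max(phase1_peak, phase2_peak)

print(one_phase_peak)   # 100000
print(two_phase_peak)   # 1000
```

The peak per-thread FlowFile count drops from 100,000 to 1,000, which is why the two-phase approach is so much gentler on heap.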
Hope the above at least gives you some direction....