Long-time reader, first-time poster - please be gentle...
What exactly does this error message mean?
REST API (Cluster): 5,069ms (WARNING)
Clearly, it's a connection latency threshold warning, but it's not obvious to me what the endpoints are - and Google isn't being kind to me this morning.
That measures the time it takes to retrieve and serialize information about the cluster - basically what the web client is doing every few seconds. Depending on cluster size, 5 seconds might not be too outrageous. Can you specify how many hosts you have?
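If you want to get a feel for that number outside of the alert, a rough sketch like the one below times a single GET against the cluster resource. It is only an approximation of what the alert does, not the alert's actual implementation: I'm assuming the `/api/v1/clusters/<name>` endpoint is a fair stand-in for what it measures, and `time_call_ms` and `fetch_cluster` are helper names made up for this example.

```python
import base64
import time
import urllib.request


def time_call_ms(fn):
    """Run fn() and return (result, elapsed time in milliseconds)."""
    start = time.perf_counter()
    result = fn()
    elapsed_ms = (time.perf_counter() - start) * 1000.0
    return result, elapsed_ms


def fetch_cluster(base_url, cluster, user, password):
    """GET the cluster resource with HTTP basic auth and return the raw body.

    base_url, cluster, user and password are placeholders for your own
    Ambari server address, cluster name and login (assumptions, not values
    from this thread).
    """
    token = base64.b64encode(f"{user}:{password}".encode()).decode()
    request = urllib.request.Request(
        f"{base_url}/api/v1/clusters/{cluster}",
        headers={"Authorization": f"Basic {token}"},
    )
    with urllib.request.urlopen(request, timeout=30) as response:
        return response.read()
```

Something like `body, ms = time_call_ms(lambda: fetch_cluster("http://ambari.example.com:8080", "mycluster", "admin", "admin"))` (all four values hypothetical) would then give you a number to compare against the alert text.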
I think there is a bug out right now where a large number of configuration versions (like if you have 200 versions of hdfs-site) can cause this number to be artificially high.
If your cluster is running well, then you can change this threshold value to something a little higher.
We have an Ambari server + 3 "master nodes" (running things like NameNodeHA, HBase master services, and so on) + 19 "worker nodes" (HDFS data nodes, HBase region servers, YARN node managers, etc).
For HDFS, we only have one config group (default) and 26 config versions; I haven't done a full analysis, but I would be surprised if we have more than 9 or 10 hdfs-site versions in that mix.
Ambari seems to be responsive and we don't see any ridiculous or unexpected delays in page loads, including Metrics widgets.
I believe the threshold for WARNING is 5 seconds - you're just tipping over the edge of that in your post.
- Does it stay at WARNING, or does it go away?
- Historically, what has the value of that alert been in your cluster? Some clusters hover at around 1-2 seconds, others higher.
If the cluster is responsive, you can feel free to increase this value to something like 7 seconds for WARNING and 10 for CRITICAL. Typically, this value has a direct correlation to responsiveness of the UI.
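To make the effect of raising the thresholds concrete, here is a small sketch of how a measured time maps to an alert state. The 5-second WARNING default comes from this thread; the 10-second CRITICAL default, the `>=` comparison, and the function names `classify` and `format_alert` are assumptions for illustration, not Ambari's own code.

```python
def classify(elapsed_ms, warning_ms=5000, critical_ms=10000):
    """Map an elapsed time in milliseconds to OK, WARNING or CRITICAL.

    Defaults are assumptions: 5 s for WARNING (as believed in this thread)
    and 10 s for CRITICAL (illustrative only).
    """
    if elapsed_ms >= critical_ms:
        return "CRITICAL"
    if elapsed_ms >= warning_ms:
        return "WARNING"
    return "OK"


def format_alert(label, elapsed_ms, warning_ms=5000, critical_ms=10000):
    """Render a line shaped like the alert text quoted in the question."""
    state = classify(elapsed_ms, warning_ms, critical_ms)
    return f"{label}: {elapsed_ms:,.0f}ms ({state})"


# With the assumed 5 s WARNING threshold, 5,069 ms just tips over the edge.
print(format_alert("REST API (Cluster)", 5069))
# With WARNING raised to 7 s, the same measurement no longer triggers it.
print(format_alert("REST API (Cluster)", 5069, warning_ms=7000, critical_ms=10000))
```

Under these assumptions, the 5,069ms reading from the original post is a WARNING at 5 seconds and OK at 7 seconds, which is why a small bump in the threshold is enough when the UI is otherwise responsive.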