After upgrading to CDH/CM 5.4.5, we're seeing the following health test reported as Concerning on all 24 of our SolrServer nodes:
The web server of this role is responding with metrics. The most recent collection took 50 second(s). Warning threshold: 10 second(s).
Has anyone else noticed this test in particular having issues following an upgrade? We were previously running 5.4.1 without issue.
There was a metrics-related change in CDH 5.4.5; see https://issues.apache.org/jira/browse/SOLR-7458 for more details. It seems plausible that this is the cause of the issue you are seeing. We'll investigate and get back to you.
We have been able to reproduce the issue on a large in-house cluster, but we don't have a fix yet. In the meantime, we have made the metrics changes configurable and disabled them by default. That change should be available in a future release of CDH 5.4.x and in CDH 5.5.0. I'll let you know once we have a proper fix.
As for relaxing the SLA on that test in CM: we didn't expect the metrics change to have such a large impact. I think the correct approach is to fix the cost of the metrics change itself rather than relax the SLA in CM.