This issue cropped up again, though this time I'm running CDH 5.6 on a 2-node vSphere ESXi-based cluster. Each node has 20 GB of memory. Also different this time: restarting the Management Service was successful.
This happened after I finally got everything installed and started yesterday afternoon (a 3-node config failed multiple times to start all the services); this morning all the monitoring services had stopped.
What could cause this?
What action triggered the stack trace? It comes from deep within Spring, which suggests a system-level issue such as running out of memory. A few things to check:
- server log (/var/log/cloudera-scm-server/cloudera-scm-server.log)
- management daemon logs (/var/log/cloudera-scm-firehose/*.log)
- "Hosts" -> "All Hosts" for memory pressure. The "Resources" tab of an individual host's page may help as well
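If it helps, the log check above can be scripted. This is only a rough sketch: it assumes the logs are plain text readable by the user running it, and that an out-of-memory condition shows up as the standard JVM string "java.lang.OutOfMemoryError". The function name find_oom_lines is made up for this example.

```python
import glob

# Standard JVM exception name; assumed to appear verbatim in the logs.
OOM_MARKER = "java.lang.OutOfMemoryError"


def find_oom_lines(patterns, marker=OOM_MARKER):
    """Return (path, line_number, text) for every log line containing marker."""
    hits = []
    for pattern in patterns:
        for path in sorted(glob.glob(pattern)):
            try:
                with open(path, errors="replace") as handle:
                    for number, line in enumerate(handle, start=1):
                        if marker in line:
                            hits.append((path, number, line.rstrip()))
            except OSError:
                # Unreadable file (permissions, rotated away): skip it.
                continue
    return hits


if __name__ == "__main__":
    # Log locations taken from the list above.
    log_patterns = [
        "/var/log/cloudera-scm-server/cloudera-scm-server.log",
        "/var/log/cloudera-scm-firehose/*.log",
    ]
    for path, number, text in find_oom_lines(log_patterns):
        print(f"{path}:{number}: {text}")
```

No output means the marker was not found in any readable file, which does not by itself rule out memory pressure on the host.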
I took a brief look at your issue. If Host Monitor is not running, check its stderr.log and stdout.log for signs that it is running out of heap (in which case the default behavior is to shut down the service).
In Cloudera Manager, go to Clusters --> Cloudera Management Service --> (under Status Summary) Host Monitor --> Processes
Click on the "stdout" link to open the stdout for the process. Look for OutOfMemoryError (OOME) entries there.
Regardless of how much RAM this host has, the Host Monitor can still throw an OOME if its max heap is set too low.
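To see what max heap a JVM was actually started with, one option is to read the -Xmx flag off its command line. The sketch below assumes you have already copied the Java command line into a string (for example from ps output); the helper name max_heap_bytes is hypothetical, and it only handles the common k/m/g suffixes.

```python
import re

# Multipliers for the size suffixes handled here (assumption: k/m/g only).
_UNITS = {"": 1, "k": 1024, "m": 1024 ** 2, "g": 1024 ** 3}

# Matches a standalone -Xmx<number><unit> token in a command line.
_XMX = re.compile(r"(?:^|\s)-Xmx(\d+)([kKmMgG]?)(?=\s|$)")


def max_heap_bytes(command_line):
    """Return the -Xmx value in bytes, or None if the flag is absent.

    If the flag appears more than once, the last occurrence is used,
    on the assumption that the JVM lets later options override earlier ones.
    """
    matches = _XMX.findall(command_line)
    if not matches:
        return None
    digits, unit = matches[-1]
    return int(digits) * _UNITS[unit.lower()]


if __name__ == "__main__":
    # Made-up command line, for illustration only.
    example = "java -Xms64m -Xmx256m -cp example.jar Main"
    print(max_heap_bytes(example))
```

A value far below the host's 20 GB would be consistent with the point above: the heap limit, not the physical RAM, is what the process runs into.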