After running python /usr/lib/python2.6/site-packages/ambari_agent/HostCleanup.py on the Ambari Metrics server,
I got every service running again except the ambari-metrics-collector.
The process is running, but two alerts remain:
Metrics Collector Process - Connection failed: [Errno 111] Connection refused to XXXXX:6188
Metrics Collector - HBase Master Process - Connection failed: [Errno 111] Connection refused to XXXXXX:61310
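Both alerts are plain TCP connection failures, so a first check is whether anything is listening on those ports at all. Here is a minimal Python 3 sketch, assuming it is run on the collector host itself. `port_open` is a hypothetical helper, not part of Ambari, and "localhost" stands in for the masked host name from the alerts.

```python
import socket


def port_open(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds, False otherwise."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers "Connection refused" (errno 111), timeouts and unresolved hosts.
        return False


if __name__ == "__main__":
    # Ports taken from the two alerts above; the host name is an assumption.
    for port in (6188, 61310):
        state = "open" if port_open("localhost", port) else "refused/unreachable"
        print("localhost:%d -> %s" % (port, state))
```

If both ports report refused while the collector process is up, the process has probably not finished starting its listeners, which matches the ZooKeeper reconnect loop in the log below.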
ERROR:
0x15741734f740003, negotiated timeout = 120000
07:57:42,262 INFO [main] ZooKeeperRegistry:107 - ClusterId read in ZooKeeper is null
07:57:42,341 WARN [main] HeapMemorySizeUtil:55 - hbase.regionserver.global.memstore.upperLimit is deprecated by hbase.regionserver.global.memstore.size
07:58:13,170 INFO [main-SendThread(localhost:61181)] ClientCnxn:1142 - Unable to read additional data from server sessionid 0x15741734f740001, likely server has closed socket, closing socket connection and attempting reconnect
07:58:13,170 INFO [main-SendThread(localhost:61181)] ClientCnxn:1142 - Unable to read additional data from server sessionid 0x15741734f740003, likely server has closed socket, closing socket connection and attempting reconnect
07:58:14,381 INFO [main-SendThread(localhost:61181)] ClientCnxn:1019 - Opening socket connection to server localhost/127.0.0.1:61181. Will not attempt to authenticate using SASL (unknown error)
07:58:14,382 WARN [main-SendThread(localhost:61181)] ClientCnxn:1146 - Session 0x15741734f740001 for server null, unexpected error, closing socket connection and attempting reconnect
java.net.ConnectException: Connection refused
at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:717)
at org.apache.zookeeper.ClientCnxnSocketNIO.doTransport(ClientCnxnSocketNIO.java:361)
at org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:1125)
07:58:14,961 INFO [main-SendThread(localhost:61181)] ClientCnxn:1019 - Opening socket connection to server localhost/127.0.0.1:61181. Will not attempt to authenticate using SASL (unknown error)
07:58:14,961 WARN [main-SendThread(localhost:61181)] ClientCnxn:1146 - Session 0x15741734f740003 for server null, unexpected error, closing socket connection and attempting reconnect
java.net.ConnectException: Connection refused
SYMPTOM:
Java garbage collection (GC) pauses occur frequently in the AMS master process during the same time frame in which these errors are observed.
To verify this:
1. Review /var/log/ambari-metrics-collector/hbase-ams-master-<hostname>.log
2. Check whether messages like the following are printed often and whether XXXms is larger than a few hundred ms:
"[JvmPauseMonitor] util.JvmPauseMonitor: Detected pause in JVM or host machine (eg GC): pause of approximately XXXms"
3. Also review gc.log-YYYYMMDDHHmm to find out which Java memory area is causing the slowness.
The default location of this file is /var/log/ambari-metrics-collector/
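The check in step 2 can be automated. The following is a minimal Python 3 sketch, not part of Ambari. `find_long_pauses` is a hypothetical helper, the 500 ms threshold is an arbitrary assumption, and the regular expression relies only on the message text quoted in step 2.

```python
import re
import sys

# Matches the JvmPauseMonitor message quoted in step 2 and captures the duration.
PAUSE_RE = re.compile(
    r"Detected pause in JVM or host machine.*?pause of approximately (\d+)ms"
)


def find_long_pauses(lines, threshold_ms=500):
    """Return (line_number, pause_ms) for every pause longer than threshold_ms."""
    hits = []
    for number, line in enumerate(lines, start=1):
        match = PAUSE_RE.search(line)
        if match and int(match.group(1)) > threshold_ms:
            hits.append((number, int(match.group(1))))
    return hits


if __name__ == "__main__" and len(sys.argv) == 2:
    # Pass the path of the hbase-ams-master log as the only argument.
    with open(sys.argv[1], errors="replace") as log:
        for number, pause_ms in find_long_pauses(log):
            print("line %d: pause of approximately %dms" % (number, pause_ms))
```

Many hits whose timestamps cluster around the connection errors would support the GC-pause explanation. Few or no hits suggest looking elsewhere.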
RESOLUTION:
Increase the AMS HBase heap sizes as follows:
1. Identify the current heap size by checking the Java process settings [-Xmx, -Xmn, -XX:MaxPermSize] by running the following:
ps auxwww | grep 'org.apache.hadoop.hbase.master.HMaster start'
2. Check the free memory in the system by running the following:
free -t
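The heap settings can be pulled out of the `ps` output from step 1 without scanning the long command line by eye. This is a hedged Python 3 sketch: `parse_heap_flags` is a hypothetical helper, and the command line in the usage note is invented for illustration, not taken from a real AMS host.

```python
import re

# Multipliers for the size suffixes accepted by the JVM (no suffix means bytes).
UNITS = {"": 1, "k": 1024, "m": 1024 ** 2, "g": 1024 ** 3}

# Matches -Xmx<size>, -Xmn<size> and -XX:MaxPermSize=<size> as whole arguments.
FLAG_RE = re.compile(r"(?<!\S)-(Xmx|Xmn|XX:MaxPermSize=)(\d+)([kKmMgG]?)(?!\S)")


def parse_heap_flags(cmdline):
    """Return a dict mapping each flag name found to its size in bytes."""
    sizes = {}
    for flag, number, unit in FLAG_RE.findall(cmdline):
        # If a flag appears more than once, the last occurrence is kept.
        sizes[flag.rstrip("=")] = int(number) * UNITS[unit.lower()]
    return sizes
```

For example, `parse_heap_flags("java -Xmn128m -Xmx1024m -XX:MaxPermSize=128m org.apache.hadoop.hbase.master.HMaster start")` returns `{"Xmn": 134217728, "Xmx": 1073741824, "XX:MaxPermSize": 134217728}`. These values can then be compared against the free memory reported by `free -t`.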
If the server has enough free memory, increase hbase_master_heapsize and/or the following, based on the GC type identified from gc.log: