Created on 11-18-2016 09:57 AM
PROBLEM STATEMENT:
After running python /usr/lib/python2.6/site-packages/ambari_agent/HostCleanup.py on the Ambari Metrics server, every service starts again except ambari-metrics-collector. The process is running, but two alerts remain:
Metrics Collector Process - Connection failed: [Errno 111] Connection refused to XXXXX:6188
Metrics Collector - HBase Master Process - Connection failed: [Errno 111] Connection refused to XXXXXX:61310
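Both alerts report a refused TCP connection, so one quick way to reproduce them outside Ambari is to try the ports directly. The following is a minimal Python 3 sketch, not the check Ambari itself runs; the helper name port_is_open and the 3-second timeout are assumptions made for illustration.

```python
import socket


def port_is_open(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds, else False."""
    try:
        # create_connection raises OSError (e.g. ConnectionRefusedError,
        # the Python form of Errno 111) when nothing is listening.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False
```

For example, port_is_open("<collector-host>", 6188) and port_is_open("<collector-host>", 61310) would both return False while the alerts above are firing.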
ERROR:
0x15741734f740003, negotiated timeout = 120000
07:57:42,262 INFO [main] ZooKeeperRegistry:107 - ClusterId read in ZooKeeper is null
07:57:42,341 WARN [main] HeapMemorySizeUtil:55 - hbase.regionserver.global.memstore.upperLimit is deprecated by hbase.regionserver.global.memstore.size
07:58:13,170 INFO [main-SendThread(localhost:61181)] ClientCnxn:1142 - Unable to read additional data from server sessionid 0x15741734f740001, likely server has closed socket, closing socket connection and attempting reconnect
07:58:13,170 INFO [main-SendThread(localhost:61181)] ClientCnxn:1142 - Unable to read additional data from server sessionid 0x15741734f740003, likely server has closed socket, closing socket connection and attempting reconnect
07:58:14,381 INFO [main-SendThread(localhost:61181)] ClientCnxn:1019 - Opening socket connection to server localhost/127.0.0.1:61181. Will not attempt to authenticate using SASL (unknown error)
07:58:14,382 WARN [main-SendThread(localhost:61181)] ClientCnxn:1146 - Session 0x15741734f740001 for server null, unexpected error, closing socket connection and attempting reconnect
java.net.ConnectException: Connection refused
    at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
    at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:717)
    at org.apache.zookeeper.ClientCnxnSocketNIO.doTransport(ClientCnxnSocketNIO.java:361)
    at org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:1125)
07:58:14,961 INFO [main-SendThread(localhost:61181)] ClientCnxn:1019 - Opening socket connection to server localhost/127.0.0.1:61181. Will not attempt to authenticate using SASL (unknown error)
07:58:14,961 WARN [main-SendThread(localhost:61181)] ClientCnxn:1146 - Session 0x15741734f740003 for server null, unexpected error, closing socket connection and attempting reconnect
java.net.ConnectException: Connection refused
SYMPTOM:
Java garbage collection (GC) pauses occur frequently in the AMS master process during the same time frame in which these errors are logged. To verify this:
1. Review /var/log/ambari-metrics-collector/hbase-ams-master-<hostname>.log
2. Check whether messages like the following are printed often and XXXms is larger than a few hundred milliseconds: "[JvmPauseMonitor] util.JvmPauseMonitor: Detected pause in JVM or host machine (eg GC): pause of approximately XXXms"
3. Also review gc.log-YYYYMMDDHHmm to find out which Java memory area is causing the slowness. The default location is /var/log/ambari-metrics-collector/
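Scanning the master log for long pauses by eye is tedious, so the check in step 2 can be sketched as a small Python script. This is only an illustration: the function name find_long_pauses and the 500 ms default threshold are assumptions, and the pattern relies on the message format quoted above.

```python
import re

# Matches the JvmPauseMonitor message quoted above and captures the pause length.
PAUSE_RE = re.compile(r"JvmPauseMonitor.*pause of approximately (\d+)ms")


def find_long_pauses(lines, threshold_ms=500):
    """Return (line_number, pause_ms) for every pause >= threshold_ms."""
    hits = []
    for number, line in enumerate(lines, start=1):
        match = PAUSE_RE.search(line)
        if match:
            pause_ms = int(match.group(1))
            if pause_ms >= threshold_ms:
                hits.append((number, pause_ms))
    return hits
```

To use it, open the hbase-ams-master-<hostname>.log file and pass the file object to find_long_pauses; many hits close to the timestamps of the ZooKeeper errors would support the GC-pause explanation.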
RESOLUTION:
Increase the AMS HBase heap sizes as follows:
1. Identify the current heap size by checking the Java process settings (-Xmx, -Xmn, -XX:MaxPermSize) by running the following:
ps auxwww | grep 'org.apache.hadoop.hbase.master.HMaster start'
2. Check the free memory on the system by running the following:
free -t
If the server has enough free memory, increase hbase_master_heapsize and/or the following, based on the GC type identified from gc.log:
1. hbase_master_maxperm_size (and/or)
2. hbase_master_xmn_size
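The ps output in step 1 is a long command line, so picking out the three memory flags can be sketched in Python as follows. The helper name parse_heap_flags is hypothetical, and taking the last occurrence of a repeated flag is an assumption about which value the JVM ends up using.

```python
import re


def parse_heap_flags(cmdline):
    """Extract -Xmx, -Xmn and -XX:MaxPermSize values from a java command line."""
    patterns = {
        "Xmx": r"-Xmx(\S+)",
        "Xmn": r"-Xmn(\S+)",
        "MaxPermSize": r"-XX:MaxPermSize=(\S+)",
    }
    flags = {}
    for name, pattern in patterns.items():
        values = re.findall(pattern, cmdline)
        # Assumption: when a flag is repeated, the last value is the effective one.
        flags[name] = values[-1] if values else None
    return flags
```

Paste the HMaster line from the ps output into parse_heap_flags to see the current values before changing hbase_master_heapsize, hbase_master_maxperm_size, or hbase_master_xmn_size.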