
PROBLEM STATEMENT:

After running python /usr/lib/python2.6/site-packages/ambari_agent/HostCleanup.py on the Ambari Metrics server, every service comes back up except the ambari-metrics-collector: the process is running, but two alerts remain:

Metrics Collector Process - Connection failed: [Errno 111] Connection refused to XXXXX:6188

Metrics Collector - HBase Master Process - Connection failed: [Errno 111] Connection refused to XXXXXX:61310
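Whether anything is actually listening on the two alerted ports can be confirmed directly on the collector host. This is a minimal sketch, assuming the default ports from the alerts above (6188 for the collector API, 61310 for the embedded HBase master) and that you run it locally on the collector node:

```shell
#!/bin/bash
# Probe the two alerted ports on the collector host itself.
# 6188 = Metrics Collector API, 61310 = AMS embedded HBase master (per the alerts).
for port in 6188 61310; do
  if (exec 3<>"/dev/tcp/localhost/$port") 2>/dev/null; then
    echo "port $port: listening"
  else
    echo "port $port: connection refused"
  fi
done
```

If both ports are refused even though the collector process is up, the embedded HBase master has likely not finished starting, which matches the ZooKeeper errors below.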

ERROR:

0x15741734f740003, negotiated timeout = 120000
07:57:42,262  INFO [main] ZooKeeperRegistry:107 - ClusterId read in ZooKeeper is null
07:57:42,341  WARN [main] HeapMemorySizeUtil:55 - hbase.regionserver.global.memstore.upperLimit is deprecated by hbase.regionserver.global.memstore.size
07:58:13,170  INFO [main-SendThread(localhost:61181)] ClientCnxn:1142 - Unable to read additional data from server sessionid 0x15741734f740001, likely server has closed socket, closing socket connection and attempting reconnect
07:58:13,170  INFO [main-SendThread(localhost:61181)] ClientCnxn:1142 - Unable to read additional data from server sessionid 0x15741734f740003, likely server has closed socket, closing socket connection and attempting reconnect
07:58:14,381  INFO [main-SendThread(localhost:61181)] ClientCnxn:1019 - Opening socket connection to server localhost/127.0.0.1:61181. Will not attempt to authenticate using SASL (unknown error)
07:58:14,382  WARN [main-SendThread(localhost:61181)] ClientCnxn:1146 - Session 0x15741734f740001 for server null, unexpected error, closing socket connection and attempting reconnect
java.net.ConnectException: Connection refused
        at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
        at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:717)
        at org.apache.zookeeper.ClientCnxnSocketNIO.doTransport(ClientCnxnSocketNIO.java:361)
        at org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:1125)
07:58:14,961  INFO [main-SendThread(localhost:61181)] ClientCnxn:1019 - Opening socket connection to server localhost/127.0.0.1:61181. Will not attempt to authenticate using SASL (unknown error)
07:58:14,961  WARN [main-SendThread(localhost:61181)] ClientCnxn:1146 - Session 0x15741734f740003 for server null, unexpected error, closing socket connection and attempting reconnect
java.net.ConnectException: Connection refused

SYMPTOM:

Java garbage collection (GC) pauses occur frequently in the AMS HBase master process during the same time frame in which these errors are observed. To verify this:

1. Review /var/log/ambari-metrics-collector/hbase-ams-master-<hostname>.log

2. Check whether messages like the following appear often, with XXXms larger than a few hundred milliseconds: "[JvmPauseMonitor] util.JvmPauseMonitor: Detected pause in JVM or host machine (eg GC): pause of approximately XXXms"

3. Also review gc.log-YYYYMMDDHHmm to find out which Java memory area is causing the slowness.

The default location for these logs is /var/log/ambari-metrics-collector/
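The pause check in step 2 can be sketched as a short shell pipeline. This is a minimal sketch, assuming the default log location and that the log file name ends in this node's FQDN:

```shell
#!/bin/sh
# Assumed path: default AMS log directory plus this node's hostname.
LOG="/var/log/ambari-metrics-collector/hbase-ams-master-$(hostname -f).log"

# Extract every reported JVM pause duration and list the five longest.
grep -o 'pause of approximately [0-9]*ms' "$LOG" \
  | awk '{print $4}' \
  | sort -rn \
  | head -5
```

If the top values are regularly in the hundreds of milliseconds or more, GC pressure is the likely cause of the connection alerts.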

RESOLUTION:

Increase the AMS HBase heap sizes as follows:

1. Identify the current heap sizes by checking the Java process settings (-Xmx, -Xmn, -XX:MaxPermSize) with: ps auxwww | grep 'org.apache.hadoop.hbase.master.HMaster start'

2. Check the free memory in the system by running: free -t

3. If the server has enough free memory, increase hbase_master_heapsize and/or, based on the GC behavior identified from gc.log, one or both of the following:

1. hbase_master_maxperm_size

2. hbase_master_xmn_size
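Steps 1 and 2 can be combined into a quick shell check. This is a sketch only; the grep pattern for the JVM flags is illustrative and may need adjusting to match the exact command line on your node:

```shell
#!/bin/sh
# Print the heap-related JVM flags of the running AMS HBase master process.
ps auxwww \
  | grep 'org.apache.hadoop.hbase.master.HMaster start' \
  | grep -v grep \
  | grep -oE -e '-(Xmx|Xmn)[0-9]+[mg]' -e '-XX:MaxPermSize=[0-9]+[mg]'

# Show system memory, including a totals row, to judge available headroom.
free -t
```

Compare the reported -Xmx value against the free memory before raising hbase_master_heapsize, so the new heap still fits comfortably alongside the other services on the host.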
