
ambari on HDP cluster + ambari-metrics-collector service not start

We have an issue with the ambari-metrics-collector service (HDP cluster version 2.6.4 with 8 nodes).

The Ambari Metrics Collector service either fails to start, or starts for a few seconds and then dies.

 

Details of the installed Ambari Metrics packages:

rpm -qa | grep metrics
ambari-metrics-grafana-2.6.1.0-143.x86_64
ambari-metrics-monitor-2.6.1.0-143.x86_64
ambari-metrics-collector-2.5.0.3-7.x86_64
ambari-metrics-hadoop-sink-2.6.1.0-143.x86_64

All machines run RHEL 7.2.

We performed the following steps to try to resolve the problem:

1. Restart the metrics-collector service:

su - ams -c '/usr/sbin/ambari-metrics-collector --config /etc/ambari-metrics-collector/conf/ stop'
su - ams -c '/usr/sbin/ambari-metrics-collector --config /etc/ambari-metrics-collector/conf/ start'

or

ambari-metrics-collector stop 
ambari-metrics-collector start
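
After each restart we followed the collector log to watch the startup (path assuming the default AMS log directory):

tail -f /var/log/ambari-metrics-collector/ambari-metrics-collector.log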

2. Restart ambari-metrics-monitor on all nodes:

 ambari-metrics-monitor stop
 ambari-metrics-monitor start

3. Clean out the folder /var/lib/ambari-metrics-collector/hbase-tmp/zookeeper/:

mv /var/lib/ambari-metrics-collector/hbase-tmp/zookeeper/zookeeper_0 /tmp/bck/zookeeper/

Then restart the metrics-collector service.
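
For completeness, the full sequence we ran here was roughly the following (creating the backup directory first is our own addition; /tmp/bck/zookeeper is just our chosen backup location):

su - ams -c '/usr/sbin/ambari-metrics-collector --config /etc/ambari-metrics-collector/conf/ stop'
mkdir -p /tmp/bck/zookeeper
mv /var/lib/ambari-metrics-collector/hbase-tmp/zookeeper/zookeeper_0 /tmp/bck/zookeeper/
su - ams -c '/usr/sbin/ambari-metrics-collector --config /etc/ambari-metrics-collector/conf/ start'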

4. Tune the metrics-collector parameters according to https://docs.cloudera.com/HDPDocuments/Ambari-2.2.1.0/bk_ambari_reference_guide/content/_ams_general...

We updated the following parameters in Ambari (values in MB):

metrics_collector_heap_size=1024
hbase_regionserver_heapsize=1024
hbase_master_heapsize=512
hbase_master_xmn_size=128
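
We applied these through the Ambari UI. For reference, a rough equivalent using Ambari's configs.py helper script, assuming it matches our Ambari version (AMBARI_HOST and CLUSTER_NAME are placeholders; metrics_collector_heap_size lives in the ams-env config type, the hbase_* keys in ams-hbase-env):

python /var/lib/ambari-server/resources/scripts/configs.py -u admin -p admin -l AMBARI_HOST -t 8080 -n CLUSTER_NAME -a set -c ams-env -k metrics_collector_heap_size -v 1024
python /var/lib/ambari-server/resources/scripts/configs.py -u admin -p admin -l AMBARI_HOST -t 8080 -n CLUSTER_NAME -a set -c ams-hbase-env -k hbase_regionserver_heapsize -v 1024

and similarly for hbase_master_heapsize and hbase_master_xmn_size.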

Status so far: steps 1-4 have not helped.

From the logs we can see the following:

Log file: ambari-metrics-collector.log

2020-06-25 09:06:14,474 WARN org.apache.zookeeper.ClientCnxn: Session 0x172eab71f310002 for server null, unexpected error, closing socket connection and attempting reconnect
java.net.ConnectException: Connection refused
        at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
        at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:717)
        at org.apache.zookeeper.ClientCnxnSocketNIO.doTransport(ClientCnxnSocketNIO.java:361)
        at org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:1141)
2020-06-25 09:06:14,575 WARN org.apache.hadoop.hbase.zookeeper.RecoverableZooKeeper: Possibly transient ZooKeeper, quorum=master02.sys671.com:61181, exception=org.apache.zookeeper.KeeperException$ConnectionLossException: KeeperErrorCode = ConnectionLoss for /ams-hbase-unsecure/meta-region-server

Log file: hbase-ams-master-master02.sys671.com.log

2020-06-25 09:38:18,799 WARN  [RS:0;master02:51842-SendThread(master02.sys671.com:61181)] zookeeper.ClientCnxn: Session 0x172ead5d73a0004 for server null, unexpected error, closing socket connection and attempting reconnect
java.net.ConnectException: Connection refused
        at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
        at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:717)
        at org.apache.zookeeper.ClientCnxnSocketNIO.doTransport(ClientCnxnSocketNIO.java:361)
        at org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:1125)
2020-06-25 09:38:20,437 INFO  [main-SendThread(master02.sys671.com:61181)] zookeeper.ClientCnxn: Opening socket connection to server master02.sys671.com/23.2.35.171:61181. Will not attempt to authenticate using SASL (unknown error)
2020-06-25 09:38:20,438 WARN  [main-SendThread(master02.sys671.com:61181)] zookeeper.ClientCnxn: Session 0x172ead5d73a0002 for server null, unexpected error, closing socket connection and attempting reconnect
java.net.ConnectException: Connection refused
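
Both log files point to the same symptom: connections to the embedded AMS ZooKeeper at master02.sys671.com:61181 are refused. A quick check we can run on master02 (port taken from the log lines above; ruok is ZooKeeper's standard four-letter health probe and should return imok):

netstat -tulpn | grep 61181
echo ruok | nc master02.sys671.com 61181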

We also see that the collector web port (timeline.metrics.service.webapp.address, default 6188) is not listening:

netstat -tulpn  | grep  6188

Any advice on how to continue from this point?

We would appreciate any help with this problem.

Michael-Bronson