
HDF 3.1 Ambari Metrics Collector doesn't start

Expert Contributor

I installed HDF 3.1 on a single server. There was no problem for a while after the fresh installation, but then, for reasons I don't understand, the AMS Collector stopped and never starts again. The following logs are from ambari-metrics-collector:

2018-03-21 15:18:14,813 WARN org.apache.hadoop.hbase.zookeeper.RecoverableZooKeeper: Possibly transient ZooKeeper, quorum=hadoop.datalonga.com:61181, exception=org.apache.zookeeper.KeeperException$ConnectionLossException: KeeperErrorCode = ConnectionLoss for /ams-hbase-unsecure/meta-region-server
2018-03-21 15:18:14,996 INFO org.apache.zookeeper.ClientCnxn: Opening socket connection to server hadoop.datalonga.com/10.XX.XX.XX:61181. Will not attempt to authenticate using SASL (unknown error)
2018-03-21 15:18:14,997 WARN org.apache.zookeeper.ClientCnxn: Session 0x0 for server null, unexpected error, closing socket connection and attempting reconnect
java.net.ConnectException: Connection refused
	at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
	at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:717)
	at org.apache.zookeeper.ClientCnxnSocketNIO.doTransport(ClientCnxnSocketNIO.java:361)
	at org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:1141)

* I cleaned the "/var/lib/ambari-metrics-collector/hbase-tmp" directory a couple of times, but it was not useful.

* I couldn't find any port like 61181 in the AMS configs.

* Although there is no HBase service in the HDF 3.1 stack, I see some HBase-related configs and directories; I can't understand why.


Mentor

@Erkan ŞİRİN
For the "Possibly transient ZooKeeper" error, please make sure all ZooKeepers are up and running. AMS uses HBase as its backend database.

In general, AMS operates in 2 modes.

embedded : a single AMS HBase daemon writing to local disk.

distributed : separate AMS HBase Master and RegionServer daemons, by default writing to HDFS.

I assume the value of your ams-site config "timeline.metrics.service.operation.mode" is embedded.

Please verify the ams-hbase-site config: hbase.cluster.distributed = false for embedded, true for distributed.
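A minimal sketch of how to check the mode on the collector host. Assumption: the AMS-managed HBase config usually lives at /etc/ams-hbase/conf/hbase-site.xml; here we grep a sample copy written to a temp file so the commands are self-contained.

```shell
# Stand-in for /etc/ams-hbase/conf/hbase-site.xml on a real collector host
conf=$(mktemp)
cat > "$conf" <<'EOF'
<property>
  <name>hbase.cluster.distributed</name>
  <value>false</value>
</property>
EOF
# false = embedded (single local daemon), true = distributed (writes to HDFS)
mode=$(grep -A1 'hbase.cluster.distributed' "$conf" | grep -o '<value>.*</value>')
echo "$mode"
```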

Expert Contributor

Hi @Geoffrey Shelton Okot, thanks for the answer. ZooKeeper is up and running; I verified with telnet:

telnet hadoop.datalonga.com 2181

AMS is in embedded mode and "hbase.cluster.distributed = false"
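One caveat worth noting: in embedded mode AMS runs its own ZooKeeper on port 61181 by default, separate from the cluster ZooKeeper on 2181, so a successful telnet to 2181 does not prove the AMS-embedded ZooKeeper is reachable. A bash-only probe (no telnet needed), shown against localhost here; on the real host, substitute hadoop.datalonga.com:

```shell
# Probe both the cluster ZooKeeper port (2181) and the AMS embedded
# ZooKeeper port (61181) using bash's /dev/tcp pseudo-device.
result=""
for port in 2181 61181; do
  if timeout 1 bash -c "exec 3<>/dev/tcp/127.0.0.1/$port" 2>/dev/null; then
    result="$result port:$port=open"
  else
    result="$result port:$port=closed"
  fi
done
echo "$result"
```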

What is the Ambari version?

If Ambari was upgraded, did you do the post-upgrade tasks?

Could you check the value of hbase.zookeeper.property.clientPort in Ambari?

Also what is the value in "Metrics Service operation mode" in Ambari? embedded or distributed?

Could you stop AMS completely and make sure it's actually stopped by checking the process, for example "ps aux | grep ^ams", please?

Then try starting from Ambari.

Did you also give enough heap for AMS?
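The stop-and-verify step above can be sketched like this. Assumption: "ams" is the default service account for Ambari Metrics; kill any leftover PIDs before starting again from Ambari.

```shell
# After stopping AMS in Ambari, confirm no processes owned by the
# "ams" user are still alive before starting again.
leftover=$(ps aux | awk '$1 == "ams" {print $2}')   # PIDs still owned by ams
if [ -z "$leftover" ]; then
  msg="no ams processes running"
else
  msg="ams processes still running: $leftover"   # kill these, then start from Ambari
fi
echo "$msg"
```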

Expert Contributor

Hi @Hajime thanks for the answer.

1. Ambari version: 2.6.1.3-3,

2. Ambari is not upgraded it is fresh HDF 3.1 installation.

3. hbase.zookeeper.property.clientPort: {{zookeeper_clientPort}}

4. Metrics Service operation mode is embedded

5. "ps aux | grep ^ams" command result:

ams 8772 0.7 0.0 656452 11620 ? Sl 2017 1463:20 /usr/bin/python2.7 /usr/lib/python2.6/site-packages/resource_monitoring/main.py start

6. I increased the AMS heap from 512 MB to 4196 MB and restarted via Ambari, which was required due to the config change. It took a long time to start, but it stops as soon as it starts.

If you are in embedded mode, could you try entering 61181 instead of {{zookeeper_clientPort}}?

BTW, is "hadoop.datalonga.com" your AMS node?

If the hostname above is not your AMS node, please check "hbase.zookeeper.quorum".

Expert Contributor

I tried 61181, but it didn't work. Yes, my AMS node is hadoop.datalonga.com; I have only one node.

hbase.zookeeper.quorum: {{zookeeper_quorum_hosts}}

Mentor

@Erkan ŞİRİN

Back up hbase.tmp.dir/zookeeper, then remove the hbase.tmp.dir/zookeeper/* files and retry.

Can you follow this procedure: Cleaning up AMS data?
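A hedged sketch of that backup-then-clear flow, demonstrated on a scratch copy so it is safe to run anywhere. On a real host, tmpdir would be hbase.tmp.dir (by default /var/lib/ambari-metrics-collector/hbase-tmp), and AMS must be fully stopped first.

```shell
# Scratch stand-ins for hbase.tmp.dir and a backup location
tmpdir=$(mktemp -d)/hbase-tmp
mkdir -p "$tmpdir/zookeeper/version-2"   # mimic the embedded ZK data layout
backup=$(mktemp -d)

cp -a "$tmpdir/zookeeper" "$backup"/     # 1. back up the ZooKeeper data
rm -rf "$tmpdir"/zookeeper/*             # 2. clear it, keeping the directory itself

ls "$tmpdir/zookeeper"                   # now empty
ls "$backup/zookeeper"                   # backup retains version-2
```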

Expert Contributor

I followed the document exactly as it says. It didn't work, though this time the stop took some time. I have re-examined the AMS logs; they seem to have changed:

org.apache.hadoop.yarn.exceptions.YarnRuntimeException: java.net.BindException: Problem binding to [0.0.0.0:60200] java.net.BindException: Address already in use; For more details see:  http://wiki.apache.org/hadoop/BindException

......

Caused by: java.net.BindException: Problem binding to [0.0.0.0:60200] java.net.BindException: Address already in use; For more details see:  http://wiki.apache.org/hadoop/BindException
	at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
....

Caused by: java.net.BindException: Address already in use
........
2018-03-22 14:42:46,635 INFO org.apache.hadoop.service.AbstractService: Service org.apache.hadoop.yarn.server.applicationhistoryservice.ApplicationHistoryServer failed in state STARTED; cause: org.apache.hadoop.yarn.exceptions.YarnRuntimeException: java.net.BindException: Problem binding to [0.0.0.0:60200] java.net.BindException: Address already in use; For more details see:  http://wiki.apache.org/hadoop/BindException
org.apache.hadoop.yarn.exceptions.YarnRuntimeException: java.net.BindException: Problem binding to [0.0.0.0:60200] java.net.BindException: Address already in use; For more details see:  http://wiki.apache.org/hadoop/BindException

.....

Caused by: java.net.BindException: Problem binding to [0.0.0.0:60200] java.net.BindException: Address already in use; For more details see:  http://wiki.apache.org/hadoop/BindException
.....

Caused by: java.net.BindException: Address already in use
.....

2018-03-22 14:42:46,960 INFO org.apache.hadoop.util.ExitUtil: Exiting with status -1
2018-03-22 14:42:46,964 INFO org.apache.helix.manager.zk.ZKHelixManager: disconnect hadoop.datalonga.com_12001(PARTICIPANT) from ambari-metrics-cluster
2018-03-22 14:42:46,964 INFO org.apache.helix.healthcheck.ParticipantHealthReportTask: Stop ParticipantHealthReportTimerTask
2018-03-22 14:42:46,964 INFO org.apache.helix.messaging.handling.HelixTaskExecutor: Shutting down HelixTaskExecutor
2018-03-22 14:42:46,964 INFO org.apache.helix.messaging.handling.HelixTaskExecutor: Reset HelixTaskExecutor
2018-03-22 14:42:46,965 INFO org.apache.helix.monitoring.mbeans.MessageQueueMonitor: Unregistering ClusterStatus: cluster=ambari-metrics-cluster,messageQueue=hadoop.datalonga.com_12001
2018-03-22 14:42:46,965 INFO org.apache.helix.messaging.handling.HelixTaskExecutor: Reset exectuor for msgType: TASK_REPLY, pool: java.util.concurrent.ThreadPoolExecutor@7a639ec5[Running, pool size = 0, active threads = 0, queued tasks = 0, completed tasks = 0]
2018-03-22 14:42:46,965 INFO org.apache.helix.messaging.handling.HelixTaskExecutor: Shutting down pool: java.util.concurrent.ThreadPoolExecutor@7a639ec5[Running, pool size = 0, active threads = 0, queued tasks = 0, completed tasks = 0]
2018-03-22 14:42:46,965 INFO org.apache.helix.messaging.handling.HelixTaskExecutor: Reset exectuor for msgType: STATE_TRANSITION, pool: java.util.concurrent.ThreadPoolExecutor@6c37bd27[Running, pool size = 4, active threads = 0, queued tasks = 0, completed tasks = 4]
2018-03-22 14:42:46,965 INFO org.apache.helix.messaging.handling.HelixTaskExecutor: Shutting down pool: java.util.concurrent.ThreadPoolExecutor@6c37bd27[Running, pool size = 4, active threads = 0, queued tasks = 0, completed tasks = 4]
2018-03-22 14:42:46,969 WARN org.apache.helix.participant.statemachine.StateModel: Default reset method invoked. Either because the process longer own this resource or session timedout
2018-03-22 14:42:46,969 WARN org.apache.helix.participant.statemachine.StateModel: Default reset method invoked. Either because the process longer own this resource or session timedout
2018-03-22 14:42:46,969 INFO org.apache.helix.monitoring.ParticipantMonitor: Registering bean: Cluster=ambari-metrics-cluster,Resource=METRIC_AGGREGATORS,Transition=OFFLINE--ONLINE
2018-03-22 14:42:46,969 INFO org.apache.helix.messaging.handling.HelixTaskExecutor: Shutdown HelixTaskExecutor finished
2018-03-22 14:42:46,969 INFO org.apache.helix.manager.zk.CallbackHandler: 199 START:INVOKE /ambari-metrics-cluster/INSTANCES/hadoop.datalonga.com_12001/MESSAGES listener:org.apache.helix.messaging.handling.HelixTaskExecutor
2018-03-22 14:42:46,969 INFO org.apache.helix.manager.zk.CallbackHandler: hadoop.datalonga.com_12001 unsubscribe child-change. path: /ambari-metrics-cluster/INSTANCES/hadoop.datalonga.com_12001/MESSAGES, listener: org.apache.helix.messaging.handling.HelixTaskExecutor@49aa766b
2018-03-22 14:42:46,970 INFO org.apache.hadoop.yarn.server.applicationhistoryservice.ApplicationHistoryServer: SHUTDOWN_MSG: 
/************************************************************
SHUTDOWN_MSG: Shutting down ApplicationHistoryServer at hadoop.datalonga.com/10.XX.XX.XX
************************************************************/
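The BindException above means some other process already holds port 60200 (the AMS timeline service port), typically a half-dead collector left over from a previous start. One way to check, sketched with ss (iproute2) and a fallback message when nothing is found:

```shell
# Look for a listener on 60200; if found, locate and kill the stale
# collector process, then restart AMS from Ambari.
if ss -ltn 2>/dev/null | grep -q ':60200 '; then
  msg="something is listening on 60200"
else
  msg="nothing listening on 60200"
fi
echo "$msg"
```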

Expert Contributor

Hi again. When I checked half an hour later, surprisingly, Ambari Metrics was working, and I saw the amazing green tick beside it :). I didn't do anything but @Geoffrey Shelton Okot's last suggestion, the AMS clean instructions. So, case closed for now. Thanks a lot again @Hajime and @Geoffrey Shelton Okot.

Mentor

@Erkan ŞİRİN

Great to know all is working for you. I don't think it started working without your touch 🙂

Try to recall all that you did. I am certain you cleaned the directories but MAYBE didn't kill the AMS process, so when the process died, you somehow restarted AMS, et voilà!
