Created 03-21-2018 02:35 PM
I installed HDF 3.1 on a single server. Everything was fine for a while after the fresh installation, then, for reasons I don't understand, the AMS Collector stopped and never starts again. The following logs are from ambari-metrics-collector:
2018-03-21 15:18:14,813 WARN org.apache.hadoop.hbase.zookeeper.RecoverableZooKeeper: Possibly transient ZooKeeper, quorum=hadoop.datalonga.com:61181, exception=org.apache.zookeeper.KeeperException$ConnectionLossException: KeeperErrorCode = ConnectionLoss for /ams-hbase-unsecure/meta-region-server
2018-03-21 15:18:14,996 INFO org.apache.zookeeper.ClientCnxn: Opening socket connection to server hadoop.datalonga.com/10.XX.XX.XX:61181. Will not attempt to authenticate using SASL (unknown error)
2018-03-21 15:18:14,997 WARN org.apache.zookeeper.ClientCnxn: Session 0x0 for server null, unexpected error, closing socket connection and attempting reconnect
java.net.ConnectException: Connection refused
    at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
    at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:717)
    at org.apache.zookeeper.ClientCnxnSocketNIO.doTransport(ClientCnxnSocketNIO.java:361)
    at org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:1141)
* I cleaned the "/var/lib/ambari-metrics-collector/hbase-tmp" directory a couple of times, but it didn't help.
* I couldn't find any port like 61181 anywhere in the AMS configs (see the port check after this list).
* Although there is no HBase service in the HDF 3.1 stack, I see some HBase-related configs and directories, and I can't understand why.
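For reference, 61181 is the default client port of the ZooKeeper instance that the AMS-embedded HBase runs for itself, which is why it does not show up in the cluster ZooKeeper configs. A quick way to check whether anything is actually listening on it (assuming netstat with -p is available on the AMS node):
netstat -tnlp | grep 61181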
Created 03-21-2018 04:09 PM
@Erkan ŞİRİN
For the "transient ZooKeeper" error, please make sure the ZooKeepers are all up and running. AMS uses HBase as its backend database.
In general, AMS operates in 2 modes.
embedded : Single AMS HBase daemon writing to local disk.
distributed : Separate AMS HBase master and RS daemons by default writing to HDFS.
I assume the value of your config in ams-site : "timeline.metrics.service.operation.mode" is embedded.
Please verify the ams-hbase-site config: hbase.cluster.distributed should be false for embedded mode and true for distributed mode.
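A quick way to double-check what the collector actually picked up is to look at the generated config files on the AMS node; the paths below are the usual defaults on an HDF/HDP install and may differ in your environment:
grep -A1 timeline.metrics.service.operation.mode /etc/ambari-metrics-collector/conf/ams-site.xml
grep -A1 hbase.cluster.distributed /etc/ams-hbase/conf/hbase-site.xml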
Created 03-22-2018 08:58 AM
Hi @Geoffrey Shelton Okot, thanks for the answer. ZooKeeper is up and running; I verified with telnet:
telnet hadoop.datalonga.com 2181
AMS is in embedded mode and "hbase.cluster.distributed = false"
Created 03-21-2018 11:14 PM
What is the Ambari version?
If Ambari was upgraded, did you do the post-upgrade tasks?
Could you check the value of hbase.zookeeper.property.clientPort in Ambari?
Also what is the value in "Metrics Service operation mode" in Ambari? embedded or distributed?
Could you stop AMS completely and make sure it's actually stopped by checking the process, for example "ps aux | grep ^ams", please?
Then try starting from Ambari.
Did you also give enough heap for AMS?
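As a rough sketch of that stop-and-verify sequence (standard commands only; the PID is whatever the grep shows):
# after stopping AMS from Ambari, confirm nothing is left running under the ams user
ps aux | grep ^ams
# if a collector or embedded HBase process is still running, kill it by PID
# kill -9 <pid>
# then start AMS again from Ambari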
Created 03-22-2018 09:18 AM
Hi @Hajime thanks for the answer.
1. Ambari version: 2.6.1.3-3
2. Ambari was not upgraded; it is a fresh HDF 3.1 installation.
3. hbase.zookeeper.property.clientPort: {{zookeeper_clientPort}}
4. Metrics Service operation mode is embedded
5. "ps aux | grep ^ams" command result:
ams 8772 0.7 0.0 656452 11620 ? Sl 2017 1463:20 /usr/bin/python2.7 /usr/lib/python2.6/site-packages/resource_monitoring/main.py start
6. I increased the AMS heap from 512 MB to 4196 MB and restarted via Ambari (a restart is required because of the config change). It took a long time to start, but as soon as it starts, it stops.
Created 03-22-2018 09:20 AM
If it is in embedded mode, could you try entering 61181 instead of {{zookeeper_clientPort}}?
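A simple way to confirm whether the embedded ZooKeeper is answering on that port is a ZooKeeper four-letter-word probe (this assumes nc is installed and the four-letter words are not restricted); a healthy instance replies "imok":
echo ruok | nc hadoop.datalonga.com 61181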
Created 03-22-2018 09:20 AM
BTW, is "hadoop.datalonga.com" your AMS node?
Created 03-22-2018 09:23 AM
If the above hostname is not your AMS node, please check "hbase.zookeeper.quorum".
Created 03-22-2018 12:27 PM
I tried 61181 but it didn't work. Yes, my AMS node is hadoop.datalonga.com; I have only one node.
hbase.zookeeper.quorum: {{zookeeper_quorum_hosts}}
Created 03-22-2018 09:26 AM
Back up hbase.tmp.dir/zookeeper, then remove the hbase.tmp.dir/zookeeper/* files and retry.
Can you follow this procedure: Cleaning up AMS data?
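Roughly, and assuming the hbase-tmp path mentioned earlier is your hbase.tmp.dir, the ZooKeeper part of that clean-up looks something like this (with AMS fully stopped first):
# renaming both backs up the embedded ZooKeeper data and clears it in one step
mv /var/lib/ambari-metrics-collector/hbase-tmp/zookeeper /var/lib/ambari-metrics-collector/hbase-tmp/zookeeper.bak
# then start AMS again from Ambari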
Created 03-22-2018 12:59 PM
I followed the document exactly as it says. It didn't work, though this time the stop took some time. I have re-examined the AMS logs; they seem to have changed:
org.apache.hadoop.yarn.exceptions.YarnRuntimeException: java.net.BindException: Problem binding to [0.0.0.0:60200] java.net.BindException: Address already in use; For more details see: http://wiki.apache.org/hadoop/BindException
......
Caused by: java.net.BindException: Problem binding to [0.0.0.0:60200] java.net.BindException: Address already in use; For more details see: http://wiki.apache.org/hadoop/BindException
    at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
....
Caused by: java.net.BindException: Address already in use
........
2018-03-22 14:42:46,635 INFO org.apache.hadoop.service.AbstractService: Service org.apache.hadoop.yarn.server.applicationhistoryservice.ApplicationHistoryServer failed in state STARTED; cause: org.apache.hadoop.yarn.exceptions.YarnRuntimeException: java.net.BindException: Problem binding to [0.0.0.0:60200] java.net.BindException: Address already in use; For more details see: http://wiki.apache.org/hadoop/BindException
org.apache.hadoop.yarn.exceptions.YarnRuntimeException: java.net.BindException: Problem binding to [0.0.0.0:60200] java.net.BindException: Address already in use; For more details see: http://wiki.apache.org/hadoop/BindException
.....
Caused by: java.net.BindException: Problem binding to [0.0.0.0:60200] java.net.BindException: Address already in use; For more details see: http://wiki.apache.org/hadoop/BindException
.....
Caused by: java.net.BindException: Address already in use
.....
2018-03-22 14:42:46,960 INFO org.apache.hadoop.util.ExitUtil: Exiting with status -1
2018-03-22 14:42:46,964 INFO org.apache.helix.manager.zk.ZKHelixManager: disconnect hadoop.datalonga.com_12001(PARTICIPANT) from ambari-metrics-cluster
2018-03-22 14:42:46,964 INFO org.apache.helix.healthcheck.ParticipantHealthReportTask: Stop ParticipantHealthReportTimerTask
2018-03-22 14:42:46,964 INFO org.apache.helix.messaging.handling.HelixTaskExecutor: Shutting down HelixTaskExecutor
2018-03-22 14:42:46,964 INFO org.apache.helix.messaging.handling.HelixTaskExecutor: Reset HelixTaskExecutor
2018-03-22 14:42:46,965 INFO org.apache.helix.monitoring.mbeans.MessageQueueMonitor: Unregistering ClusterStatus: cluster=ambari-metrics-cluster,messageQueue=hadoop.datalonga.com_12001
2018-03-22 14:42:46,965 INFO org.apache.helix.messaging.handling.HelixTaskExecutor: Reset exectuor for msgType: TASK_REPLY, pool: java.util.concurrent.ThreadPoolExecutor@7a639ec5[Running, pool size = 0, active threads = 0, queued tasks = 0, completed tasks = 0]
2018-03-22 14:42:46,965 INFO org.apache.helix.messaging.handling.HelixTaskExecutor: Shutting down pool: java.util.concurrent.ThreadPoolExecutor@7a639ec5[Running, pool size = 0, active threads = 0, queued tasks = 0, completed tasks = 0]
2018-03-22 14:42:46,965 INFO org.apache.helix.messaging.handling.HelixTaskExecutor: Reset exectuor for msgType: STATE_TRANSITION, pool: java.util.concurrent.ThreadPoolExecutor@6c37bd27[Running, pool size = 4, active threads = 0, queued tasks = 0, completed tasks = 4]
2018-03-22 14:42:46,965 INFO org.apache.helix.messaging.handling.HelixTaskExecutor: Shutting down pool: java.util.concurrent.ThreadPoolExecutor@6c37bd27[Running, pool size = 4, active threads = 0, queued tasks = 0, completed tasks = 4]
2018-03-22 14:42:46,969 WARN org.apache.helix.participant.statemachine.StateModel: Default reset method invoked. Either because the process longer own this resource or session timedout
2018-03-22 14:42:46,969 WARN org.apache.helix.participant.statemachine.StateModel: Default reset method invoked. Either because the process longer own this resource or session timedout
2018-03-22 14:42:46,969 INFO org.apache.helix.monitoring.ParticipantMonitor: Registering bean: Cluster=ambari-metrics-cluster,Resource=METRIC_AGGREGATORS,Transition=OFFLINE--ONLINE
2018-03-22 14:42:46,969 INFO org.apache.helix.messaging.handling.HelixTaskExecutor: Shutdown HelixTaskExecutor finished
2018-03-22 14:42:46,969 INFO org.apache.helix.manager.zk.CallbackHandler: 199 START:INVOKE /ambari-metrics-cluster/INSTANCES/hadoop.datalonga.com_12001/MESSAGES listener:org.apache.helix.messaging.handling.HelixTaskExecutor
2018-03-22 14:42:46,969 INFO org.apache.helix.manager.zk.CallbackHandler: hadoop.datalonga.com_12001 unsubscribe child-change. path: /ambari-metrics-cluster/INSTANCES/hadoop.datalonga.com_12001/MESSAGES, listener: org.apache.helix.messaging.handling.HelixTaskExecutor@49aa766b
2018-03-22 14:42:46,970 INFO org.apache.hadoop.yarn.server.applicationhistoryservice.ApplicationHistoryServer: SHUTDOWN_MSG:
/************************************************************
SHUTDOWN_MSG: Shutting down ApplicationHistoryServer at hadoop.datalonga.com/10.XX.XX.XX
************************************************************/
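That BindException means something is already bound to port 60200, which the collector is trying to use, so the usual suspect is a leftover collector process from a previous start attempt. A quick check (assuming netstat with -p is available) shows the owning PID, which can then be killed before restarting:
netstat -tnlp | grep 60200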
Created 03-22-2018 02:39 PM
Hi again. When I checked half an hour later, surprisingly Ambari Metrics was working, and I saw the amazing green tick beside it :). I didn't do anything except @Geoffrey Shelton Okot's last suggestion, the AMS clean-up instructions. So, case closed for now. Thanks a lot again @Hajime and @Geoffrey Shelton Okot
Created 03-22-2018 02:59 PM
Great to know all is working for you. I don't think it started working without your touch 🙂
Try to recall all that you did. I am certain you cleaned the directories but MAYBE didn't kill the AMS process, so when that process finally died you somehow restarted the AMS, et voilà!
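If it happens again, besides checking for leftover ams processes before a restart, the collector can be probed directly once it is up; assuming the default webapp port of 6188 (it is set in ams-site and may differ), something like this should return metric metadata when the collector is healthy:
curl http://hadoop.datalonga.com:6188/ws/v1/timeline/metrics/metadata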