Created on 01-22-2021 02:54 AM - edited on 01-29-2021 12:34 AM by subratadas
Ambari Metrics System (AMS) collects, aggregates, and serves Hadoop and system metrics in Ambari-managed clusters. AMS has four components: Metrics Monitors, Hadoop Sinks, Metrics Collector, and Grafana.
Metrics Monitors on each host in the cluster collect system-level metrics and publish to the Metrics Collector.
Hadoop Sinks plug in to Hadoop components to publish Hadoop metrics to the Metrics Collector.
The Metrics Collector is a daemon that runs on a specific host in the cluster and receives data from the registered publishers, the Monitors and the Sinks.
Grafana is a daemon that runs on a specific host in the cluster and serves pre-built dashboards for visualizing metrics collected in the Metrics Collector.
In this article, we will look at how to troubleshoot AMS-related issues effectively.
Several kinds of issues arise in AMS, such as the collector crashing, metrics not being available, Grafana failing to start, and time-range metric problems in Grafana. We will go through these issues step by step and see how to troubleshoot each of them.
Issues That Arise in AMS
Collector not coming up or crashing frequently
This is the most common problem with the Ambari Metrics Collector (AMC). There can be multiple reasons for frequent or intermittent AMC crashes. Here is how to approach the problem step by step and debug it.
[root@c1236-node4 ~]# rpm -qa|grep ambari-metrics
ambari-metrics-monitor-2.7.5.0-72.x86_64
ambari-metrics-collector-2.7.5.0-72.x86_64
ambari-metrics-hadoop-sink-2.7.5.0-72.x86_64
If the versions do not match, upgrade AMS accordingly. Follow Upgrading Ambari Metrics for the upgrade process.
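The package check above can be sketched in Python. This is a minimal sketch, not an official tool: the function names are hypothetical, and it assumes that "matching" means every ambari-metrics package reports the same version-release string. Comparing that string with the Ambari server version is left to the reader.

```python
import re

# Matches package names such as "ambari-metrics-collector-2.7.5.0-72.x86_64".
PACKAGE_RE = re.compile(r"^(ambari-metrics-[a-z-]+?)-(\d[\w.]*-\d+)\.\w+$")


def ams_package_versions(rpm_output):
    """Map each ambari-metrics package name to its version-release string."""
    versions = {}
    for line in rpm_output.splitlines():
        match = PACKAGE_RE.match(line.strip())
        if match:
            versions[match.group(1)] = match.group(2)
    return versions


def versions_consistent(versions):
    """True only if at least one package was found and all versions agree."""
    return len(set(versions.values())) == 1
```

Pass it the text printed by `rpm -qa | grep ambari-metrics`; a False result means the packages disagree or none were found.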
2018-03-21 15:18:14,996 INFO org.apache.zookeeper.ClientCnxn: Opening socket connection to server hadoop.datalonga.com/10.XX.XX.XX:61181. Will not attempt to authenticate using SASL (unknown error)
2018-03-21 15:18:14,997 WARN org.apache.zookeeper.ClientCnxn: Session 0x0 for server null, unexpected error, closing socket connection and attempting reconnect
java.net.ConnectException: Connection refused
at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:717)
at org.apache.zookeeper.ClientCnxnSocketNIO.doTransport(ClientCnxnSocketNIO.java:361)
at org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:1141)
INFO org.apache.hadoop.yarn.webapp.WebApps: Registered webapp guice modules
2017-08-04 15:30:36,965 INFO org.apache.hadoop.http.HttpServer2: HttpServer.start() threw a non Bind IOException
java.net.BindException: Port in use: xxxxxx:6188
at org.apache.hadoop.http.HttpServer2.openListeners(HttpServer2.java:919)
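Both log excerpts above point at a port: 61181 in the ZooKeeper "Connection refused" message and 6188 in the "Port in use" message. A quick way to see whether anything is listening on them is a plain TCP probe. The sketch below uses only the Python standard library; the function name is hypothetical and the port numbers are the ones taken from the logs.

```python
import socket


def port_is_listening(host, port, timeout=2.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers connection refused, timeouts, and unresolvable hosts.
        return False
```

For example, `port_is_listening("<ams-host>", 61181)` returning False is consistent with the "Connection refused" message, while `port_is_listening("<ams-host>", 6188)` returning True when the collector is stopped suggests that another process is holding the port.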
In such a scenario,
In most cases, we will see ASYNC_PROCESS messages in the AMC log. In that case, check the hbase-ams-master-<hostname>.log file [EMBEDDED mode] or the hbase-ams-regionserver-<hostname>.log file [DISTRIBUTED mode]. You will see the following lines frequently in these logs.
WARN [JvmPauseMonitor] util.JvmPauseMonitor: Detected pause in JVM or host machine (eg GC): pause of approximately 4058ms GC pool 'ParNew' had collection(s): count=1 time=2415ms
WARN [RpcServer.FifoWFPBQ.default.handler=32,queue=2,port=61320] ipc.RpcServer: (responseTooSlow): {"call":"Multi(org.apache.hadoop.hbase.protobuf.generated.ClientProtos$MultiRequest)","starttimems":1609769073845,"responsesize":739217,"method":"Multi","processingtimems":10003,"client":"10.118.5.94:51114","queuetimems":0,"class":"HRegionServer"}
WARN [7,queue=0,port=16020] regionserver.RSRpcServices - Large batch operation detected (greater than 5000) (HBASE-18023). Requested Number of Rows: 7096 Client: xxx
In such cases, check whether the services are generating more metrics than AMC can handle with its current configuration. To rectify this type of issue, we can use the pointers below to stabilize the collector.
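To get a feel for how often these warnings occur, the HBase log can be summarized with a few regular expressions. This is only a sketch: the function name is hypothetical, and the patterns are derived from the sample lines above, so they may need adjusting for other HBase versions.

```python
import re

# Patterns taken from the sample WARN lines shown above.
PAUSE_RE = re.compile(r"pause of approximately (\d+)ms")
SLOW_RE = re.compile(r'"processingtimems":(\d+)')
BATCH_RE = re.compile(r"Requested Number of Rows: (\d+)")


def summarize_warnings(log_text):
    """Count JVM pauses, slow RPC responses, and large batches in log_text."""
    pauses = [int(v) for v in PAUSE_RE.findall(log_text)]
    slow = [int(v) for v in SLOW_RE.findall(log_text)]
    batches = [int(v) for v in BATCH_RE.findall(log_text)]
    return {
        "jvm_pause_count": len(pauses),
        "max_jvm_pause_ms": max(pauses, default=0),
        "slow_response_count": len(slow),
        "max_processing_ms": max(slow, default=0),
        "large_batch_count": len(batches),
        "max_batch_rows": max(batches, default=0),
    }
```

Feed it the contents of the hbase-ams-master or hbase-ams-regionserver log. Steadily growing counts suggest the collector is overloaded rather than hitting a one-off event.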
ERROR [RpcServer.FifoWFPBQ.priority.handler=19,queue=1,port=61320] regionserver.HRegionServer: Received CLOSE for a region which is not online, and we're not opening.
To mitigate such issues, try the following steps and check the status of AMC.
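Connect to the ZooKeeper CLI:
/usr/lib/ams-hbase/bin/hbase --config /etc/ams-hbase/conf zkcli
Remove the existing AMS HBase znode (the path is the current value of zookeeper.znode.parent; /ams-hbase-secure in this example):
rmr /ams-hbase-secure
Set a new znode name in the AMS configuration property zookeeper.znode.parent, then restart AMS.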
Metrics not available in Ambari UI
If metrics are not showing in the Ambari UI, first check whether there is an issue with the collector. Below are some known issues in recent versions of Ambari where graphs are not available even though AMC is running fine.
In Ambari 2.7.3 and above, we have observed that NIFI, Ambari_Metrics, and Kafka do not show metrics. Here are the workarounds to mitigate the issue.
a) NIFI:
1) vi /var/lib/ambari-server/resources/common-services/NIFI/1.0.0/metainfo.xml
Change <timelineAppid>NIFI</timelineAppid> to <timelineAppid>nifi</timelineAppid> in two places.
2) Replace "<timelineAppid>NIFI</timelineAppid>" with "<timelineAppid>nifi</timelineAppid>" in the /var/lib/ambari-server/resources/mpacks/hdf-ambari-mpack-3.5.1.*/common-services/NIFI/1.0.0/metainfo.xml file.
b) Ambari_Metrics:
Replace "<timelineAppid>AMS-HBASE</timelineAppid>" with "<timelineAppid>ams-hbase</timelineAppid>" in the /var/lib/ambari-server/resources/mpacks/hdf-ambari-mpack-3.5.1.*/stacks/HDF/3.2.b/services/AMBARI_METRICS/metainfo.xml file.
c) Kafka:
Add "<timelineAppid>kafka_broker</timelineAppid>" after "<name>KAFKA_BROKER</name>" in the /var/lib/ambari-server/resources/mpacks/hdf-ambari-mpack-3.5.1.*/stacks/HDF/3.3/services/KAFKA/metainfo.xml file.
After making the above changes, restart ambari-server.
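The NIFI and Ambari_Metrics workarounds both amount to lower-casing the value inside <timelineAppid>. A minimal sketch of that edit follows. It works on the file contents as text so that comments and formatting in metainfo.xml are preserved, and it changes every <timelineAppid> element in the text, which is an assumption that goes slightly beyond the specific replacements listed above. The function name is hypothetical, and the Kafka case (adding a missing element) is not covered. Back up the file before writing anything back.

```python
import re

# Captures the opening tag, the value, and the closing tag separately.
APPID_RE = re.compile(r"(<timelineAppid>)([^<]*)(</timelineAppid>)")


def lowercase_timeline_appids(xml_text):
    """Return (new_text, changed_count) with <timelineAppid> values lower-cased."""
    changed = 0

    def _lower(match):
        nonlocal changed
        value = match.group(2)
        if value != value.lower():
            changed += 1
        return match.group(1) + value.lower() + match.group(3)

    return APPID_RE.sub(_lower, xml_text), changed
```

A changed count of 0 means the file already uses lower-case application IDs and needs no edit.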
Grafana metric-related issues
Sometimes we see problems in Grafana such as metrics not being available, time-range metrics not being shown, or inaccurate information in graphs. To investigate Grafana issues, perform the following checks in sequence.
2021-01-22 09:22:23,051 INFO TimelineMetricHostAggregatorMinute: End aggregation cycle @ Fri Jan 22 09:22:23 UTC 2021
2021-01-22 09:23:20,780 INFO TimelineClusterAggregatorSecond: Started Timeline aggregator thread @ Fri Jan 22 09:23:20 UTC 2021
2021-01-22 09:23:20,784 INFO TimelineClusterAggregatorSecond: Last Checkpoint read : Fri Jan 22 09:20:00 UTC 2021
2021-01-22 09:23:20,784 INFO TimelineClusterAggregatorSecond: Rounded off checkpoint : Fri Jan 22 09:20:00 UTC 2021
2021-01-22 09:23:20,784 INFO TimelineClusterAggregatorSecond: Last check point time: 1611307200000, lagBy: 200 seconds.
2021-01-22 09:23:20,784 INFO TimelineClusterAggregatorSecond: Start aggregation cycle @ Fri Jan 22 09:23:20 UTC 2021, startTime = Fri Jan 22 09:20:00 UTC 2021, endTime = Fri Jan 22 09:22:00 UTC 2021
2021-01-22 09:23:20,784 INFO TimelineClusterAggregatorSecond: Skipping aggregation for metric patterns : sdisk\_%,boottime
2021-01-22 09:23:23,129 INFO TimelineClusterAggregatorSecond: Saving 23764 metric aggregates.
2021-01-22 09:23:23,868 INFO TimelineClusterAggregatorSecond: End aggregation cycle @ Fri Jan 22 09:23:23 UTC 2021
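The aggregator lines above include a lagBy value, which shows how far each aggregator's checkpoint trails the current time. The sketch below extracts the latest lagBy per aggregator from log text. The function names and the threshold are hypothetical, and the pattern is based only on the sample line shown above.

```python
import re

# Based on: "INFO <Aggregator>: Last check point time: <ms>, lagBy: <n> seconds."
LAG_RE = re.compile(r"INFO (\w+): Last check point time: (\d+), lagBy: (\d+) seconds")


def aggregator_lags(log_text):
    """Map each aggregator name to its most recent lagBy value in seconds."""
    lags = {}
    for name, _checkpoint_ms, lag in LAG_RE.findall(log_text):
        lags[name] = int(lag)  # later lines overwrite earlier ones
    return lags


def lagging_aggregators(lags, threshold_seconds):
    """Return the sorted names of aggregators whose lag exceeds the threshold."""
    return sorted(name for name, lag in lags.items() if lag > threshold_seconds)
```

What counts as too much lag depends on the aggregator interval, so the threshold is left to the caller.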
2. If data is missing from all the graphs, check the metadata output and look for the service metrics in it.
3. If data is missing from only a few graphs, check the metadata output for that particular service.
Example:
If data is missing for the HiveServer2 metrics:
curl -v --insecure 'https://<ams-host>:6188/ws/v1/timeline/metrics/metadata?appId=hiveserver2' | python -m json.tool > hiveserver2.txt
4. Check whether any whitelisting or blacklisting has been applied in the configuration, which might prevent AMS from processing those metrics. Also check whether a setting such as "Disable Minute host aggregator" is enabled in the configs.
5. There are some known Grafana issues in recent versions of Ambari. A few of them are listed here.
AMBARI-25570
AMBARI-25563
AMBARI-25383
AMBARI-25457
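For checks 2 and 3, the saved metadata output can be inspected programmatically instead of by eye. The sketch below assumes the metadata response is a JSON object that maps each appId to a list of entries carrying a "metricname" field. That shape is an assumption and should be verified against your own output file. The function name is hypothetical.

```python
import json


def metric_names(metadata_json, app_id):
    """Return the sorted metric names listed for app_id, or [] if it is absent."""
    metadata = json.loads(metadata_json)
    entries = metadata.get(app_id, [])
    # Skip entries that lack the assumed "metricname" field.
    return sorted(e["metricname"] for e in entries if "metricname" in e)
```

For example, `metric_names(open("hiveserver2.txt").read(), "hiveserver2")` returning an empty list would indicate that the collector has no metadata for that service.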
Happy Learning!!!!!!!