Ambari Metrics System (AMS) collects, aggregates, and serves Hadoop and system metrics in Ambari-managed clusters. AMS has four components: Metrics Monitors, Hadoop Sinks, the Metrics Collector, and Grafana.

Metrics Monitors on each host in the cluster collect system-level metrics and publish to the Metrics Collector.

Hadoop Sinks plug in to Hadoop components to publish Hadoop metrics to the Metrics Collector.

The Metrics Collector is a daemon that runs on a specific host in the cluster and receives data from the registered publishers, the Monitors and the Sinks.

Grafana is a daemon that runs on a specific host in the cluster and serves pre-built dashboards for visualizing metrics collected in the Metrics Collector.

 

In this article, we will look at how to troubleshoot AMS-related issues effectively.

Several kinds of issues can arise in AMS, leading to different symptoms such as the collector crashing, metrics not being available, Grafana failing to start, and time-range metric problems in Grafana. We will walk through a step-by-step process for the most common issues in an AMS environment and how to troubleshoot them effectively.

Issues That Arise in AMS

  • Collector Not coming up or crashing frequently
  • Metrics Not available in Ambari UI
  • Grafana Metric related issues

Collector Not coming up or crashing frequently

This is the most common problem with the Ambari Metrics Collector (AMC). There can be multiple reasons for frequent or intermittent crashes of the AMC. Here is how to approach it step by step and debug the issue.

  • First, check that the AMS binaries are the same version as the current Ambari version. [Verify on all hosts where the Metrics Monitor is running; a loop for this is sketched below.]

 

[root@c1236-node4 ~]# rpm -qa|grep ambari-metrics
ambari-metrics-monitor-2.7.5.0-72.x86_64
ambari-metrics-collector-2.7.5.0-72.x86_64
ambari-metrics-hadoop-sink-2.7.5.0-72.x86_64

 

If they do not match, upgrade AMS accordingly. Follow Upgrading Ambari Metrics for the upgrade process.
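To compare the AMS binaries on every host in one pass, a small loop like the one below can help. This is a minimal sketch: the host names are placeholders and passwordless SSH from the Ambari server is assumed.

# Hypothetical host list; replace with the hosts running AMS components.
for h in c1236-node1 c1236-node2 c1236-node3 c1236-node4; do
  echo "== $h =="
  ssh "$h" 'rpm -qa | grep ambari-metrics'
done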

  • If ZooKeeper-related issues are observed in the AMC logs, with messages similar to the following:

 

2018-03-21 15:18:14,996 INFO org.apache.zookeeper.ClientCnxn: Opening socket connection to server hadoop.datalonga.com/10.XX.XX.XX:61181. Will not attempt to authenticate using SASL (unknown error)
2018-03-21 15:18:14,997 WARN org.apache.zookeeper.ClientCnxn: Session 0x0 for server null, unexpected error, closing socket connection and attempting reconnect
java.net.ConnectException: Connection refused
        at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
        at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:717)
        at org.apache.zookeeper.ClientCnxnSocketNIO.doTransport(ClientCnxnSocketNIO.java:361)
        at org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:1141)
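Before digging deeper, a quick sanity check is whether anything is listening on the collector's ZooKeeper port at all. The sketch below assumes the embedded-mode default port of 61181; the effective value is hbase.zookeeper.property.clientPort in ams-hbase-site.

# On the collector host: is the AMS ZooKeeper port listening?
netstat -tnlp | grep 61181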

 

 

  1. Check the collector-gc.log and gc.log files. If the heap is completely or almost full, as in the GC line below (506400K used out of a 506816K heap after the collection), increase the heap space for the collector.
    2020-09-28T06:27:47.846+0000: 503090.803: [GC (Allocation Failure) 2020-09-28T06:27:47.846+0000: 503090.804: [ParNew: 145749K->1896K(157248K), 0.0099317 secs] 506788K->506400K(506816K), 0.0103397 secs] [Times: user=0.13 sys=0.03, real=0.01 secs]
  2. We can also clear out the ZooKeeper data and then restart AMS from Ambari (a minimal sketch follows this list). For more information, check Cleaning up Ambari Metrics System Data.
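Here is a minimal sketch of that cleanup, assuming embedded mode and the default hbase.tmp.dir of /var/lib/ambari-metrics-collector/hbase-tmp; confirm the actual path in ams-hbase-site before touching anything.

# 1. Stop Ambari Metrics from the Ambari UI.
# 2. On the collector host, move the ZooKeeper data aside (safer than deleting it outright):
mv /var/lib/ambari-metrics-collector/hbase-tmp/zookeeper \
   /var/lib/ambari-metrics-collector/hbase-tmp/zookeeper.bak.$(date +%F)
# 3. Start Ambari Metrics again from the Ambari UI.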
  • Sometimes the logs show the error below, which indicates that the default port 6188 is already occupied.

 

INFO org.apache.hadoop.yarn.webapp.WebApps: Registered webapp guice modules
2017-08-04 15:30:36,965 INFO org.apache.hadoop.http.HttpServer2: HttpServer.start() threw a non Bind IOException
java.net.BindException: Port in use: xxxxxx:6188
        at org.apache.hadoop.http.HttpServer2.openListeners(HttpServer2.java:919)

 

In such a scenario:

  1. Check which process is using that port before starting the AMC.
    netstat -tnlpa | grep 6188
  2. Go to Ambari UI > Ambari Metrics > Configs (tab) > Advanced (child tab) > navigate to Advanced ams-site and search for the following property:
    timeline.metrics.service.webapp.address = 0.0.0.0:6188
    You can change that port to something else to avoid the conflict, then restart AMS (see the sketch after this list).
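If you prefer to change the property from the command line, Ambari ships a configs.py helper; the sketch below assumes Ambari 2.7-style flags and placeholder credentials, and 6189 is just an example of a free port, so adjust all of these for your environment.

# Hedged sketch: set a new collector webapp port via the Ambari config helper.
/var/lib/ambari-server/resources/scripts/configs.py \
  -u admin -p admin -l <ambari-server-host> -t 8080 -n <cluster-name> \
  -a set -c ams-site \
  -k timeline.metrics.service.webapp.address -v 0.0.0.0:6189
# Then restart Ambari Metrics from the Ambari UI.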
  • In most cases, we will observe ASYNC_PROCESS logging in the AMC log. In that case, check the hbase-ams-master-<hostname>.log file [embedded mode] or the hbase-ams-regionserver-<hostname>.log file [distributed mode]. You will frequently see log lines like the following:

 

WARN [JvmPauseMonitor] util.JvmPauseMonitor: Detected pause in JVM or host machine (eg GC): pause of approximately 4058ms GC pool 'ParNew' had collection(s): count=1 time=2415ms
WARN [RpcServer.FifoWFPBQ.default.handler=32,queue=2,port=61320] ipc.RpcServer: (responseTooSlow): {"call":"Multi(org.apache.hadoop.hbase.protobuf.generated.ClientProtos$MultiRequest)","starttimems":1609769073845,"responsesize":739217,"method":"Multi","processingtimems":10003,"client":"10.118.5.94:51114","queuetimems":0,"class":"HRegionServer"}
WARN [7,queue=0,port=16020] regionserver.RSRpcServices - Large batch operation detected (greater than 5000) (HBASE-18023). Requested Number of Rows: 7096 Client: xxx

 

In such cases, check whether services are generating far more metrics than the AMC can handle with its current configuration. The following pointers can help stabilize the collector.

  1. Check the metadata output and identify which services are generating a large number (>15k-20k) of metrics (see the one-liner after this list).
    http://<ams-host>:6188/ws/v1/timeline/metrics/metadata
  2. Increase the heap size of the region server (in distributed mode) or the HBase master (in embedded mode). [Check the gc.log and collector.log files to understand the current GC utilization.]
  3. If the services are generating a large number of metrics, limit them by implementing whitelisting or blacklisting and check whether AMS stabilizes.

    Ambari Metrics - Whitelisting
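To see at a glance which appIds contribute the most metrics, a rough count over the metadata output can help. This assumes the endpoint returns the usual JSON map of appId to a list of metric descriptors; replace <ams-host> with your collector host.

# Count metrics per appId, largest first.
curl -s "http://<ams-host>:6188/ws/v1/timeline/metrics/metadata" | python -c '
import json, sys
data = json.load(sys.stdin)
for app, metrics in sorted(data.items(), key=lambda kv: -len(kv[1])):
    print("%s %d" % (app, len(metrics)))
'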

  • In some cases, we see the following in the region-server logs, which indicates a problem with region opening on the region server:

 

ERROR [RpcServer.FifoWFPBQ.priority.handler=19,queue=1,port=61320] regionserver.HRegionServer: Received CLOSE for a region which is not online, and we're not opening.

 

To mitigate such issues, try the following steps and check the status of AMC.

 

Connect with the ZooKeeper CLI ->
/usr/lib/ams-hbase/bin/hbase --config /etc/ams-hbase/conf zkcli

Remove the AMS znode ->
rmr /ams-hbase-secure

Set a new znode path in the AMS configuration -> zookeeper.znode.parent, then restart AMS
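After the restart, a quick way to confirm the collector came back healthy is to query its metadata endpoint; a non-empty JSON response indicates it is serving again (replace <ams-host> with your collector host).

curl -s "http://<ams-host>:6188/ws/v1/timeline/metrics/metadata" | head -c 300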

 

Metrics not available in Ambari UI

If metrics are not showing in the Ambari UI, first check whether there is any issue with the collector itself. Below are some known issues in the latest versions of Ambari where graphs are not available even though the AMC is running fine.

 

In the latest versions of Ambari (2.7.3 and above), we have observed that NIFI, Ambari_Metrics, and Kafka do not show metrics. Here are the workarounds to mitigate the issue.

 

a) NIFI ->

1) vi /var/lib/ambari-server/resources/common-services/NIFI/1.0.0/metainfo.xml
Change <timelineAppid>NIFI</timelineAppid> to <timelineAppid>nifi</timelineAppid> in two places.

2) Replace "<timelineAppid>NIFI</timelineAppid>" with "<timelineAppid>nifi</timelineAppid>" in the /var/lib/ambari-server/resources/mpacks/hdf-ambari-mpack-3.5.1.*/common-services/NIFI/1.0.0/metainfo.xml file.


b) Ambari_Metrics ->

Replace "<timelineAppid>AMS-HBASE</timelineAppid>" with "<timelineAppid>ams-hbase</timelineAppid>" in /var/lib/ambari-server/resources/mpacks/hdf-ambari-mpack-3.5.1.*/stacks/HDF/3.2.b/services/AMBARI_METRICS/metainfo.xml file

c) Kafka ->

Add "<timelineAppid>kafka_broker</timelineAppid>" after "<name>KAFKA_BROKER</name>" in the /var/lib/ambari-server/resources/mpacks/hdf-ambari-mpack-3.5.1.*/stacks/HDF/3.3/services/KAFKA/metainfo.xml file.


After the above changes, restart ambari-server.
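As a convenience, the same edits can be scripted with sed. This is a hedged sketch: back up each file first and adjust the mpack paths to whatever actually exists on your Ambari server.

# NIFI: stack definition and mpack copy
sed -i.bak 's|<timelineAppid>NIFI</timelineAppid>|<timelineAppid>nifi</timelineAppid>|g' \
  /var/lib/ambari-server/resources/common-services/NIFI/1.0.0/metainfo.xml \
  /var/lib/ambari-server/resources/mpacks/hdf-ambari-mpack-3.5.1.*/common-services/NIFI/1.0.0/metainfo.xml
# Ambari Metrics
sed -i.bak 's|<timelineAppid>AMS-HBASE</timelineAppid>|<timelineAppid>ams-hbase</timelineAppid>|g' \
  /var/lib/ambari-server/resources/mpacks/hdf-ambari-mpack-3.5.1.*/stacks/HDF/3.2.b/services/AMBARI_METRICS/metainfo.xml
# Kafka: add the timelineAppid right after the component name
sed -i.bak 's|<name>KAFKA_BROKER</name>|<name>KAFKA_BROKER</name><timelineAppid>kafka_broker</timelineAppid>|' \
  /var/lib/ambari-server/resources/mpacks/hdf-ambari-mpack-3.5.1.*/stacks/HDF/3.3/services/KAFKA/metainfo.xml
ambari-server restart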

 

Grafana Metric-related issues

Sometimes we see multiple issues with Grafana, such as metrics not being available, time-range metrics not being shown, or inaccurate information in graphs. To diagnose Grafana issues, go through the following sequence.

  1. Check whether the AMC is running fine and you can see aggregation happening in the AMC logs, as in the excerpt below (a tail command for watching this live is sketched after it).

 

2021-01-22 09:22:23,051 INFO TimelineMetricHostAggregatorMinute: End aggregation cycle @ Fri Jan 22 09:22:23 UTC 2021
2021-01-22 09:23:20,780 INFO TimelineClusterAggregatorSecond: Started Timeline aggregator thread @ Fri Jan 22 09:23:20 UTC 2021
2021-01-22 09:23:20,784 INFO TimelineClusterAggregatorSecond: Last Checkpoint read : Fri Jan 22 09:20:00 UTC 2021
2021-01-22 09:23:20,784 INFO TimelineClusterAggregatorSecond: Rounded off checkpoint : Fri Jan 22 09:20:00 UTC 2021
2021-01-22 09:23:20,784 INFO TimelineClusterAggregatorSecond: Last check point time: 1611307200000, lagBy: 200 seconds.
2021-01-22 09:23:20,784 INFO TimelineClusterAggregatorSecond: Start aggregation cycle @ Fri Jan 22 09:23:20 UTC 2021, startTime = Fri Jan 22 09:20:00 UTC 2021, endTime = Fri Jan 22 09:22:00 UTC 2021
2021-01-22 09:23:20,784 INFO TimelineClusterAggregatorSecond: Skipping aggregation for metric patterns : sdisk\_%,boottime
2021-01-22 09:23:23,129 INFO TimelineClusterAggregatorSecond: Saving 23764 metric aggregates.
2021-01-22 09:23:23,868 INFO TimelineClusterAggregatorSecond: End aggregation cycle @ Fri Jan 22 09:23:23 UTC 2021
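To watch these aggregation cycles live on the collector host, something like the following works (the default AMS collector log directory is assumed; adjust if yours differs).

tail -f /var/log/ambari-metrics-collector/ambari-metrics-collector.log | grep -i aggregat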

 

  2. If data is missing from all graphs, check the metadata output and look for the services' metrics there.

  3. If data is missing from only a few graphs, check the metadata output for that particular service.
        Example: if data is not present for HiveServer2 metrics:

 

curl -v --insecure https://<ams-host>:6188/ws/v1/timeline/metrics/metadata?appId=hiveserver2 | python -m json.tool > hiveserver2.txt

 

  4. Check whether any whitelisting or blacklisting has been applied in the configuration, which might prevent AMS from processing those metrics. Also check whether any setting such as "Disable Minute host aggregator" is enabled in the configs (a quick grep for these is sketched after the list of known issues below).

  5. There are some known issues with Grafana in the latest versions of Ambari. A few of them are listed here:

AMBARI-25570
AMBARI-25563
AMBARI-25383
AMBARI-25457
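Regarding step 4 above, a quick way to spot whitelisting, blacklisting, or disabled-aggregator settings on the collector host is to grep the effective ams-site configuration (the default conf directory is assumed; adjust if yours differs).

grep -iE 'whitelist|blacklist|disabled' /etc/ambari-metrics-collector/conf/ams-site.xml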


Happy Learning!!!!!!!
