Member since: 09-28-2015
Posts: 95
Kudos Received: 51
Solutions: 13
My Accepted Solutions
Title | Views | Posted
---|---|---
 | 1093 | 09-14-2017 11:20 PM
 | 1695 | 06-20-2017 06:26 PM
 | 841 | 05-31-2017 07:27 PM
 | 513 | 02-07-2017 06:24 PM
 | 3701 | 01-04-2017 11:11 PM
01-10-2019
01:23 PM
Please stop the collector, clean up the /ambari-metrics-cluster znode as well, and start it again. Alternatively, you can set a custom ams-site property: timeline.metrics.service.distributed.collector.mode.disabled = false.
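A minimal sketch of the znode cleanup, assuming the AMS-bundled HBase layout under /usr/lib/ams-hbase with configs in /etc/ams-hbase/conf (the zkcli invocation and paths are assumptions; adjust for your install and ZooKeeper quorum):

```bash
# Stop the Metrics Collector first (e.g., from the Ambari UI), then remove the znode.
# 'hbase zkcli' opens a ZooKeeper shell against the quorum configured for AMS HBase.
/usr/lib/ams-hbase/bin/hbase --config /etc/ams-hbase/conf zkcli
# Inside the zkcli shell:
#   rmr /ambari-metrics-cluster
#   quit
# Finally, start the Metrics Collector again from the Ambari UI.
```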
06-04-2018
09:25 PM
2 Kudos
@Michael Bronson This is a known issue in the HBase version used by AMS in Ambari 2.6.1. Please downgrade AMS to 2.6.0 using the following steps (a command sketch follows below):
1. Update the ambari.repo file on the Metrics Collector host to point to the 2.6.0.0 release.
2. yum clean all
3. Stop AMS.
4. yum remove ambari-metrics-collector
5. yum install ambari-metrics-collector
6. Verify the version of the AMS jars: /usr/lib/ambari-metrics-collector/ambari-metrics-*.jar
7. Start AMS.
8. Update the repo file back to the 2.6.1 version so that Ambari's setup is not disturbed.
There were minimal changes in AMS from 2.6.0 to 2.6.1. You can also bring back the 2.6.1 versions of the ambari-metrics-* jars in /usr/lib/ambari-metrics-collector after the yum downgrade, i.e., use the newest version of the AMS jars with the older version of HBase.
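A rough command sketch of the downgrade on the Metrics Collector host, assuming a yum-based OS and that AMS stop/start is done from the Ambari UI (the repo file edit itself is manual):

```bash
# 1. Point ambari.repo on the Metrics Collector host to the 2.6.0.0 release, then refresh yum metadata.
yum clean all

# 2. Stop AMS (Ambari UI), then swap the collector package.
yum remove -y ambari-metrics-collector
yum install -y ambari-metrics-collector

# 3. Verify the AMS jar versions picked up by the install.
ls /usr/lib/ambari-metrics-collector/ambari-metrics-*.jar

# 4. Start AMS (Ambari UI), then point ambari.repo back to the 2.6.1 repository.
```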
12-05-2017
06:26 PM
@Ivan Majnaric As of now, the Grafana upgrade to version 4.0 is planned for Ambari 3.0.0.
12-05-2017
06:21 PM
@Paramesh malla What version of Ambari is this? If it is 2.5.1, we have a known issue - https://issues.apache.org/jira/browse/AMBARI-21261
10-30-2017
05:03 PM
@Sidharth Kumar What mode is AMS operating in - embedded or distributed? Can you share the AMS HBase Master / RS log? (/var/log/ambari-metrics-collector/hbase-ams*.log)
10-30-2017
05:01 PM
@elkan li What Ambari version is this? Can you attach HDFS Namenode / Datanode logs?
10-26-2017
08:10 PM
3 Kudos
Understanding scale issues in AMS (Why does it happen?)
The Metrics Collector component is the central daemon that receives metrics from ALL the service sinks and monitors that send metrics. The collector uses HBase as its store and Phoenix as the data access layer. At a high level, the Metrics Collector performs 2 scale-related operations on a continuous basis.
- Handle raw writes - A raw write is a bunch of metric data points received from services, written to HBase through Phoenix. There is no read or aggregation involved.
- Periodically aggregate data - AMS aggregates data across the cluster and across time.
- Cluster Aggregator - Computing the min, max, avg, and sum of memory across all hosts is done by a cluster aggregator. This is the 'TimelineClusterAggregatorSecond', which runs every 2 mins. In every run it reads the entire last 2 mins of data, computes the aggregates, and writes them back. The read is expensive since it has to read non-aggregated data, while the write volume is smaller since it is aggregated data. For example, in a 100-node cluster, mem_free from 100 hosts becomes 1 aggregate metric value in this aggregator.
- Time Aggregator - Also called 'downsampling', this aggregator rolls up data in the time dimension. This lets AMS TTL out the finer-precision seconds data while holding aggregate data for a longer time. For example, if there is a data point every 10 seconds, the 5-min time aggregator takes the 30 data points from every 5-min window and creates 1 rolled-up value. There are higher-level downsamplers (1-hour, 1-day) as well, and they use their immediate predecessor's data (1hr => 5min, 1day => 1hr). However, it is the 5-min aggregator that is compute-heavy, since it reads the entire last 5 mins of data and downsamples it. Again, the read is very expensive since it has to read non-aggregated data, while the write volume is smaller. This downsampler is called 'TimelineHostAggregatorMinute'.

Scale problems occur in AMS when one or both of the above operations cannot happen smoothly. The 'load' on AMS is determined by the following factors:
- How many hosts are in the cluster?
- How many metrics is each component sending to AMS?

Either of the above can cause performance issues in AMS.

How do we find out if AMS is experiencing scale problems? One or more of the following consequences can be seen on the cluster:

- The Metrics Collector shuts down intermittently. Since auto restart is enabled for the Metrics Collector by default, this will show up as an alert stating 'Metrics collector has been auto restarted # times the last 1 hour'.
- Partial metrics data is seen:
  - All non-aggregated host metrics are seen (HDFS NameNode metrics / host summary page on Ambari / System - Servers Grafana dashboard).
  - Aggregated data is not seen (AMS summary page / System - Home Grafana dashboard / HBase - Home Grafana dashboard).

Step 1: Get the current state of the system.

Fixing / Recovering from the problem: the above problems could occur because of 2-3 underlying reasons.

Underlying Problem: Too many metrics (#4 from above).
What it could cause: It could cause ALL of the problems mentioned above.
Fix / Workaround:
#1: Trying out config changes
- First, try increasing the memory of the Metrics Collector and of the HBase Master / RS, based on mode. (Refer to the memory configurations table at the top of the page.)
- Configure AMS to read more data in a single Phoenix fetch: set ams-site: timeline.metrics.service.resultset.fetchSize = 5000 (for <100 nodes) or 10000 (>100 nodes).
- Increase the HBase RegionServer handler count: set ams-hbase-site: hbase.regionserver.handler.count = 30.
- If Hive is sending a lot of metrics, do not aggregate Hive table metrics: set ams-site: timeline.metrics.cluster.aggregation.sql.filters = sdisk_%,boottime,default.General% (only from Ambari 2.5.0).
#2: Reducing the number of metrics
If the above config changes do not increase AMS stability, you can whitelist selected metrics or blacklist the metrics of the components that are causing the load issue.
Whitelisting doc: Ambari Metrics - Whitelisting

Underlying Problem: The AMS node has slow disk speed; the disk is not able to keep up with the high data volume.
What it could cause: It can cause both raw-write and aggregation problems.
Fix / Workaround:
- On larger clusters (>800 nodes) with distributed mode, we suggest 3-5 SSDs on the Metrics Collector node and a config group for the DataNode on that host so it uses those 3-5 disks as directories.
- ams-hbase-site :: hbase.rootdir - change this path to a disk mount that is not heavily contended.
- ams-hbase-site :: hbase.tmp.dir - change this path to a location different from hbase.rootdir.
- ams-hbase-site :: hbase.wal.dir - change this path to a location different from hbase.rootdir (from Ambari 2.5.1).
- Metric whitelisting will also help in decreasing the metric load.

Underlying Problem: Known issues around the HBase normalizer and FIFO compaction, documented in Known Issues (#11 and #13).
What it could cause: This can be identified via #5 in the above table.
Fix / Workaround: Follow the workaround steps in the Known Issues doc.

Other Advanced Configurations

Configuration | Property | Description | Recommended value
---|---|---|---
ams-site | phoenix.query.maxGlobalMemoryPercentage | Percentage of total heap memory used by Phoenix threads in the Metrics Collector API/aggregator daemon. | 20 - 30, based on available memory. Default = 25.
ams-site | phoenix.spool.directory | Directory for Phoenix spill files (client side). | Set this to a different disk from hbase.rootdir if possible.
ams-hbase-site | phoenix.spool.directory | Directory for Phoenix spill files (server side). | Set this to a different disk from hbase.rootdir if possible.
ams-hbase-site | phoenix.query.spoolThresholdBytes | Threshold size in bytes after which results of queries executed in parallel are spooled to disk. | Set this to a higher value based on available memory. Default is 12 MB.
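The config changes called out above (fetchSize, handler count, and the advanced Phoenix properties) can be applied from the Ambari UI. As a rough alternative sketch, Ambari's bundled configs.sh script can set them from the command line; the script path, Ambari host, credentials, and cluster name below are assumptions to adjust for your environment:

```bash
#!/usr/bin/env bash
# Placeholder values - replace with your Ambari host, admin credentials, and cluster name.
AMBARI_HOST=ambari.example.com
CLUSTER=MyCluster
CONFIGS=/var/lib/ambari-server/resources/scripts/configs.sh

# Read more rows per Phoenix fetch (value for a >100-node cluster shown).
$CONFIGS -u admin -p admin set $AMBARI_HOST $CLUSTER ams-site \
  "timeline.metrics.service.resultset.fetchSize" "10000"

# Increase the AMS HBase RegionServer handler count.
$CONFIGS -u admin -p admin set $AMBARI_HOST $CLUSTER ams-hbase-site \
  "hbase.regionserver.handler.count" "30"

# Restart the Metrics Collector from the Ambari UI for the changes to take effect.
```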
10-11-2017
08:38 PM
@darkz yu Yes, that sounds right. Sorry I missed the step which would have triggered the modification of the '/etc/hadoop/conf/hadoop-metrics2.properties' file. So, is your problem fixed now?
10-11-2017
06:45 PM
@darkz yu I believe those were left out since they are less used than the current DN metrics in the Grafana dashboard. However, you can create a new DataNode JVM dashboard in Grafana. The following JVM metrics are available and collected:
jvm.JvmMetrics.GcCount
jvm.JvmMetrics.GcCountConcurrentMarkSweep
jvm.JvmMetrics.GcCountParNew
jvm.JvmMetrics.GcNumInfoThresholdExceeded
jvm.JvmMetrics.GcNumWarnThresholdExceeded
jvm.JvmMetrics.GcTimeMillis
jvm.JvmMetrics.GcTimeMillisConcurrentMarkSweep
jvm.JvmMetrics.GcTimeMillisParNew
jvm.JvmMetrics.GcTotalExtraSleepTime
jvm.JvmMetrics.LogError
jvm.JvmMetrics.LogFatal
jvm.JvmMetrics.LogInfo
jvm.JvmMetrics.LogWarn
jvm.JvmMetrics.MemHeapCommittedM
jvm.JvmMetrics.MemHeapMaxM
jvm.JvmMetrics.MemHeapUsedM
jvm.JvmMetrics.MemMaxM
jvm.JvmMetrics.MemNonHeapCommittedM
jvm.JvmMetrics.MemNonHeapMaxM
jvm.JvmMetrics.MemNonHeapUsedM
jvm.JvmMetrics.ThreadsBlocked
jvm.JvmMetrics.ThreadsNew
jvm.JvmMetrics.ThreadsRunnable
jvm.JvmMetrics.ThreadsTerminated
jvm.JvmMetrics.ThreadsTimedWaiting
jvm.JvmMetrics.ThreadsWaiting
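If it helps, here is a hedged example of pulling one of these metrics directly from the AMS API; the collector host, DataNode host, and time range are placeholders, and appId 'datanode' is an assumption:

```bash
# Fetch heap usage for one DataNode over the last hour from the Metrics Collector API.
COLLECTOR=ams-collector.example.com   # placeholder
DN_HOST=dn1.example.com               # placeholder
END=$(($(date +%s) * 1000))           # now, in epoch milliseconds
START=$((END - 3600 * 1000))          # one hour ago

curl -s "http://${COLLECTOR}:6188/ws/v1/timeline/metrics?metricNames=jvm.JvmMetrics.MemHeapUsedM&appId=datanode&hostname=${DN_HOST}&startTime=${START}&endTime=${END}"
```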
10-03-2017
06:53 PM
@darkz yu This is a known regression fixed in Ambari 2.5.2 (https://issues.apache.org/jira/browse/AMBARI-21328). To work around it, you can:
1. Replace the contents of the file '/var/lib/ambari-server/resources/stacks/HDP/2.0.6/hooks/before-START/templates/hadoop-metrics2.properties.j2' with the content from https://github.com/apache/ambari/blob/branch-2.5/ambari-server/src/main/resources/stacks/HDP/2.0.6/hooks/before-START/templates/hadoop-metrics2.properties.j2
2. Restart Ambari Server.
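A rough sketch of that workaround on the Ambari Server host; the raw.githubusercontent.com URL below is assumed to be the raw form of the file linked above, and the backup path is illustrative:

```bash
TEMPLATE=/var/lib/ambari-server/resources/stacks/HDP/2.0.6/hooks/before-START/templates/hadoop-metrics2.properties.j2

# Keep a backup of the shipped template, then overwrite it with the branch-2.5 version.
cp "$TEMPLATE" "$TEMPLATE.bak"
curl -fL -o "$TEMPLATE" \
  "https://raw.githubusercontent.com/apache/ambari/branch-2.5/ambari-server/src/main/resources/stacks/HDP/2.0.6/hooks/before-START/templates/hadoop-metrics2.properties.j2"

# Restart Ambari Server to pick up the new template.
ambari-server restart
```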
09-15-2017
04:24 PM
@Mateusz Grabowski The difference could be due to when the disk stats were measured. The stats in KB come from the Ambari agent, and the ones in GB come from the Ambari Metrics Service.
09-14-2017
11:32 PM
@Sam Debraw
1. Stop Grafana from the Ambari UI (kill the process if needed).
2. Delete the contents of the directory /var/lib/ambari-metrics-grafana/*
3. Change the Grafana password in the Ambari UI to 'admin'.
4. Start Grafana from the Ambari UI.
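A brief shell sketch of steps 1-2; the pgrep/pkill match pattern is an assumption, and the data directory is the default one from the steps above:

```bash
# After stopping Grafana from the Ambari UI, make sure no grafana-server process is left behind.
pgrep -f ambari-metrics-grafana && pkill -f ambari-metrics-grafana

# Wipe Grafana's local state (per step 2 above) before resetting the password and starting it again.
rm -rf /var/lib/ambari-metrics-grafana/*
```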
09-14-2017
11:20 PM
1 Kudo
@Mateusz Grabowski
1. From the Ambari agent implementation, it seems both counts use the same Python API - https://docs.python.org/2/library/multiprocessing.html#multiprocessing.cpu_count. Hence they should have the same value. Are you seeing different values in your API response? If so, I can create a jira to track the bug.
2. There does not seem to be a direct entry for that. You probably have to use the length of the 'disk_info' array in the JSON response to determine it.
3. Both of them give the same information. The first one is per-disk stats in KB, and the second one is total disk stats in GB.
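For point 2, a hedged example of counting disks from the Ambari REST API; the Ambari host, credentials, cluster name, and host name are placeholders, and jq is assumed to be installed:

```bash
# Count entries in the Hosts/disk_info array returned by the Ambari API.
AMBARI=ambari.example.com   # placeholder
CLUSTER=MyCluster           # placeholder
HOST=host1.example.com      # placeholder

curl -s -u admin:admin \
  "http://${AMBARI}:8080/api/v1/clusters/${CLUSTER}/hosts/${HOST}?fields=Hosts/disk_info" \
  | jq '.Hosts.disk_info | length'
```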
09-14-2017
06:25 PM
@Sam Red There have been some issues fixed in the metric-based alert script. What version of Ambari is this? Can you attach the response for the following metrics GET call? http://<METRICS_COLLECTOR_HOST>:6188/ws/v1/timeline/metrics?
metricNames=dfs.FSNamesystem.CapacityUsed&
appId=namenode&
hostname=<namenode_host>,<standby_namenode_host>&
startTime=<current_time - 7days>&
endTime=<current_time> The start and end times should be specified in milliseconds; you can convert dates using this link: http://www.ruddwire.com/handy-code/date-to-millisecond-calculators/#.WbrI_tOGOqA
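For convenience, a sketch of the same call with the millisecond timestamps computed in the shell; the collector and NameNode hosts are placeholders:

```bash
COLLECTOR=ams-collector.example.com        # placeholder
NN_HOSTS=nn1.example.com,nn2.example.com   # placeholder: active and standby NameNodes

END=$(($(date +%s) * 1000))                # now, in epoch milliseconds
START=$((END - 7 * 24 * 3600 * 1000))      # 7 days ago

curl -s "http://${COLLECTOR}:6188/ws/v1/timeline/metrics?metricNames=dfs.FSNamesystem.CapacityUsed&appId=namenode&hostname=${NN_HOSTS}&startTime=${START}&endTime=${END}"
```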
09-11-2017
06:31 PM
1 Kudo
Ambari Metrics Service uses the psutil library to collect host metrics. The metrics' definitions can be found here - https://pythonhosted.org/psutil/. The Load graph contains the 1-minute, 5-minute, and 15-minute load averages on the host. For more information, from the psutil docs:
Q: What about load average? A: psutil does not expose any load average function as it's already available in Python as os.getloadavg (https://docs.python.org/2/library/os.html#os.getloadavg).
06-20-2017
06:26 PM
2 Kudos
@Sebastien Chausson A couple of pointers. One reason your metrics might have been discarded is that, by default, AMS discards data that is more than 5 mins in the past. Check whether the "startTime" in your request was more than 5 mins before the time the request was made. This 5-min value can be changed by adding a custom ams-site config, 'timeline.metrics.service.outofband.time.allowance.millis', which is the discard time boundary in milliseconds. Also, check whether the metric was at least tracked by the AMS metadata via the http://server1.mydomain.com:6188/ws/v1/timeline/metrics/metadata URL; search for your custom metric name or appId. To connect to AMS Phoenix, first ssh to the collector host and run:
su ams
cd /usr/lib/ambari-metrics-collector/bin
./sqlline.py localhost:61181:/ams-hbase-secure
(Make sure you have JAVA available.) You should be looking at the METRIC_RECORD table.
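A hedged sketch of that check from the collector host; the column names and the example metric-name prefix are assumptions, so verify them against what !describe shows:

```bash
# Connect to AMS Phoenix (secure znode path from the post above).
su - ams
cd /usr/lib/ambari-metrics-collector/bin
./sqlline.py localhost:61181:/ams-hbase-secure

# At the sqlline prompt, a sanity check that your metric is being written
# (columns assumed from the AMS METRIC_RECORD schema; verify with: !describe METRIC_RECORD):
#   SELECT METRIC_NAME, APP_ID, HOSTNAME, SERVER_TIME
#   FROM METRIC_RECORD
#   WHERE METRIC_NAME LIKE 'mycustom%'      -- hypothetical metric-name prefix
#   ORDER BY SERVER_TIME DESC LIMIT 20;
```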
06-05-2017
10:39 PM
@Robin Dong Please ignore the warning about the AMS components' heap sizes and use the default recommendation. If the start still fails after that, please attach the logs here.
05-31-2017
07:27 PM
@marko Just 'apt-get install ambari-metrics-assembly' should be sufficient for upgrading Ambari Metrics on Ubuntu 14. Can you confirm by checking the ambari-metrics-*.jar versions in /usr/lib/ambari-metrics-collector?
root@avijayan-ubuntu-1:~# ls /usr/lib/ambari-metrics-collector/ambari-metrics-*
/usr/lib/ambari-metrics-collector/ambari-metrics-common-2.5.0.3.7.jar /usr/lib/ambari-metrics-collector/ambari-metrics-timelineservice-2.5.0.3.7.jar
05-08-2017
07:09 PM
1 Kudo
@Janos Geller Please try changing the following ams-site config:
Config key - timeline.metrics.service.webapp.address
Current value - 0.0.0.0::host_group_1%:6188
Recommended value - 0.0.0.0:6188
Start / restart the Metrics Collector after this change.
05-08-2017
06:24 PM
@Janos Geller Can you attach the following?
/var/log/ambari-metrics-collector/ambari-metrics-collector.log
/etc/ambari-metrics-collector/conf/ams-site.xml
/etc/ams-hbase/conf/hbase-site.xml
05-02-2017
12:40 AM
@Philippe Kernevez Is the Zookeeper Service on your cluster up and running?
04-26-2017
06:39 PM
1 Kudo
@Saif This issue has been fixed in https://issues.apache.org/jira/browse/AMBARI-19054. Can you upgrade to Ambari 2.5.0? Or you can manually update the /usr/sbin/ambari-metrics-grafana file with the one from Ambari 2.5.0.
04-06-2017
06:59 PM
1 Kudo
@Michael DeGuzis If your cluster has the Ambari Metrics Service deployed, then I would suggest using this reference - https://cwiki.apache.org/confluence/display/AMBARI/Metrics+Collector+API+Specification. It is richer and gives you more control over the data you need to fetch.
03-22-2017
08:14 PM
1 Kudo
@Bruce Perez This intermittent issue is fixed in Ambari 2.5.0 - https://issues.apache.org/jira/browse/AMBARI-19054. Please change the pid file to match the actual PID:
echo 31695 > /var/run/ambari-metrics-grafana/grafana-server.pid
03-20-2017
06:43 PM
@Julie, You can use the ams-env config in Ambari to include Java opts:
export AMS_COLLECTOR_OPTS="$AMS_COLLECTOR_OPTS $AMS_COLLECTOR_GC_OPTS"
What options are you planning to include? If you are interested in changing table TTLs, you can use the configs in ams-site to change the TTL for any AMS table:
timeline.metrics.<>.aggregator.<>.ttl
Please let me know if you have any more questions.
02-07-2017
06:24 PM
When a hostname is not specified, AMS returns data aggregated across all hosts/appIds that send this metric. The aggregation happens every 2 mins, creating 4 x 30-second aggregates per iteration. Add the following query parameter to your curl URL: &hostname=<resourcemanager_host> This should return the most recent data.
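For illustration, a hedged sketch of the per-host call; the collector host, ResourceManager host, and metric name are placeholders, and any ResourceManager metric you are already querying works the same way:

```bash
COLLECTOR=ams-collector.example.com     # placeholder
RM_HOST=rm1.example.com                 # placeholder
METRIC=yarn.ClusterMetrics.NumActiveNMs # hypothetical example metric

# Per-host (non-aggregated) series: note the extra hostname parameter.
curl -s "http://${COLLECTOR}:6188/ws/v1/timeline/metrics?metricNames=${METRIC}&appId=resourcemanager&hostname=${RM_HOST}"
```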
02-02-2017
07:03 PM
1 Kudo
The cpu_wio metric corresponds to the following metric being captured by psutil.
iowait (Linux): percentage of time spent waiting for I/O to complete. For reference, cpu_wio comes from the following psutil API - https://pythonhosted.org/psutil/#psutil.cpu_times_percent. In the YARN page, cpu_wio._avg is the average metric value across all nodes in the YARN cluster (NodeManagers), and cpu_wio._max is the maximum of all the cpu_wio values from the YARN cluster. You can use the "System - Servers" Grafana dashboard to delve deeper into why high values are seen in the graph. This metric is captured in the "CPU - IOWAIT/INTR" section of that dashboard.
02-01-2017
09:41 PM
@Priyansh Saxena This issue seems strange. Can you increase the window size of the interval you request (around 10 mins between start and end times)? Also, can you share the full JSON you got when you get nulls, as well as when you get metrics? I can compare them to find out what is wrong.