Created 11-19-2018 06:46 AM
My Ambari Metrics Collector keeps going down.
In the log, I found:
Failed to get result with timeout, timeout = 300000ms row 'METRIC_AGGREGATE' on table 'SYSTEM.CATALOG' at region=SYSTEM.CATALOG, ***** host name=[hostname], [port] *** seqNum=6 .... Caused by .... IOException: Failed to get result with timeout, timeout=300000ms
Then it gets stuck at:
org.apache.hadoop.hbase.client.AsyncProcess: #1 waiting for 28379 actions to finish
It seems related to HBase, but my HBase runs fine without any errors.
If I restart the Ambari Metrics Collector, it recovers immediately, but it becomes unavailable again after several hours.
How can I fix this? Thanks.
Created 11-19-2018 12:21 PM
1. Which version of AMS are you using?
If you are using Ambari 2.6.x, then I suggest upgrading to the Ambari Metrics Collector that ships with Ambari 2.6.2.2; it is much more stable and includes many additional fixes.
2. Can you please share the logs present inside /var/log/ambari-metrics-collector/, specifically ambari-metrics-collector.log, ams-hbase-master.log, and the GC logs collector-gc.log and gc.log?
3. While AMS is running fine, can you please collect the output of the following API calls and attach the JSON output here? This will help us determine whether any performance tuning is needed.
http://<ams-host>:6188/ws/v1/timeline/metrics/metadata
http://<ams-host>:6188/ws/v1/timeline/metrics/hosts
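For example, you can capture the JSON from those endpoints with plain curl; this is just a sketch, where 6188 is the default collector port and <ams-host> should be replaced with your collector host:
curl -s -o metrics_metadata.json "http://<ams-host>:6188/ws/v1/timeline/metrics/metadata"
curl -s -o metrics_hosts.json "http://<ams-host>:6188/ws/v1/timeline/metrics/hosts"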
.
Created 11-20-2018 02:27 AM
I am using Ambari 2.6.1.0
The Ambari Metrics Collector log is stored in a production environment, so I cannot export it. I can only give you the error log as a hand-typed copy:
AMS is currently running in embedded mode.
Ambari Metrics Collector log:
MetaDataProtos$MetaDataService for row \x00\x00METRIC_RECORD
....
Caused by java.lang.InterruptedException
ams hbase log:
FSDataInputStreamWrapper: Failed to invoke 'unbuffer' method in class class org.apache.hadoop.fs.FSDataInputStream
So there may be a TCP socket connection left open in CLOSE_WAIT state
....
Caused by: java.lang.UnsupportedOperationException: this stream does not support unbuffering
All AMS settings are at their defaults. There are 6 nodes in the cluster in total. The host running AMS has 64 cores and 256 GB of RAM; currently it has 53 GB of free memory and 233 GB free including cache.
Created 11-20-2018 02:36 AM
You seem to be hitting a known bug in Ambari Metrics Collector 2.6.1 which causes too many CLOSE_WAIT sockets and ultimately leads to an AMS shutdown after some time.
You will notice a growing number of CLOSE_WAIT sockets over a period of time, which you can check with:
# netstat -anlp | grep :6188 | grep CLOSE_WAIT | wc -l
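If you want to confirm that the count actually grows over time (rather than checking just once), a simple loop like this sketch can be left running for a while; the 5-minute interval and the log path are arbitrary choices:
while true; do
  echo "$(date)  $(netstat -anlp 2>/dev/null | grep ':6188' | grep CLOSE_WAIT | wc -l)" >> /tmp/ams_close_wait.log
  sleep 300
done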
.
I am sure that, apart from the following entry in your logs:
TCP socket connection left open in CLOSE_WAIT state
you will also find the following kind of logging inside your /var/log/ambari-metrics-collector/hbase-ams-master-*.log file. If you notice that line, it means you are hitting the same bug:
Failed to invoke 'unbuffer' method in class class org.apache.hadoop.fs.FSDataInputStream
.
Remedy:
You should upgrade to Ambari 2.6.2.2, which has many additional fixes, including some security-related fixes, from the AMS perspective.
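For reference, on a yum-based OS the package-level part of the upgrade roughly looks like the sketch below; this is only an outline (the repo URL and exact package set depend on your environment), so please follow the official Ambari 2.6.2.2 upgrade documentation:
# On the Ambari server host, after pointing /etc/yum.repos.d/ambari.repo at the 2.6.2.2 repository:
ambari-server stop
yum clean all
yum upgrade ambari-server
ambari-server upgrade
ambari-server start
# On each cluster host:
yum upgrade ambari-agent ambari-metrics-monitor
# Additionally on the AMS collector host:
yum upgrade ambari-metrics-collector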
The post-upgrade steps, which include the AMS upgrade, are mentioned in the following doc:
Created 11-20-2018 02:46 AM
For netstat -anlp | grep :6188 | grep CLOSE_WAIT | wc -l
I get 0. Maybe my restarting the service has cleared the problem temporarily.
Do you mean the CLOSE_WAIT problem is related to
Failed to invoke 'unbuffer' method in class class org.apache.hadoop.fs.FSDataInputStream
?
Created 11-20-2018 03:07 AM
Yes, the CLOSE_WAIT connection message in your logs points to the same bug, which is addressed in Ambari 2.6.2.2:
TCP socket connection left open in CLOSE_WAIT state
.
So it is better to upgrade.
Restarting the AMS will temporarily fix the issue, but after some time you may again notice that the AMS has gone down. The permanent remedy is to upgrade Ambari and AMS to 2.6.2.2.
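Until you upgrade, the temporary restart can be done from the Ambari UI, or on the collector host with something like the following (the script path and config directory are the usual defaults and may differ in your install):
su - ams -c '/usr/sbin/ambari-metrics-collector --config /etc/ambari-metrics-collector/conf/ stop'
su - ams -c '/usr/sbin/ambari-metrics-collector --config /etc/ambari-metrics-collector/conf/ start'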
Created 11-20-2018 03:11 AM
So can I upgrade the Ambari Metrics service individually, instead of the whole of Ambari?
Created 11-20-2018 05:42 AM
Yes, but it is better to upgrade Ambari and AMS together, as that is the most recommended approach.