Created on 01-18-2016 08:03 PM - edited 09-16-2022 02:58 AM
Hi all,
Last night I got many of the following Ambari critical alerts:
There are {x} stale alerts from {n} host(s): {components list}
where {x}, {n} and {components list} were not always the same. For example:
There are 20 stale alerts from 1 host(s): NameNode Web UI, Metrics Monitor Status, WebHCat Server Status, NameNode High Availability Health, HST Server Process, NameNode Last Checkpoint, Flume Agent Status, Oozie Server Status, ZooKeeper Failover Controller Process, HBase Master Process, ResourceManager Web UI, HDFS Upgrade Finalized State, Ambari Agent Disk Usage, NameNode Directory Status, DataNode Health Summary, Oozie Server Web UI, DRPC Server Process, NodeManager Health Summary, RegionServers Health Summary, HiveServer2 Process
After 6 minutes, Ambari sent an OK alert:
All alerts have run within their time intervals.
These messages repeated over and over again (13 critical, then 13 OK, in 5 hours). This is the first time I have seen so many alerts from our cluster in a single night, and all the services look fine in Ambari this morning. There have been no more alerts either.
Does anybody have any insight into what might cause this?
Thank you very much in advance!
Xi Sanderson
Created 02-26-2016 01:22 PM
Hi all,
I opened a support ticket and got an answer back regarding the metastore alerts. It is a known bug in the Ambari release I have (2.1.2):
https://issues.apache.org/jira/browse/AMBARI-14424
The suggested solution is to change the script:
/var/lib/ambari-server/resources/common-services/HIVE/0.12.0.2.0/package/alerts/alert_hive_metastore.py
Search for 30, replace it with 120, then restart the Ambari server.
I have yet to monitor how the change works.
Thanks for all the help, everyone!
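A minimal sketch of that edit in Python, for anyone who would rather not do it by hand. It assumes the value appears in the script as a `timeout=30` keyword argument, which is not confirmed in this thread, so inspect the file first. `bump_timeout` is a made-up helper name, and the sample line is illustrative only.

```python
import re

# Assumption: the timeout is written as a keyword argument such as `timeout=30`.
# Inspect alert_hive_metastore.py before editing; the real text may differ.
TIMEOUT_PATTERN = re.compile(r"\btimeout(\s*=\s*)30\b")


def bump_timeout(source, new_value=120):
    """Return (new_source, replacement_count) for the assumed pattern."""
    return TIMEOUT_PATTERN.subn(
        lambda m: "timeout%s%d" % (m.group(1), new_value), source
    )


if __name__ == "__main__":
    sample = "Execute(cmd, user=smokeuser, timeout=30)"  # illustrative line only
    new_text, count = bump_timeout(sample)
    print(count, new_text)
```

The word boundaries keep the pattern from touching values such as `timeout=300` or unrelated settings like `socket.timeout=14`. Keep a backup copy of the original script, and if the replacement count is 0, the assumption about the pattern is wrong for your version.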
Xi
Created 01-18-2016 08:05 PM
Created 01-20-2016 07:19 PM
Hi Neeraj,
Thank you very much for the link. I will give it a try.
Xi
Created 02-25-2016 05:24 PM
Take a look at the DNS entries and network timeout settings. Tagging @Jeff Groves, as I know he has done a lot of work on alerts.
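One rough way to check the DNS side is to time name resolution from each node. The sketch below is only an illustration in Python 3: `time_lookup` is a hypothetical helper, and the host list is a placeholder to be replaced with the cluster's own host names.

```python
import socket
import time


def time_lookup(hostname):
    """Return the seconds taken to resolve hostname, or None if it fails."""
    start = time.monotonic()
    try:
        socket.getaddrinfo(hostname, None)
    except socket.gaierror:
        return None
    return time.monotonic() - start


if __name__ == "__main__":
    # Placeholder hosts: substitute the cluster's own host names here.
    for host in [socket.getfqdn(), "localhost"]:
        elapsed = time_lookup(host)
        print(host, "FAILED" if elapsed is None else "%.3fs" % elapsed)
```

Lookups that fail or take seconds rather than milliseconds would be worth following up before looking at the alert settings themselves.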
Created 01-18-2016 11:04 PM
Hi Xi, I just wrote up a how-to on what looks like the same issue that you are experiencing. Please take a look and see if it improves things. The gist of the fix is to increase the Kerberos ticket lifetime to a value larger than the check interval itself (5 minutes by default).
As usual, it is important to test any changes in a non-production environment first.
Thanks,
Jeff G.
Created 01-20-2016 07:16 PM
Hi Jeff,
Thanks for the information. Our cluster is not Kerberized, so the solution might not help me.
Xi
Created 02-04-2016 02:39 AM
@Xi Sanderson are you still having issues with this? Can you accept the best answer or post your solution?
Created 02-25-2016 03:49 PM
@Xi Sanderson please paste the logs for Hive and whatever other alerts are being referenced. We can't debug without more detail. Do you have a support account with Hortonworks? We recommend you enable SmartSense proactive monitoring, and open tickets for individual problems if we are not able to address them here.
Created 02-25-2016 03:49 PM
Hi Artem,
I implemented the suggestion in the thread Neeraj referred to, but I still have the issue. On light days I get 5 or 6 alerts; on heavy days, still over 10.
I am also getting a lot of Hive Metastore check alerts (... '"'"'show databases;'"'"''' was killed due timeout after 30 seconds) with OK and Critical in the same email. Last night I got hundreds of those. It has to do with the load on the cluster.
Any help is appreciated!
Xi
Created 02-25-2016 04:25 PM
Hi,
Yes, we are using SmartSense. I will open a support ticket too.
Here is one of the alerts:
OK | [HIVE] |
CRITICAL | [HIVE] |
OK | Hive Metastore Process
Metastore OK - Hive command took 9.718s |
CRITICAL | Hive Metastore Process
Metastore on be-bi-secondary-528.soleocommunications.com failed (Execution of 'ambari-sudo.sh su ambari-qa -l -s /bin/bash -c 'export PATH='"'"'/usr/sbin:/sbin:/usr/lib/ambari-server/*:/sbin:/usr/sbin:/bin:/usr/bin:/var/lib/ambari-agent:/bin/:/usr/bin/:/usr/sbin/:/usr/hdp/current/hive-metastore/bin'"'"' ; export HIVE_CONF_DIR='"'"'/usr/hdp/current/hive-metastore/conf/conf.server'"'"' ; hive --hiveconf hive.metastore.uris=thrift://be-bi-secondary-528.soleocommunications.com:9083 --hiveconf hive.metastore.client.connect.retry.delay=1 --hiveconf hive.metastore.failure.retries=1 --hiveconf hive.metastore.connect.retries=1 --hiveconf hive.metastore.client.socket.timeout=14 --hiveconf hive.execution.engine=mr -e '"'"'show databases;'"'"''' was killed due timeout after 30 seconds) |
This notification was sent to Ambari Alert From TheOracle Apache Ambari 2.1.2
Thanks,
Xi
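To see how close the successful checks come to the 30-second limit, the alert emails can be tallied with a short script. This is a sketch only: the two patterns are taken from the alert text quoted above, `summarize` is a made-up helper name, and other Ambari versions may word the messages differently.

```python
import re

# Patterns copied from the alert text quoted in this thread.
TOOK = re.compile(r"Hive command took (\d+(?:\.\d+)?)s")
KILLED = re.compile(r"was killed due timeout after (\d+) seconds")


def summarize(alert_text):
    """Collect OK check durations and timeout limits found in alert text."""
    return {
        "ok_durations": [float(x) for x in TOOK.findall(alert_text)],
        "timeouts": [int(x) for x in KILLED.findall(alert_text)],
    }


if __name__ == "__main__":
    sample = (
        "Metastore OK - Hive command took 9.718s |\n"
        "... -e 'show databases;' was killed due timeout after 30 seconds) |"
    )
    print(summarize(sample))
```

If the OK durations regularly climb toward the timeout under load, that would be consistent with the checks being killed on busy nights rather than the metastore being down.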