Many Ambari "stale alerts" messages

Contributor

Hi all,

Last night I got many of the following Ambari critical alerts:

There are {x} stale alerts from {n} host(s): {components list}

where {x}, {n} and {components list} were not always the same. For example:

There are 20 stale alerts from 1 host(s): NameNode Web UI, Metrics Monitor Status, WebHCat Server Status, NameNode High Availability Health, HST Server Process, NameNode Last Checkpoint, Flume Agent Status, Oozie Server Status, ZooKeeper Failover Controller Process, HBase Master Process, ResourceManager Web UI, HDFS Upgrade Finalized State, Ambari Agent Disk Usage, NameNode Directory Status, DataNode Health Summary, Oozie Server Web UI, DRPC Server Process, NodeManager Health Summary, RegionServers Health Summary, HiveServer2 Process

After 6 minutes, Ambari sent an OK alert:

All alerts have run within their time intervals.

These messages repeated over and over (13 critical and 13 OK in 5 hours). This is the first time I have seen so many alerts from our cluster in a single night, and all the services look fine in Ambari this morning. There have been no more alerts either.
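
To see which checks go stale most often across the repeated messages, the alert text can be tallied. This is only a minimal sketch: it assumes the messages follow the format quoted above, and the function names `parse_stale_alert` and `tally_components` are made up for illustration.

```python
import re
from collections import Counter

# Matches the message format quoted in the post:
# "There are {x} stale alerts from {n} host(s): {components list}"
STALE_RE = re.compile(r"There are (\d+) stale alerts from (\d+) host\(s\): (.+)")


def parse_stale_alert(message):
    """Return (alert_count, host_count, [check names]), or None if no match."""
    match = STALE_RE.search(message)
    if match is None:
        return None
    # The checks are listed as a comma-separated string.
    components = [name.strip() for name in match.group(3).split(",") if name.strip()]
    return int(match.group(1)), int(match.group(2)), components


def tally_components(messages):
    """Count how often each check name appears across many stale-alert messages."""
    counts = Counter()
    for message in messages:
        parsed = parse_stale_alert(message)
        if parsed is not None:
            counts.update(parsed[2])
    return counts
```

Feeding the bodies of the notification emails to `tally_components` and looking at `most_common()` would show whether the same few checks keep going stale or whether every check on the host does.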

Does anybody have any insight into what might cause this?

Thank you very much in advance!

Xi Sanderson

1 ACCEPTED SOLUTION

Contributor

Hi all,

I opened a support ticket and got an answer back regarding the Hive Metastore alerts. It is a known bug in the Ambari release I have (2.1.2):

https://issues.apache.org/jira/browse/AMBARI-14424

The suggested solution is to edit this script:

/var/lib/ambari-server/resources/common-services/HIVE/0.12.0.2.0/package/alerts/alert_hive_metastore.py

Search for 30 (the check timeout, in seconds) and replace it with 120, then restart the Ambari server.
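
For anyone applying the same change, here is a hedged sketch of doing the search-and-replace with a dry run and a backup. The thread does not show which line of the script holds the 30, so the pattern below simply looks for a standalone 30 (or 30.0); the helper names are hypothetical, and the candidate lines should be reviewed before anything is written.

```python
import re
import shutil
from pathlib import Path

# Matches 30 or 30.0 only when it stands alone, so 300, 1.30 and 30.5 are skipped.
# Assumption: the timeout appears as a plain numeric literal in the script.
STANDALONE_30 = re.compile(r"(?<![\w.])30(\.0+)?(?![\w.])")


def find_candidates(text):
    """Dry run: return (line_number, line) pairs that contain a standalone 30."""
    return [
        (number, line)
        for number, line in enumerate(text.splitlines(), start=1)
        if STANDALONE_30.search(line)
    ]


def bump_timeout(text, new_value="120"):
    """Replace every standalone 30 with new_value, keeping any '.0' suffix."""
    return STANDALONE_30.sub(lambda m: new_value + (m.group(1) or ""), text)


def patch_file(path, new_value="120"):
    """Copy the script to <name>.bak next to it, then rewrite it in place."""
    path = Path(path)
    shutil.copy2(path, path.with_name(path.name + ".bak"))
    path.write_text(bump_timeout(path.read_text(), new_value))
```

Running `find_candidates` on the script's text first shows every line that would change. If a match turns out to be something other than the timeout, editing the file by hand is the safer route. The Ambari server still needs a restart afterwards (`ambari-server restart`).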

I have yet to monitor how the change works.

Thanks for all the help from you guys!

Xi


13 REPLIES

Master Mentor

Contributor

Hi Neeraj,

Thank you very much for the link. I will give it a try.

Xi

Master Mentor
@Xi Sanderson

Take a look at the DNS entries and network timeout settings. Tagging @Jeff Groves, as I know he has done a lot of work on alerts.

Contributor

Hi Xi, I just wrote up a how-to on what looks like the same issue you are experiencing. Please take a look and see if it improves the situation. The gist of the fix is to increase the Kerberos ticket lifetime to a value larger than the check interval itself (5 minutes by default):

https://community.hortonworks.com/articles/10464/ambari-alerts-phantom-or-false-alerts-on-kerberize....

As usual, it is important to test any changes on a non-production environment first.

Thanks,

Jeff G.

Contributor

Hi Jeff,

Thanks for the information. Our cluster is not Kerberized, so the solution might not help me.

Xi

Master Mentor

@Xi Sanderson are you still having issues with this? Can you accept the best answer or post your solution?

Master Mentor

@Xi Sanderson please paste the logs for Hive and whatever other alerts are being referenced. We can't debug without more detail. Do you have a support account with Hortonworks? We recommend you enable SmartSense proactive monitoring and open tickets for individual problems if we are not able to address them here.

Contributor

Hi Artem,

I implemented the suggestion in the thread Neeraj referred to, but I still have the issue. On light days I get 5 or 6; on heavy days, still over 10.

I am also getting a lot of Hive Metastore check alerts (... '"'"'show databases;'"'"''' was killed due timeout after 30 seconds) with OK and Critical in the same email. Last night I got hundreds of those. It has to do with the load on the cluster.
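
One way to test the load theory would be to time the same check by hand at quiet and busy times of day. The following is a rough sketch, not what Ambari itself runs: `time_command` is a hypothetical helper, and it assumes the `hive` CLI is on the PATH of whoever runs it.

```python
import subprocess
import time


def time_command(cmd, timeout_s=30.0):
    """Run cmd and return (elapsed_seconds, timed_out).

    The process is killed if it runs longer than timeout_s, mirroring the
    30-second limit reported in the alert text.
    """
    start = time.monotonic()
    try:
        subprocess.run(
            cmd,
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
            timeout=timeout_s,
        )
        timed_out = False
    except subprocess.TimeoutExpired:
        timed_out = True
    return time.monotonic() - start, timed_out


# Example call (needs a working Hive client, so it is left commented out):
# elapsed, timed_out = time_command(["hive", "-e", "show databases;"])
# print(f"took {elapsed:.1f}s, timed out: {timed_out}")
```

If the elapsed time creeps toward 30 seconds only while heavy jobs are running, that would support the idea that the alerts come from load and not from a broken metastore.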

Any help is appreciated!

Xi

Contributor

Hi,

Yes, we are using SmartSense. I will open a support ticket too.

Here is one of the alerts:

Services Reporting Alerts

OK [HIVE]
CRITICAL [HIVE]

HIVE

OK Hive Metastore Process

Metastore OK - Hive command took 9.718s

CRITICAL Hive Metastore Process

Metastore on be-bi-secondary-528.soleocommunications.com failed (Execution of 'ambari-sudo.sh su ambari-qa -l -s /bin/bash -c 'export PATH='"'"'/usr/sbin:/sbin:/usr/lib/ambari-server/*:/sbin:/usr/sbin:/bin:/usr/bin:/var/lib/ambari-agent:/bin/:/usr/bin/:/usr/sbin/:/usr/hdp/current/hive-metastore/bin'"'"' ; export HIVE_CONF_DIR='"'"'/usr/hdp/current/hive-metastore/conf/conf.server'"'"' ; hive --hiveconf hive.metastore.uris=thrift://be-bi-secondary-528.soleocommunications.com:9083 --hiveconf hive.metastore.client.connect.retry.delay=1 --hiveconf hive.metastore.failure.retries=1 --hiveconf hive.metastore.connect.retries=1 --hiveconf hive.metastore.client.socket.timeout=14 --hiveconf hive.execution.engine=mr -e '"'"'show databases;'"'"''' was killed due timeout after 30 seconds)

This notification was sent to Ambari Alert From TheOracle Apache Ambari 2.1.2
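
Since each email carries both the OK timings and the timeouts, the notifications can be mined to see how close the successful runs come to the limit. This is a minimal sketch that assumes the two phrasings shown in the alert above; `summarize_metastore_alerts` is a made-up name.

```python
import re

# Phrases copied from the alert text above.
OK_RE = re.compile(r"Hive command took (\d+(?:\.\d+)?)s")
TIMEOUT_RE = re.compile(r"was killed due timeout after (\d+) seconds")


def summarize_metastore_alerts(text):
    """Return (ok_durations, timeouts) found in alert email text, in seconds."""
    ok_durations = [float(value) for value in OK_RE.findall(text)]
    timeouts = [int(value) for value in TIMEOUT_RE.findall(text)]
    return ok_durations, timeouts
```

In the email above, the OK run took 9.718s against a 30-second limit, so a check that sometimes takes three times longer under load would be enough to trip the alert.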

Thanks,

Xi