Flooded with failed health test alerts

snm1523 — Fri, 16 Sep 2022 14:10:49 GMT

Hello All,

I need some suggestions on the exact reason the health tests below keep failing. They fail about 3 times every day and generate at least 3-4 alerts each time.

1. HIVEMETASTORE_CANARY_HEALTH

2. REGION_SERVER_AUDIT_HEALTH

3. IMPALAD_QUERY_MONITORING_STATUS

4. HIVESERVER2_SCM_HEALTH

5. NAME_NODE_AUDIT_HEALTH

6. NODE_MANAGER_WEB_METRIC_COLLECTION

Any help or suggestion to permanently fix these alerts would be of great help. Guidance on how to reach the root cause would also be helpful.

Thanks

snm1523

Re: Flooded with failed health test alerts

gzigldrum — Fri, 22 Feb 2019 09:55:53 GMT

There won't be a generic fix for all these issues, as they may have different root causes. Each type of alert has to be looked into one by one, a root cause determined, and the corresponding action applied. This is best done in a support case. If you have no contract, please post the exact health alert message for the most bothersome alert you are getting here, and we will help to resolve the reported issues.

Re: Flooded with failed health test alerts

snm1523 — Fri, 22 Feb 2019 16:17:13 GMT

Hello Gzigldrum,

Thank you for the reply.

Below are the exact health messages from CM for each alert:

HIVEMETASTORE_CANARY_HEALTH:
The health test result for HIVEMETASTORE_CANARY_HEALTH has become bad: The Hive Metastore canary failed to create a database.
OR
The health test result for HIVEMETASTORE_CANARY_HEALTH has become bad: The Hive Metastore canary failed to create a partition in the table it created.
OR
The health test result for HIVEMETASTORE_CANARY_HEALTH has become bad: The Hive Metastore canary failed to drop the table it created.

REGION_SERVER_AUDIT_HEALTH:
The health test result for REGION_SERVER_AUDIT_HEALTH has become bad: There is a problem processing audits for REGIONSERVER.

IMPALAD_QUERY_MONITORING_STATUS:
The health test result for IMPALAD_QUERY_MONITORING_STATUS has become bad: There are 1 error(s) seen monitoring executing queries, and 0 errors(s) seen monitoring completed queries for this role in the previous 5 minute(s). Critical threshold: any.

HIVESERVER2_SCM_HEALTH:
The health test result for HIVESERVER2_SCM_HEALTH has become bad: This role's process is starting. This role is supposed to be started.

NAME_NODE_AUDIT_HEALTH:
The health test result for NAME_NODE_AUDIT_HEALTH has become bad: There is a problem processing audits for NAMENODE.

NODE_MANAGER_WEB_METRIC_COLLECTION:
The health test result for NODE_MANAGER_WEB_METRIC_COLLECTION has become bad: The Cloudera Manager Agent is not able to communicate with this role's web server.

These health alerts occur every 2-3 days within a time frame of 2-3 hours, creating around 5-10 tickets during each interval. We also checked on the network side to verify whether there was a network outage or glitch in those windows, but found nothing. We have tried to diagnose each alert through its logs, but haven't found anything interesting.

Hence, I am looking for more guidance on what else could be checked to identify the root cause of these.

Thanks

snm1523

Re: Flooded with failed health test alerts

gzigldrum — Mon, 25 Feb 2019 13:38:51 GMT

HIVESERVER2_SCM_HEALTH:
The health test result for HIVESERVER2_SCM_HEALTH has become bad: This role's process is starting. This role is supposed to be started.

This should not be happening frequently; if it does, please follow up with a support case. The issue can be resolved with the instructions in the KB article "Role Managed by Cloudera Manager Stuck in Stopping or Starting State | Configured_Status".

Re: Flooded with failed health test alerts

gzigldrum — Mon, 25 Feb 2019 13:46:08 GMT

NAME_NODE_AUDIT_HEALTH:
The health test result for NAME_NODE_AUDIT_HEALTH has become bad: There is a problem processing audits for NAMENODE.

REGION_SERVER_AUDIT_HEALTH:
The health test result for REGION_SERVER_AUDIT_HEALTH has become bad: There is a problem processing audits for REGIONSERVER.

These indicate issues sending the audit event logs created by these roles to the Navigator Audit Server. The root cause is either in the Navigator Audit Server or in the CM agent on the host where the role is deployed. Further investigation requires reviewing the CM agent logs on the host(s) and the Navigator Audit Server logs at the time the alert was raised. Any errors or exceptions seen at that time will indicate the root cause, or at least provide pointers for further investigation.
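
As a rough illustration of the log review described above, here is a minimal Python sketch that pulls error and exception lines out of a log within a few minutes of an alert time. The function name problems_near, the 10-minute window, and the "YYYY-MM-DD HH:MM:SS" timestamp prefix are all assumptions made for this example; the CM agent and Navigator Audit Server logs on your hosts may use a different timestamp layout, in which case the pattern and the strptime format would need adjusting.

```python
import re
from datetime import datetime, timedelta

# Assumed timestamp prefix, e.g. "2019-02-25 13:46:08,123 ERROR ...".
# Adjust this pattern and the strptime format to match your actual logs.
TS_RE = re.compile(r"^(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2})")
# Keywords that usually mark a problem line or a stack trace.
PROBLEM_RE = re.compile(r"ERROR|FATAL|Exception|Traceback")


def problems_near(lines, alert_time, window_minutes=10):
    """Return problem lines logged within +/- window_minutes of alert_time."""
    window = timedelta(minutes=window_minutes)
    hits = []
    in_window = False
    for line in lines:
        match = TS_RE.match(line)
        if match:
            ts = datetime.strptime(match.group(1), "%Y-%m-%d %H:%M:%S")
            in_window = abs(ts - alert_time) <= window
        # Lines without a timestamp (stack-trace continuations) inherit
        # the window state of the last timestamped line.
        if in_window and PROBLEM_RE.search(line):
            hits.append(line.rstrip("\n"))
    return hits


# Example with made-up log lines (not real Cloudera output):
sample = [
    "2019-02-25 13:40:00,001 INFO starting\n",
    "2019-02-25 13:45:30,500 ERROR failed to send audit events\n",
    "java.io.IOException: connection refused\n",
    "2019-02-25 16:00:00,000 ERROR unrelated\n",
]
for hit in problems_near(sample, datetime(2019, 2, 25, 13, 46, 8)):
    print(hit)
```

To use it on a real file, pass an open file object as lines and the alert timestamp from the Cloudera Manager alert. Running it against both the agent log and the audit server log for the same alert time makes it easier to see which side reported the failure first.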

Re: Flooded with failed health test alerts

gzigldrum — Mon, 25 Feb 2019 14:02:08 GMT

IMPALAD_QUERY_MONITORING_STATUS:
The health test result for IMPALAD_QUERY_MONITORING_STATUS has become bad: There are 1 error(s) seen monitoring executing queries, and 0 errors(s) seen monitoring completed queries for this role in the previous 5 minute(s). Critical threshold: any.

Please see the KB article "Impala | IMPALAD_QUERY_MONITORING_STATUS has become bad" for resolution.

Re: Flooded with failed health test alerts

snm1523 — Mon, 25 Feb 2019 18:00:48 GMT

Hello gzigldrum,

Thank you for the guidance. This certainly helps.

I will go through both the KB articles you shared and also review the Navigator Server and CM Agent logs for audit health warnings.

Will come back with the findings.

In the meantime, any luck with HIVEMETASTORE_CANARY_HEALTH and NODE_MANAGER_WEB_METRIC_COLLECTION?

Thanks

snm1523