Flooded with failed health test alerts

Expert Contributor

Hello All,

I need some suggestions on why the health tests below fail repeatedly. They fail about 3 times every day and generate at least 3-4 alerts each time.

1. HIVEMETASTORE_CANARY_HEALTH
2. REGION_SERVER_AUDIT_HEALTH
3. IMPALAD_QUERY_MONITORING_STATUS
4. HIVESERVER2_SCM_HEALTH
5. NAME_NODE_AUDIT_HEALTH
6. NODE_MANAGER_WEB_METRIC_COLLECTION

Any help or suggestions to permanently fix these alerts would be of great help. Guidance on how to reach the root cause would also be helpful.

Thanks
snm1523

1 ACCEPTED SOLUTION

Super Collaborator
NAME_NODE_AUDIT_HEALTH:
The health test result for NAME_NODE_AUDIT_HEALTH has become bad: There is a problem processing audits for NAMENODE.

REGION_SERVER_AUDIT_HEALTH:
The health test result for REGION_SERVER_AUDIT_HEALTH has become bad: There is a problem processing audits for REGIONSERVER.

These alerts indicate a problem sending the audit event logs created by these roles to the Navigator Audit Server. The root cause lies either in the Navigator Audit Server or in the CM agent on the host where the role is deployed. To investigate further, review the CM agent logs on the affected host(s) and the Navigator Audit Server logs for the time when the alert was raised. Any errors or exceptions logged around that time will indicate the root cause, or at least provide pointers for further investigation.


6 REPLIES

Super Collaborator

There won't be a generic fix for all of these issues, as they may have different root causes. Each type of alert has to be looked into one by one, its root cause determined, and the corresponding action applied. This is best done in a support case. If you have no support contract, please post here the exact health alert message for the alert that bothers you most, and we will help to resolve the reported issues.

Expert Contributor

Hello Gzigldrum,

Thank you for the reply.

Below are the exact health messages from CM for each alert:

HIVEMETASTORE_CANARY_HEALTH:
The health test result for HIVEMETASTORE_CANARY_HEALTH has become bad: The Hive Metastore canary failed to create a database.
OR
The health test result for HIVEMETASTORE_CANARY_HEALTH has become bad: The Hive Metastore canary failed to create a partition in the table it created.
OR
The health test result for HIVEMETASTORE_CANARY_HEALTH has become bad: The Hive Metastore canary failed to drop the table it created.

REGION_SERVER_AUDIT_HEALTH:
The health test result for REGION_SERVER_AUDIT_HEALTH has become bad: There is a problem processing audits for REGIONSERVER.

IMPALAD_QUERY_MONITORING_STATUS:
The health test result for IMPALAD_QUERY_MONITORING_STATUS has become bad: There are 1 error(s) seen monitoring executing queries, and 0 errors(s) seen monitoring completed queries for this role in the previous 5 minute(s). Critical threshold: any.

HIVESERVER2_SCM_HEALTH:
The health test result for HIVESERVER2_SCM_HEALTH has become bad: This role's process is starting. This role is supposed to be started.

NAME_NODE_AUDIT_HEALTH:
The health test result for NAME_NODE_AUDIT_HEALTH has become bad: There is a problem processing audits for NAMENODE.

NODE_MANAGER_WEB_METRIC_COLLECTION:
The health test result for NODE_MANAGER_WEB_METRIC_COLLECTION has become bad: The Cloudera Manager Agent is not able to communicate with this role's web server.

These health alerts occur every 2-3 days within a 2-3 hour window, creating around 5-10 tickets in each window. We also checked on the network side for an outage or glitch during those windows, but found nothing. We tried to diagnose each alert through its logs, but haven't found anything interesting.

Hence, I am looking for more guidance on what else could be checked to identify the root cause of these alerts.

Thanks
snm1523

Super Collaborator
HIVESERVER2_SCM_HEALTH:
The health test result for HIVESERVER2_SCM_HEALTH has become bad: This role's process is starting. This role is supposed to be started.

This should not happen frequently. If it does, please follow up with a support case. The issue can be resolved with the instructions in the KB article "Role Managed by Cloudera Manager Stuck in Stopping or Starting State | Configured_Status".

Super Collaborator
NAME_NODE_AUDIT_HEALTH:
The health test result for NAME_NODE_AUDIT_HEALTH has become bad: There is a problem processing audits for NAMENODE.

REGION_SERVER_AUDIT_HEALTH:
The health test result for REGION_SERVER_AUDIT_HEALTH has become bad: There is a problem processing audits for REGIONSERVER.

These alerts indicate a problem sending the audit event logs created by these roles to the Navigator Audit Server. The root cause lies either in the Navigator Audit Server or in the CM agent on the host where the role is deployed. To investigate further, review the CM agent logs on the affected host(s) and the Navigator Audit Server logs for the time when the alert was raised. Any errors or exceptions logged around that time will indicate the root cause, or at least provide pointers for further investigation.
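
As one way to narrow down the log review described above, a small Python sketch can pull out the error lines logged around the alert time. It assumes each log line starts with a timestamp like `2020-01-15 10:02:11`; adjust `TIMESTAMP_RE` and the `strptime` format to whatever your CM agent and Navigator Audit Server logs actually use. The function name, the regular expressions and the 10-minute margin are illustrative choices, not part of Cloudera Manager.

```python
import re
from datetime import datetime, timedelta

# ASSUMPTION: log lines start with "YYYY-MM-DD HH:MM:SS"; adapt to your log format.
TIMESTAMP_RE = re.compile(r"^(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2})")
# ASSUMPTION: problems are marked by these level names or by the word "Exception".
PROBLEM_RE = re.compile(r"\b(ERROR|FATAL|WARN)\b|Exception")


def lines_near_alert(lines, alert_time, margin_minutes=10):
    """Return problem lines logged within +/- margin_minutes of alert_time.

    Lines without a timestamp (for example stack trace lines) are kept
    when the timestamped line before them was kept.
    """
    margin = timedelta(minutes=margin_minutes)
    hits = []
    keep = False
    for line in lines:
        match = TIMESTAMP_RE.match(line)
        if match:
            stamp = datetime.strptime(match.group(1), "%Y-%m-%d %H:%M:%S")
            in_window = abs(stamp - alert_time) <= margin
            keep = in_window and bool(PROBLEM_RE.search(line))
        if keep:
            hits.append(line.rstrip("\n"))
    return hits
```

To use it, open the relevant log file on the affected host and pass the file object together with the alert time shown in CM, for example `lines_near_alert(open(path), datetime(2020, 1, 15, 10, 0))`, where `path` is wherever that log lives on your host. Running it against both the CM agent log and the Navigator Audit Server log for the same alert time makes it easier to see which side reported the failure first.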

Super Collaborator
IMPALAD_QUERY_MONITORING_STATUS:
The health test result for IMPALAD_QUERY_MONITORING_STATUS has become bad: There are 1 error(s) seen monitoring executing queries, and 0 errors(s) seen monitoring completed queries for this role in the previous 5 minute(s). Critical threshold: any.

Please see the KB article "Impala | IMPALAD_QUERY_MONITORING_STATUS has become bad" for the resolution.

Expert Contributor

Hello gzigldrum,

Thank you for the guidance. This certainly helps.

I will go through both of the KB articles you shared and also review the Navigator Audit Server and CM agent logs for the audit health warnings.

I will come back with the findings.

In the meantime, any luck with HIVEMETASTORE_CANARY_HEALTH and NODE_MANAGER_WEB_METRIC_COLLECTION?

Thanks
snm1523