About snm1523

snm1523 · ‎06-03-2019

Hello Harsh, Thank you for the help on this. I was able to identify some information that helped here. I will come back if I need further help, and will accept your reply as the solution. 🙂 Thanks snm1523

snm1523 · ‎03-07-2019

Thank you for the reply, Harsh J. Could you please help me with a quick command / script that uses 'lsof' to identify avoidable open files or files stuck in some process, and guide me on further actions to take? I tried running a generic 'lsof | grep java', but it returned a huge list of files and it was difficult to pick out the relevant information. Thanks snm1523
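One way to make a huge 'lsof' listing easier to read is to aggregate it instead of grepping it. The sketch below is only an illustration, not a reply from the thread: it assumes the machine-readable output of `lsof -F pct` (one field per line, prefixed with `p` for PID, `c` for command, `f` for descriptor, `t` for file type), and the function name `summarize_lsof` is hypothetical. It counts only numeric descriptors, since entries such as `cwd`, `txt`, and `mem` do not consume file descriptors.

```python
from collections import Counter


def summarize_lsof(field_output):
    """Count open descriptors per (pid, command, type) from `lsof -F pct` text.

    Only numeric descriptors are counted; entries such as cwd, txt, and mem
    are skipped because they do not use a file descriptor slot.
    """
    counts = Counter()
    pid = None
    command = None
    is_descriptor = False
    for line in field_output.splitlines():
        if not line:
            continue
        tag, value = line[0], line[1:]
        if tag == "p":            # start of a new process set
            pid, command = int(value), None
            is_descriptor = False
        elif tag == "c":          # command name of the current process
            command = value
        elif tag == "f":          # start of a new file set
            is_descriptor = value.isdigit()
        elif tag == "t" and is_descriptor:
            counts[(pid, command, value)] += 1
    return counts
```

Feeding it the text produced by something like `lsof -F pct -c java` and printing `counts.most_common(20)` gives a short ranking of which process holds the most regular files (REG), sockets (IPv4/IPv6), and so on, which is usually a better starting point than the raw listing.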

snm1523 · ‎03-06-2019

Hello All, I am looking for best practices or recommendations for choosing a value for the rlimit_fds (Maximum Process File Descriptors) property. Currently, it is set to the default, i.e. 32768, and we are getting File Descriptor Threshold alerts. We would first like to find the best possible value for rlimit_fds. Is there a formula, a practice, or a few checks that can be performed to determine it? Thanks snm1523
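Before changing the limit, it can help to measure how close a process actually gets to it. The following is a minimal sketch under stated assumptions, not a sizing formula: it assumes a Linux host where `/proc/<pid>/fd` and `/proc/<pid>/limits` are readable by the current user, and the helper names (`parse_max_open_files`, `fd_usage_percent`, `inspect_pid`) are hypothetical.

```python
import os


def parse_max_open_files(limits_text):
    """Return the soft 'Max open files' limit from /proc/<pid>/limits text.

    Returns None if the line is missing or the limit is 'unlimited'.
    """
    for line in limits_text.splitlines():
        if line.startswith("Max open files"):
            soft = line.split()[3]   # columns: Max open files <soft> <hard> files
            return None if soft == "unlimited" else int(soft)
    return None


def fd_usage_percent(open_fds, soft_limit):
    """Percentage of the soft limit currently in use."""
    return 100.0 * open_fds / soft_limit


def inspect_pid(pid):
    """Read the current descriptor count and soft limit for one process (Linux only)."""
    open_fds = len(os.listdir(f"/proc/{pid}/fd"))
    with open(f"/proc/{pid}/limits") as fh:
        soft_limit = parse_max_open_files(fh.read())
    return open_fds, soft_limit
```

Sampling `inspect_pid` for the affected role at intervals over a few days shows whether usage plateaus well below 32768 or keeps climbing, which helps distinguish an undersized limit from a descriptor leak.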

snm1523 · ‎02-25-2019

Hello gzigldrum, Thank you for the guidance. This certainly helps. I will go through both the KB articles you shared and also review the Navigator Server and CM Agent logs for audit health warnings. I will come back with the findings. In the meantime, any luck with HIVEMETASTORE_CANARY_HEALTH and NODE_MANAGER_WEB_METRIC_COLLECTION? Thanks snm1523

snm1523 · ‎02-22-2019

Hello Gzigldrum, Thank you for the reply. Below are the exact health messages from CM for each alert:
HIVEMETASTORE_CANARY_HEALTH: The health test result for HIVEMETASTORE_CANARY_HEALTH has become bad: The Hive Metastore canary failed to create a database. OR The health test result for HIVEMETASTORE_CANARY_HEALTH has become bad: The Hive Metastore canary failed to create a partition in the table it created. OR The health test result for HIVEMETASTORE_CANARY_HEALTH has become bad: The Hive Metastore canary failed to drop the table it created.
REGION_SERVER_AUDIT_HEALTH: The health test result for REGION_SERVER_AUDIT_HEALTH has become bad: There is a problem processing audits for REGIONSERVER.
IMPALAD_QUERY_MONITORING_STATUS: The health test result for IMPALAD_QUERY_MONITORING_STATUS has become bad: There are 1 error(s) seen monitoring executing queries, and 0 error(s) seen monitoring completed queries for this role in the previous 5 minute(s). Critical threshold: any.
HIVESERVER2_SCM_HEALTH: The health test result for HIVESERVER2_SCM_HEALTH has become bad: This role's process is starting. This role is supposed to be started.
NAME_NODE_AUDIT_HEALTH: The health test result for NAME_NODE_AUDIT_HEALTH has become bad: There is a problem processing audits for NAMENODE.
NODE_MANAGER_WEB_METRIC_COLLECTION: The health test result for NODE_MANAGER_WEB_METRIC_COLLECTION has become bad: The Cloudera Manager Agent is not able to communicate with this role's web server.
These health alerts occur every 2-3 days within a window of 2-3 hours, creating around 5-10 tickets each time. We have also checked on the network side to verify whether there was an outage or a glitch in those windows, but found nothing. We have tried to diagnose each alert through its logs, but haven't found anything interesting. Hence, we are looking for more guidance on what else could be checked to identify a root cause. Thanks snm1523

snm1523 · ‎02-21-2019

Thank you for the quick replies, Gzigldrum. I have created a separate post explaining the alerts we get: "Flooded with failed health test alerts". Every alert becomes an incident in BMC, because HP OVO is configured to generate an incident and assign it to us. Every morning we start with around 15-20 such incidents, followed by another 15-20 over the rest of the day. It would be great if you could post some suggestions to troubleshoot those alerts and help fix them permanently. Note: I will accept your previous reply as the solution to this post. Thanks snm1523

snm1523 · ‎02-21-2019

Hello Gzigldrum, Thank you for the reply. However, the instructions provided in the KB article and values suggested by you are already applied. Please advise. Thanks Snm1523

snm1523 · ‎02-21-2019

Hello All, I need some suggestions on the exact reason why the health tests below fail multiple times, almost 3 times every day, generating at least 3-4 alerts each time. 1. HIVEMETASTORE_CANARY_HEALTH 2. REGION_SERVER_AUDIT_HEALTH 3. IMPALAD_QUERY_MONITORING_STATUS 4. HIVESERVER2_SCM_HEALTH 5. NAME_NODE_AUDIT_HEALTH 6. NODE_MANAGER_WEB_METRIC_COLLECTION Any help or suggestion to permanently fix these alerts would be of great help. Guidance on how to reach the root cause would also be helpful. Thanks snm1523

snm1523 · ‎02-21-2019

Hello All, We use HP OVO to monitor all alerts from Cloudera Manager and raise a BMC Remedy incident for each one. However, we are at times flooded with these alert tickets, and when we check the cluster immediately afterwards, everything looks green. When we dig into the logs for the respective alert, we find nothing major and the service is green. It looks like it heals by itself. This flood of tickets raises concerns with management and prompts questions about a root cause, which we cannot provide because the issue fixed itself and left no traces behind. I am looking for recommendations / best practices on how exactly this should be set up, so that we get only the actual / required alerts. Is there any configuration we need to do in CM, or is there something that can be configured in HP OVO? Any suggestions would be of great help. Suggestions on what to check while troubleshooting these alerts would also be welcome. Thanks snm1523
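The self-healing pattern described above is often handled by ticketing only on alerts that stay bad for a minimum duration. The sketch below illustrates that idea in isolation. It is not a Cloudera Manager or HP OVO feature: the event tuple format, the function name `sustained_alerts`, and the threshold are all hypothetical, and the events are assumed to be sorted by timestamp.

```python
def sustained_alerts(events, min_bad_seconds, now):
    """Return (test_name, bad_since) for tests that stayed BAD long enough.

    events: iterable of (timestamp_seconds, test_name, status), sorted by
    timestamp, where status is "BAD" or "GOOD" (hypothetical format).
    A test that recovers before min_bad_seconds has elapsed is ignored.
    """
    bad_since = {}
    sustained = []
    for ts, name, status in events:
        if status == "BAD":
            bad_since.setdefault(name, ts)      # remember first BAD only
        elif name in bad_since:
            start = bad_since.pop(name)         # test recovered
            if ts - start >= min_bad_seconds:
                sustained.append((name, start))
    for name, start in bad_since.items():       # tests still BAD at 'now'
        if now - start >= min_bad_seconds:
            sustained.append((name, start))
    return sustained
```

With a threshold of, say, 300 seconds, a canary that fails once and recovers a minute later produces no ticket, while a test that stays bad for 15 minutes still does. The trade-off is that genuine incidents are reported later by the length of the threshold.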

snm1523 · ‎02-18-2019

Thank you gzigldrum. This certainly helps. Regards, snm1523

Last Visited	‎11-07-2025 08:17 AM

Member Since	‎10-29-2015 07:36 PM
Posts	128
Kudos received	31

Cloudera Community

Re: YARN and HDFS monitoring via Grafana

Re: Enable Admin account for Cloudera Manager

Re: Datanode not starting: SIGTERM error

Re: MKDirs failed to create file

Re: Calculate File Descriptor in HBase

Re: Calculate File Descriptor in HBase

Calculate File Descriptor in HBase

Re: Flooded with failed health test alerts

Re: Flooded with failed health test alerts

Re: Best Practice to monitor alerts from Cloudera ...

Re: Best Practice to monitor alerts from Cloudera ...

Flooded with failed health test alerts

Best Practice to monitor alerts from Cloudera Mana...

Re: Cloudera Manager user activity audit