02-21-2019 03:54 AM - edited 02-21-2019 06:23 AM
We have an HP OVO monitoring tool that watches all alerts from Cloudera Manager and raises a BMC Remedy incident for each one. At times we are flooded with these alert tickets, yet when we check the cluster right away, everything looks green. When we dig into the logs for the respective alert, we find nothing major and the service is green. The issue appears to heal by itself.
This flood of tickets raises concerns with management, who ask for a root cause that we do not have and cannot find, because the issue fixed itself without leaving any special traces. I am looking for recommendations / best practices on how this should be set up so that we get only the alerts that actually require action.
Is there any configuration we need to do in CM, or is there something that can be configured in HP OVO? Any suggestions would be a great help. Suggestions on what to check while troubleshooting these alerts would also be welcome.
02-21-2019 06:21 AM
As a first step, please verify your settings following instructions in this KB article.
In CM -> Administration -> Alerts, do you have these values set:
Alert On Transitions Out of Alerting Health: No
Health Alert Threshold: Bad
These are the default values. If, for example, the threshold is set to "Concerning", please revert it to "Bad".
02-21-2019 06:41 AM
02-21-2019 06:51 AM
Thanks for confirming. The alert-related settings appear to be correct, then.
The way forward is to reduce the number of alerts raised for this cluster, either by resolving the underlying issues or by adjusting the corresponding alert thresholds where possible (or even disabling the health test where it is not needed). What kinds of alerts do you see most frequently? Please share examples.
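To answer the "most frequent" question, it can help to tally the incidents by service and health test before deciding which thresholds to look at first. The following is only a minimal sketch: the record layout, the field names (`service`, `health_test`) and the sample values are hypothetical and do not reflect any Cloudera Manager or HP OVO export format.

```python
from collections import Counter

def most_frequent_alerts(alerts, top_n=5):
    """Count alerts per (service, health_test) pair and return the top_n pairs."""
    counts = Counter((a["service"], a["health_test"]) for a in alerts)
    return counts.most_common(top_n)

# Hypothetical sample data; replace with records taken from your own incident list.
sample_alerts = [
    {"service": "service_a", "health_test": "test_x"},
    {"service": "service_b", "health_test": "test_y"},
    {"service": "service_a", "health_test": "test_x"},
    {"service": "service_a", "health_test": "test_z"},
    {"service": "service_b", "health_test": "test_y"},
    {"service": "service_a", "health_test": "test_x"},
]

for (service, health_test), count in most_frequent_alerts(sample_alerts):
    print(f"{service} / {health_test}: {count}")
```

The pairs at the top of the list are the ones worth checking first, either for a real underlying issue or for a threshold that is too tight.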
02-21-2019 07:11 AM
Thank you for the quick replies, Gzigldrum.
I have created a separate post explaining the alerts we get. The link is below.
For every alert we get an incident in BMC, because HP OVO is configured to generate an incident and assign it to us.
Every morning we start with around 15-20 such incidents, followed by another 15-20 during the rest of the day.
It would be great if you could post some suggestions for troubleshooting those alerts and fixing them permanently.
Note: I will accept your previous reply as the solution to this post.