Best Practice to monitor alerts from Cloudera Manager

Expert Contributor

Hello All,

 

We use HP OVO to monitor all alerts from Cloudera Manager and raise a BMC Remedy incident for each one. However, we are at times flooded with these alert tickets, and when we check the cluster immediately afterwards, everything looks green. When we dig into the logs for the respective alert, we find nothing major and the service is healthy. The problems appear to heal by themselves.

This flood of tickets raises concerns with management, who ask for a root cause. We do not have one, and we cannot find one because the issue resolved itself without leaving any traces. I am looking for recommendations / best practices on how this should be set up so that we get only the alerts that actually require action.

Is there any configuration we need to do in CM, or is there something that can be configured in HP OVO? Any suggestions would be a great help. Suggestions on what to check while troubleshooting these alerts would also be welcome.

 

Thanks

snm1523

Super Collaborator

As a first step, please verify your settings by following the instructions in this KB article.

In CM -> Administration -> Alerts, do you have these values set?

Alert On Transitions Out of Alerting Health: No
Health Alert Threshold: Bad

These are the default values. If, for example, the threshold is set to "Concerning", please revert it to "Bad".

Expert Contributor
Hello Gzigldrum,

Thank you for the reply.

However, the instructions in the KB article and the values you suggested are already applied.

Please advise.

Thanks
Snm1523

Super Collaborator

Thanks for confirming. Then the alert-related settings appear to be correct.

The way forward is to reduce the number of alerts raised for this cluster, either by resolving the underlying issues or, where possible, by adjusting the corresponding alert thresholds (or even disabling a health test where it is not needed). What kind of alerts do you see most frequently? Please share examples.
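
To answer the question of which alerts fire most often, one option is to tally a list of recent alerts by service and health test. The sketch below is only an illustration: the record layout (the `service` and `health_test` fields) and the sample values are hypothetical and do not reflect any Cloudera Manager or HP OVO export format.

```python
from collections import Counter


def most_frequent_alerts(alerts, top_n=5):
    """Return the top_n (service, health_test) pairs with their alert counts.

    `alerts` is an iterable of dicts with hypothetical keys
    "service" and "health_test" (assumed layout, not a CM format).
    """
    counts = Counter((a["service"], a["health_test"]) for a in alerts)
    return counts.most_common(top_n)


# Hypothetical sample data, for illustration only.
sample_alerts = [
    {"service": "hdfs", "health_test": "example_test_a"},
    {"service": "yarn", "health_test": "example_test_b"},
    {"service": "hdfs", "health_test": "example_test_a"},
    {"service": "hive", "health_test": "example_test_c"},
    {"service": "yarn", "health_test": "example_test_b"},
    {"service": "hdfs", "health_test": "example_test_a"},
]

for (service, test), count in most_frequent_alerts(sample_alerts, top_n=3):
    print(f"{count:3d}  {service}  {test}")
```

The health tests at the top of such a list are the natural candidates for a closer look, a threshold adjustment, or disabling.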

Expert Contributor

Thank you for the quick replies, Gzigldrum.

 

I have created a separate post explaining the alerts we get. The link is below.

 

Flooded with failed health test alerts

 

For every alert, we get an incident in BMC, because HP OVO is configured to generate an incident and assign it to us.

Every morning we start with around 15-20 such incidents, followed by another 15-20 over the rest of the day.
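
As an illustration of the pattern described here (every transient alert becomes an incident, even when the service recovers on its own), a hold-down filter in front of the ticketing step could look roughly like the sketch below. It is not an HP OVO or Cloudera Manager feature; the event layout, the function name, and the 300-second hold time are all assumptions made for the example.

```python
def alerts_worth_a_ticket(events, now, hold_seconds=300):
    """Return (check, bad_since) pairs that stayed bad for >= hold_seconds.

    `events` is a list of (timestamp_seconds, check_name, state) tuples,
    sorted by timestamp, where state is "BAD" or "GOOD" (assumed layout).
    `now` is the time at which still-open alerts are evaluated.
    """
    bad_since = {}  # check name -> timestamp of the first BAD event
    tickets = []
    for ts, check, state in events:
        if state == "BAD":
            # Remember only the start of the bad period.
            bad_since.setdefault(check, ts)
        elif check in bad_since:
            # Recovered: ticket only if it was bad long enough.
            start = bad_since.pop(check)
            if ts - start >= hold_seconds:
                tickets.append((check, start))
    # Checks that are still bad at `now`.
    for check, start in bad_since.items():
        if now - start >= hold_seconds:
            tickets.append((check, start))
    return tickets


# Hypothetical events: "a" self-heals after 60 s, "b" stays bad for 600 s,
# "c" is still bad after 400 s, "d" has only been bad for 100 s.
events = [
    (0, "a", "BAD"),
    (60, "a", "GOOD"),
    (100, "b", "BAD"),
    (700, "b", "GOOD"),
    (800, "c", "BAD"),
    (1100, "d", "BAD"),
]
print(alerts_worth_a_ticket(events, now=1200))
```

With a filter like this, the short-lived alert on "a" would never reach the incident queue, while the longer outages on "b" and "c" still would. The trade-off is that real problems are reported up to `hold_seconds` later.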

 

It would be great if you could post some suggestions for troubleshooting those alerts and fixing them permanently.

 

Note: I will accept your previous reply as the solution to this post.

 

Thanks
snm1523