We are running CM+CDH 5.7.0 on a cluster with 500+ nodes. Service Monitor is
reporting "XX messages dropped by the role stage in the service monitor pipeline
over 5 minutes". This usually happens when the
Role stage Queue Size is around 2K.
How can I stop Service Monitor from checking HBase regions? I unchecked the canary option in the HBase service monitoring configuration, but that did not stop Service Monitor from checking HBase regions. We have a lot of regions (200K), and that might be the problem.
When the region count is below 200K everything is OK, but above 250K this problem happens consistently.
Hi, I think I found a solution, but I get exceptions in Service Monitor when I apply it.
The property is "firehose.update.region.poller.frequency" and its default value is "1". If I change it to 0, the HBase region checks stop, but Service Monitor starts showing "question marks" for service states and exceptions appear in the Service Monitor log.
What is the correct way to turn that off?
Hey @scobanx, how did you set that property? And can you pls also paste the exception log here?
Setting the property to 0 in the Service Monitor safety valve is a reasonable approach to try. The service state turns 'UNKNOWN' because exceptions in the background prevent Service Monitor from getting the HBase state. We need to see the stack trace to analyze it.
I set the firehose.update.region.poller.frequency value through the Service Monitor advanced configuration snippet. This setting is not an official one, though. I found it while digging through firehose.jar.
We have since reverted that change and blocked traffic from Service Monitor to the region servers with a firewall rule. This way we can go up to 300K regions without problems.
I also want to ask: with 600 nodes, is it normal for Service Monitor to use 8-10 CPU cores (2.2 GHz each)?
I would appreciate any advice on reducing Service Monitor heap (64 GB configured) and CPU usage.
Can you try tuning the 'Service Monitor Derived Configs Advanced Configuration Snippet (Safety Valve)' for 'Cloudera Management Service' and setting 'hbase_table_and_region_info_task_frequency_sec' to 0?
The property you configured through Service Monitor safety valve is slightly different in this case.
Hi, the derived configs are on the HBase configuration page, right? Not on the Service Monitor configuration page? (I will check tomorrow at work.) I ask because I remember already trying hbase_table_and_region_info_task_frequency_sec set to 0, and it did not help. Do we need to restart the HBase service for this value to take effect? I cannot restart HBase easily because it takes 2 hours to complete region assignments.
We have nearly 200 tables and a total of 200K regions on 600 nodes. HBase, Hive, Oozie, and YARN are the services we use.
As I mentioned, it is the 'Service Monitor Derived Configs Advanced Configuration Snippet (Safety Valve)' configuration for 'Cloudera Management Service':
1. Click 'Cloudera Management Service' on your home page (bottom left)
2. Click 'Configuration'
3. Search for 'Service Monitor Derived Configs Advanced Configuration Snippet'.
4. Add the property and save the changes
5. Restart Cloudera Manager Service Monitor.
It's not an HBase configuration. You don't need to restart HBase.
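For anyone who would rather script step 5 than click through the UI, here is a minimal sketch in Python that builds a restart request for the Service Monitor role through the Cloudera Manager REST API. Nothing in it comes from this thread's cluster: the host name, the credentials, and the role name are placeholders, and the API version (v12) is an assumption for CM 5.7, so check it against your own installation. The function only builds the request; sending it is left commented out.

```python
import base64
import json
import urllib.request

API_VERSION = "v12"  # assumption: the API version exposed by CM 5.7


def build_restart_request(cm_base_url, role_name, user, password):
    """Build (but do not send) a POST that restarts one Cloudera Management Service role."""
    url = f"{cm_base_url.rstrip('/')}/api/{API_VERSION}/cm/service/roleCommands/restart"
    # The request body is a list of role names to restart.
    body = json.dumps({"items": [role_name]}).encode("utf-8")
    token = base64.b64encode(f"{user}:{password}".encode("utf-8")).decode("ascii")
    return urllib.request.Request(
        url,
        data=body,
        method="POST",
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Basic {token}",
        },
    )


if __name__ == "__main__":
    # Placeholder values; look up the real role name on your own deployment.
    req = build_restart_request(
        "http://cm-host.example.com:7180",
        "mgmt-SERVICEMONITOR-example",
        "admin",
        "admin",
    )
    print(req.get_method(), req.full_url)
    # To actually send it:
    # urllib.request.urlopen(req)
```

The role name differs per deployment; it can be looked up under the roles of the Cloudera Management Service before building the request.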
I would be interested in seeing the Service Monitor Java heap and related charts before and after this change. Please keep me posted.
Hi, I set the value as described and restarted Service Monitor. I did not disable the firewall drop rule on the Service Monitor host. When I look at the Service Monitor logs, I see it is still trying to start the hbase_HBASE_TABLE_AND_REGION_INFO_TASK. Any other suggestions?
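To check whether the task really is still being scheduled after a restart, a small script can count the log lines that mention it. This is only a sketch: the task name is the one quoted above, but the log line layout (a standard log level somewhere on each line) and the log file location are assumptions, so adapt both to your install.

```python
import re
import sys
from collections import Counter

TASK_NAME = "hbase_HBASE_TABLE_AND_REGION_INFO_TASK"  # name quoted in the post above

# Assumption: each log line carries a standard log level as a separate word.
LEVEL_RE = re.compile(r"\b(TRACE|DEBUG|INFO|WARN|ERROR|FATAL)\b")


def count_task_lines(lines, task_name=TASK_NAME):
    """Count lines mentioning the task, grouped by log level ("UNKNOWN" if none is found)."""
    counts = Counter()
    for line in lines:
        if task_name not in line:
            continue
        match = LEVEL_RE.search(line)
        counts[match.group(1) if match else "UNKNOWN"] += 1
    return counts


if __name__ == "__main__":
    # Pass the Service Monitor log path as the first argument; its location varies by install.
    if len(sys.argv) > 1:
        with open(sys.argv[1], errors="replace") as fh:
            print(dict(count_task_lines(fh)))
```

Running it against the log from before and after the restart gives a quick comparison of how often the task shows up.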
@xzheng, may I know the exact values, and how they need to be entered, in Service-Wide / Monitoring under the property Service Monitor Derived Configs Advanced Configuration Snippet (Safety Valve)?
You mentioned setting 'hbase_table_and_region_info_task_frequency_sec' to 0. How does this need to be entered?
I am facing the same issue: I am getting lots of email alerts from Cloudera Manager with the subject "[Cloudera Alert] HBase region health canary reported 0.14% of all regions are...". I believe these can be ignored. If so, please help me with the exact parameter/value to enter in the property.
Regarding CPU/memory consumption: how many HBase tables/regions do you have in your 600-node cluster? Can you also list the services you have in your cluster?