
Service monitor keeps crashing

Hi, 

 

I've been using Cloudera Manager for the last year with a cluster of 20+ nodes. Recently I started to see heap memory issues in the Service Monitor role. I increased the heap size from 3 to 4 GB, then from 4 to 5 GB, and then from 5 to 6 GB, but the Service Monitor still sometimes crashes and restarts. While that happens, the entire dashboard looks unhealthy. What do I need to do to fix this?

The logs are:

2021-04-26 16:10:34,938 WARN com.cloudera.enterprise.debug.JvmPauseMonitor: Detected pause in JVM or host machine (e.g. a stop the world GC, or JVM not scheduled): paused approximately 20583ms: GC pool 'G1 Young Generation' had collection(s): count=2 time=182ms, GC pool 'G1 Old Generation' had collection(s): count=1 time=20877ms
2021-04-26 16:11:34,862 WARN com.cloudera.enterprise.debug.JvmPauseMonitor: Detected pause in JVM or host machine (e.g. a stop the world GC, or JVM not scheduled): paused approximately 19870ms: GC pool 'G1 Young Generation' had collection(s): count=2 time=131ms, GC pool 'G1 Old Generation' had collection(s): count=1 time=20228ms
2021-04-26 16:12:35,132 WARN com.cloudera.enterprise.debug.JvmPauseMonitor: Detected pause in JVM or host machine (e.g. a stop the world GC, or JVM not scheduled): paused approximately 20427ms: GC pool 'G1 Young Generation' had collection(s): count=3 time=149ms, GC pool 'G1 Old Generation' had collection(s): count=1 time=20733ms
2021-04-26 16:13:36,415 WARN com.cloudera.enterprise.debug.JvmPauseMonitor: Detected pause in JVM or host machine (e.g. a stop the world GC, or JVM not scheduled): paused approximately 19008ms: GC pool 'G1 Young Generation' had collection(s): count=1 time=104ms, GC pool 'G1 Old Generation' had collection(s): count=1 time=19381ms
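
The four warnings above are about one minute apart, and each reports a pause of roughly 20 seconds, almost all of it spent in 'G1 Old Generation'. As a rough diagnostic sketch (not a Cloudera tool), the Python snippet below parses the JvmPauseMonitor lines and estimates the average pause and the share of the logged window the JVM spent paused. It assumes the log format shown here and that each warning is written when its pause ends; parse_pauses and pause_summary are hypothetical helper names.

```python
import re
from datetime import datetime

# Abbreviated copies of the four JvmPauseMonitor warnings quoted above;
# only the timestamp and the reported pause length are kept.
LOG_LINES = [
    "2021-04-26 16:10:34,938 WARN JvmPauseMonitor: paused approximately 20583ms",
    "2021-04-26 16:11:34,862 WARN JvmPauseMonitor: paused approximately 19870ms",
    "2021-04-26 16:12:35,132 WARN JvmPauseMonitor: paused approximately 20427ms",
    "2021-04-26 16:13:36,415 WARN JvmPauseMonitor: paused approximately 19008ms",
]

# Captures the log timestamp and the "paused approximately <N>ms" value.
PAUSE_RE = re.compile(
    r"^(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2},\d{3}) .*paused approximately (\d+)ms"
)


def parse_pauses(lines):
    """Return (timestamp, pause_ms) pairs for every matching log line."""
    pauses = []
    for line in lines:
        match = PAUSE_RE.search(line)
        if match:
            ts = datetime.strptime(match.group(1), "%Y-%m-%d %H:%M:%S,%f")
            pauses.append((ts, int(match.group(2))))
    return pauses


def pause_summary(pauses):
    """Return (average pause in ms, fraction of the window spent paused).

    Assumption: each warning is logged when its pause ends, so only the
    pauses reported after the first timestamp fall inside the window.
    """
    avg_ms = sum(ms for _, ms in pauses) / len(pauses)
    window_s = (pauses[-1][0] - pauses[0][0]).total_seconds()
    paused_s = sum(ms for _, ms in pauses[1:]) / 1000.0
    return avg_ms, paused_s / window_s


avg_ms, fraction = pause_summary(parse_pauses(LOG_LINES))
print(f"average pause: {avg_ms:.0f} ms, time paused: {fraction:.0%}")
```

For these four lines it prints "average pause: 19972 ms, time paused: 33%". A JVM frozen for about a third of the time in Old Generation collections suggests the Service Monitor's live data no longer fits comfortably in the heap, rather than an occasional GC hiccup.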

 

Could you please help me with this?

1 ACCEPTED SOLUTION

Expert Contributor

Hello

 

Depending on the cluster size and the number of monitored entities, different resource allocations are recommended.

Please refer to the link below and check whether your resource allocation is in line with the recommendations:

https://docs.cloudera.com/cdp-private-cloud/latest/release-guide/topics/cdpdc-service-monitor-requir...



Thank you @Daming Xue. This is a perfect lead.

After checking, I can see that Cloudera Manager is monitoring around 3,819,926 entities coming from our three Kafka clusters. Can you help us avoid creating so many entities? That would stop the alerts that keep triggering and keep the Service Monitor roles from crashing.

Right now, the heap size is 10 GB and the non-heap memory size is 12 GB, but we still hit the heap memory issue. So far we have increased the heap size from 3 GB to 10 GB. What would be an ideal way to fix this?

We even added these parameters to the Service Monitor and restarted the service, but this didn't help:

-XX:+UseG1GC -XX:-UseConcMarkSweepGC -XX:-UseParNewGC

And the error from the Service Monitor log is:

2021-05-02 08:24:22,106 WARN com.cloudera.enterprise.debug.JvmPauseMonitor: Detected pause in JVM or host machine (e.g. a stop the world GC, or JVM not scheduled): paused approximately 33651ms: GC pool 'G1 Young Generation' had collection(s): count=2 time=195ms, GC pool 'G1 Old Generation' had collection(s): count=1 time=33878ms

Expert Contributor

Hello

 

Please look out for the Cloudera Manager 7.4.1 update, which includes the fix for the issue you are facing.

The issue is caused by producer client IDs that are not configured correctly, which generate a huge number of metrics from anonymous producers.

The solution proposed in CM 7.4.1 is to make the Kafka producer metric whitelist configurable in the CM UI.
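
Following the explanation above, one way to cut down the number of distinct producer identities is to give every producer application a fixed, descriptive client.id instead of leaving it unset. The Python sketch below is an illustration under stated assumptions, not the CM 7.4.1 fix (which is the configurable whitelist): producer_config and find_anonymous are hypothetical helpers, the dictionary keys are the standard Kafka producer setting names, and the "producer-<N>" pattern assumes the Apache Kafka Java client's default naming when client.id is empty. With that default, an application that creates a new producer for every request could report an ever-growing series of IDs (producer-1, producer-2, ...).

```python
import re

# Assumption: the Apache Kafka Java client names a producer "producer-<N>"
# when client.id is left unset; other client libraries may differ.
DEFAULT_ID_RE = re.compile(r"^producer-\d+$")


def producer_config(app_name, bootstrap_servers):
    """Build producer settings with a fixed, descriptive client.id.

    The keys are standard Kafka producer configuration names; app_name is a
    hypothetical per-application label that should stay the same across
    restarts and producer instances.
    """
    if not app_name:
        raise ValueError("app_name must be a non-empty, stable identifier")
    return {
        "bootstrap.servers": bootstrap_servers,
        "client.id": app_name,
    }


def find_anonymous(client_ids):
    """Return the client IDs that look auto-generated, sorted and de-duplicated."""
    return sorted({cid for cid in client_ids if DEFAULT_ID_RE.match(cid)})


# Made-up sample data for illustration only.
observed = ["orders-service", "producer-1", "producer-17", "producer-1", "billing"]
print(producer_config("orders-service", "broker1:9092"))
print(find_anonymous(observed))
```

Running find_anonymous over a list of client IDs seen on your brokers (the list here is invented) shows which producers still rely on the default name and are candidates for a stable client.id.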

Thanks @Daming Xue. I'll surely look into upgrading the CM version.

But for the time being, is there any other solution? This is in production, and we don't want to take the risk of directly upgrading the production Cloudera Manager service.

Many entities are being monitored (e.g. Impala, YARN), but we only use Kafka and MirrorMaker. Is there any way to disable them, or are these services necessary for running Kafka clusters?

Expert Contributor

Hello

 

If the cluster is critical to your business, you should consider getting a subscription from Cloudera. For the issue you are facing, Cloudera can create a patch by backporting the fix to an earlier CM version.
