Support Questions

Find answers, ask questions, and share your expertise

"SERVICE_MONITOR_PAUSE_DURATION has become bad" despite heap increase

Explorer

I am seeing frequent Cloudera Manager Service Monitor outages:

 

SERVICE_MONITOR_PAUSE_DURATION has become bad: Average time spent paused was 39.5 second(s) (65.76%) per minute over the previous 5 minute(s).

 

despite increasing the heap size to 7g and the 'off-heap' size to 24g. The machine often sees a high load, around 90 on a 24-core machine (a NodeManager also runs on the same host), so I suspect SM is starved of CPU when doing aggregation. The process regularly has over 700 files open.
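To check the CPU-starvation and file-descriptor angle, here is a rough sketch of what I run on the SM host. The 'firehose' process pattern is an assumption from my setup (the Service Monitor runs as a firehose JVM on my hosts); adjust it if yours differs.

```shell
#!/bin/sh
# Sketch: compare 1-minute load against core count and count the SM
# process's open file descriptors. The 'firehose' pattern is an
# assumption from my cluster -- adjust to match your SM process.
CORES=$(nproc)
LOAD=$(cut -d' ' -f1 /proc/loadavg)
echo "load=${LOAD} cores=${CORES}"

PID=$(pgrep -f firehose | head -n1)
if [ -n "$PID" ]; then
  # Each entry in /proc/<pid>/fd is one open file descriptor.
  echo "SM pid=${PID} open_fds=$(ls "/proc/${PID}/fd" 2>/dev/null | wc -l)"
else
  echo "Service Monitor (firehose) process not found on this host"
fi
```

A sustained load well above the core count here is what makes me suspect scheduling starvation rather than the JVM itself.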

 

I am keen to fix this because it causes data loss in the time series: SM pulls data and, at times, misses data points for over 15 minutes.

 

The WARN: 

AggregatingTimeSeriesStore: run duration exceeded desired period

is correlated with the above. 

 

Is there a documented procedure to move Service Monitor to another machine while keeping existing data? 

 

Perhaps like:

0. Stop SM to quiesce changes to /var/lib/cloudera-service-monitor/ts/
1. Using CM, redefine SM on another host
2. Move the /var/lib/cloudera-service-monitor/ts/ contents to the new host before starting SM
3. Start SM

 

SM uses LevelDB, but I don't know its internals or whether /var/lib/cloudera-service-monitor/ts/ can simply be moved. I don't want to lose the month of history I have.
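In case it helps, the procedure I have in mind can be sketched as a script. The data path is from my cluster (check your Service Monitor Storage Directory setting in CM), the destination host name is a hypothetical placeholder, and this is not an official Cloudera procedure:

```shell
#!/bin/sh
# Sketch of the proposed SM move. SRC_DIR is my cluster's Service
# Monitor storage directory; DEST_HOST is a hypothetical placeholder.
SRC_DIR="${SRC_DIR:-/var/lib/cloudera-service-monitor}"
DEST_HOST="${DEST_HOST:-new-sm-host.example.com}"

# Steps 0 and 1 happen in CM: stop the SM role, then re-assign it to
# the new host (delete the role instance, add it on DEST_HOST).
if [ -d "$SRC_DIR" ]; then
  # Step 2: copy the LevelDB directories with permissions and
  # ownership preserved, while SM is stopped so files are quiescent.
  rsync -a "${SRC_DIR}/" "${DEST_HOST}:${SRC_DIR}/"
  echo "copied ${SRC_DIR} to ${DEST_HOST}"
else
  echo "no ${SRC_DIR} on this host; run this on the old SM host"
fi
# Step 3: start the SM role on the new host from CM.
```

The key assumption is that a stopped SM leaves the LevelDB files in a consistent state that a plain copy preserves; I would still take a backup of the directory before deleting anything on the old host.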

 

1 ACCEPTED SOLUTION

Explorer

Update: I moved SM to a host with a typical load of 7-8 instead of 24. After a day on the new machine, there have been no alerts about SM being slow and no gaps in the charts.

 

Conclusion: the problem was host load; SM works best on a lightly loaded machine.

 


3 REPLIES

Explorer

Some more info: 

 

I see WARNs like:

JvmPauseMonitor: Detected pause in JVM or host machine (e.g. a stop the world GC, or JVM not scheduled): paused approximately 28577ms

but jstat -gcutil shows:

  S0     S1     E      O      M     CCS    YGC     YGCT    FGC    FGCT     GCT   

  0.00  63.51  80.24   7.41  97.94  94.86   5073  347.717     6    1.950  349.668

which shows the old generation is only 7.41% used, so the JVM is not running out of heap. That leaves "JVM not scheduled", i.e. the host failing to schedule the process, as the likely condition.
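For reference, the table above is jstat -gcutil output; this is roughly how I sample it (the firehose process pattern is again an assumption from my setup):

```shell
#!/bin/sh
# Sample GC counters from the Service Monitor JVM. In the output, O is
# old-gen occupancy %, FGC/FGCT are full-GC count and total time; low
# O with few full GCs rules out heap exhaustion as the pause cause.
PID=$(pgrep -f firehose | head -n1)
if [ -n "$PID" ]; then
  # 3 samples, 5 seconds apart
  OUT=$(jstat -gcutil "$PID" 5000 3)
else
  OUT="no Service Monitor (firehose) JVM found"
fi
echo "$OUT"
```

If old-gen occupancy and full-GC counts stay low while the pause alerts fire, the pauses are coming from outside the JVM.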

 

 

Explorer

Update: The load went down to a reasonable level (24), so CPU starvation should no longer be happening, but Service Monitor is still losing data from time to time, with 5-30 minute gaps. The disk it uses is striped RAID and is not shared with YARN, so I don't think the issue is disk performance.
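To double-check the disk theory, a sketch using iostat from the sysstat package; the device name is a placeholder for whatever actually backs /var/lib/cloudera-service-monitor on your host:

```shell
#!/bin/sh
# Watch latency (await) and saturation (%util) on the SM data volume
# while aggregation runs. DEV is a placeholder -- substitute the RAID
# device backing /var/lib/cloudera-service-monitor.
DEV="${DEV:-sda}"
if command -v iostat >/dev/null 2>&1; then
  # Single extended-device report; raise the count to watch over time,
  # e.g. "iostat -dx $DEV 5 12" for a minute of 5-second samples.
  iostat -dx "$DEV" 1 1
else
  echo "iostat not found; install the sysstat package"
fi
```

Low await and %util during an aggregation run would support ruling out the disk, as I suspect.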

 

 
