Support Questions

pbaclace · ‎04-13-2021

I am seeing frequent Cloudera Manager Service Monitor outages:

SERVICE_MONITOR_PAUSE_DURATION has become bad: Average time spent paused was 39.5 second(s) (65.76%) per minute over the previous 5 minute(s).

despite increasing the heap size to 7g and the 'off-heap' size to 24g. The machine often sees high load (a NodeManager runs on the same host), around 90 on a 24-core machine, so I suspect SM is starved of CPU when doing aggregation. The process regularly has 700+ files open.
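To put the load figure in per-core terms, here is a minimal sketch. The helper name `load_per_core` is hypothetical, and the interpretation is an assumption: on Linux the load average counts tasks that are runnable or in uninterruptible sleep, so a ratio well above 1 only suggests CPU contention and does not prove it.

```python
import os


def load_per_core(load_average: float, cores: int) -> float:
    """Return the load average divided by the number of cores."""
    if cores <= 0:
        raise ValueError("cores must be positive")
    return load_average / cores


# Figures quoted in the post: a load of 90 on a 24-core machine.
print(load_per_core(90, 24))

# On a live Unix host, compute the same ratio from the 1-minute load average.
if hasattr(os, "getloadavg"):
    try:
        one_minute, _, _ = os.getloadavg()
        print(load_per_core(one_minute, os.cpu_count() or 1))
    except OSError:
        # The load average could not be obtained on this system.
        pass
```

With the numbers from the post this gives 3.75 tasks per core, which is consistent with a process sometimes waiting a long time to be scheduled.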

I want to fix this because it causes data loss in the time series: SM pulls data, so while it is paused it misses data points, at times for 15+ minutes.

The WARN:

AggregatingTimeSeriesStore: run duration exceeded desired period

is correlated with the above.

Is there a documented procedure to move Service Monitor to another machine while keeping existing data?

Perhaps like:

0. stop SM to quiesce changes to /var/lib/cloudera-service-monitor/ts/
1. using CM, redefine SM on another host
2. move /var/lib/cloudera-service-monitor/ts/ contents before starting SM
3. start SM

SM uses LevelDB, but I don't know its internals or whether /var/lib/cloudera-service-monitor/ts/ can simply be moved. I don't want to lose the 1 month of history I have.
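The copy step in the proposed procedure could be sketched as below, assuming SM is already stopped so the files are not changing. This is only an illustration: the thread does not confirm that copying the LevelDB directory preserves history, and the helper names `archive_directory` and `count_files` as well as the archive path in the usage comment are hypothetical.

```python
import tarfile
from pathlib import Path


def count_files(directory: str) -> int:
    """Count regular files under directory, recursively."""
    return sum(1 for p in Path(directory).rglob("*") if p.is_file())


def archive_directory(src_dir: str, archive_path: str) -> int:
    """Pack src_dir into a gzip tarball and return how many regular files it holds.

    archive_path must be outside src_dir.
    """
    src = Path(src_dir)
    if not src.is_dir():
        raise NotADirectoryError(src_dir)
    with tarfile.open(archive_path, "w:gz") as tar:
        # Store entries under the directory's own name, e.g. "ts/...".
        tar.add(str(src), arcname=src.name)
    # Re-open the archive to verify what was actually written.
    with tarfile.open(archive_path, "r:gz") as tar:
        return sum(1 for member in tar.getmembers() if member.isfile())


# Hypothetical usage, to be run only while SM is stopped:
# src = "/var/lib/cloudera-service-monitor/ts"
# packed = archive_directory(src, "/tmp/sm-ts.tar.gz")
# assert packed == count_files(src)
```

Comparing the packed file count with the source count is a cheap sanity check before the archive is transferred and unpacked on the new host, with ownership and permissions restored there.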

pbaclace · ‎04-13-2021

Some more info:

I see WARNs like:

JvmPauseMonitor: Detected pause in JVM or host machine (e.g. a stop the world GC, or JVM not scheduled): paused approximately 28577ms

but gcutil is:

  S0     S1     E      O      M     CCS    YGC     YGCT    FGC    FGCT     GCT
  0.00  63.51  80.24   7.41  97.94  94.86   5073  347.717     6    1.950  349.668

which shows old gen is only 7.41% used, with just 6 full GCs totalling 1.950 s, so the JVM is not running out of heap. That leaves "JVM not scheduled" as the likely condition.
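The reasoning above can be checked with a little arithmetic. The sketch below parses the WARN line and the gcutil row quoted in this post and compares the reported pause with the mean GC durations. The function names are hypothetical. Note that mean durations cannot rule out a single unusually long collection; only a GC log would confirm that.

```python
import re

# Matches the pause length reported in the JvmPauseMonitor WARN line.
PAUSE_RE = re.compile(r"paused approximately (\d+)ms")


def pause_ms(log_line: str):
    """Return the reported pause in milliseconds, or None if the line has none."""
    match = PAUSE_RE.search(log_line)
    return int(match.group(1)) if match else None


def parse_gcutil(header: str, values: str) -> dict:
    """Map gcutil column names to their numeric values."""
    return dict(zip(header.split(), (float(v) for v in values.split())))


def mean_gc_ms(stats: dict) -> dict:
    """Mean young and full GC durations in milliseconds (time / count)."""
    young = stats["YGCT"] / stats["YGC"] * 1000 if stats["YGC"] else 0.0
    full = stats["FGCT"] / stats["FGC"] * 1000 if stats["FGC"] else 0.0
    return {"young": young, "full": full}


warn = ("JvmPauseMonitor: Detected pause in JVM or host machine "
        "(e.g. a stop the world GC, or JVM not scheduled): "
        "paused approximately 28577ms")
header = "S0 S1 E O M CCS YGC YGCT FGC FGCT GCT"
values = "0.00 63.51 80.24 7.41 97.94 94.86 5073 347.717 6 1.950 349.668"

stats = parse_gcutil(header, values)
means = mean_gc_ms(stats)
pause = pause_ms(warn)

print(f"reported pause: {pause} ms")
print(f"mean young GC: {means['young']:.1f} ms, mean full GC: {means['full']:.1f} ms")
```

The mean young GC comes out near 69 ms and the mean full GC at about 325 ms, both far below the 28577 ms pause, which supports the view that the JVM was not being scheduled.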

pbaclace · ‎04-14-2021

Update: The load went down to a reasonable level (24), so CPU starvation is not happening, but Service Monitor is still losing data from time to time, with gaps of 5-30 minutes. The disk it uses is striped RAID and not used by YARN, so I don't think the issue can be disk performance.

pbaclace · ‎04-16-2021

Update: I moved SM to a host that has a typical load of 7-8 instead of 24. After a day on the new machine, there have been no alerts about SM being slow and no gaps in charts.

Conclusion: The problem was host load; SM works best on a machine with low load.

"SERVICE_MONITOR_PAUSE_DURATION has become bad" despite heap increase