Created 04-13-2021 12:49 PM
I am seeing frequent Cloudera Manager Service Monitor outages:
SERVICE_MONITOR_PAUSE_DURATION has become bad: Average time spent paused was 39.5 second(s) (65.76%) per minute over the previous 5 minute(s).
This happens despite increasing the heap size to 7g and the 'off-heap' size to 24g. The machine often sees a high load, around 90 on a 24-core machine (a NodeManager runs on the same host), so I suspect SM is starved of CPU when doing aggregation. The process regularly has 700+ files open.
I am motivated to fix this because it causes data loss in the time series: SM pulls data, so while it is paused it misses data points, at times for 15+ minutes.
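To put the load figure in perspective, here is a minimal sketch that normalises load average by core count. The helper name `load_per_core` is my own, not anything from Cloudera Manager; a value well above 1.0 suggests runnable tasks are queuing for CPU.

```python
import os

def load_per_core(load_average: float, cores: int) -> float:
    """Return the load average divided by the number of cores."""
    if cores <= 0:
        raise ValueError("cores must be positive")
    return load_average / cores

# Figures quoted in this post: load of 90 on a 24-core machine.
print(round(load_per_core(90, 24), 2))  # 3.75

# On the live host (Unix only): 1-minute load average per core.
if hasattr(os, "getloadavg"):
    one_min, _, _ = os.getloadavg()
    print(round(load_per_core(one_min, os.cpu_count() or 1), 2))
```

At 3.75 runnable tasks per core, it is plausible that the SM process sometimes waits a long time to be scheduled.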
The WARN:
AggregatingTimeSeriesStore: run duration exceeded desired period
is correlated with the above.
Is there a documented procedure to move Service Monitor to another machine while keeping existing data?
Perhaps like:
0. stop SM to quiesce changes to /var/lib/cloudera-service-monitor/ts/
1. using CM, redefine SM on another host
2. move /var/lib/cloudera-service-monitor/ts/ contents before starting SM
3. start SM
SM uses LevelDB, but I don't know the internals of that and whether /var/lib/cloudera-service-monitor/ts/ can just be moved. I don't want to lose the 1 month of history I have.
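For step 2, one cautious way to copy the directory is sketched below. It assumes SM is already stopped (step 0) and that a plain file copy of the LevelDB directory is sufficient, which I have not confirmed. The function names and the staging path in the comment are hypothetical. It hashes the tree before and after the copy, so a mismatch would indicate that something was still writing to the source.

```python
import hashlib
import shutil
from pathlib import Path

def tree_digest(root: Path) -> str:
    """Hash every file's relative path and contents under root, in sorted order."""
    h = hashlib.sha256()
    for path in sorted(p for p in root.rglob("*") if p.is_file()):
        h.update(path.relative_to(root).as_posix().encode())
        h.update(b"\0")
        with path.open("rb") as f:
            for chunk in iter(lambda: f.read(1 << 20), b""):
                h.update(chunk)
    return h.hexdigest()

def copy_and_verify(src: Path, staging: Path) -> str:
    """Copy src to staging (which must not exist yet) and confirm both trees match."""
    before = tree_digest(src)
    shutil.copytree(src, staging)
    # Re-hash the source too: if it changed during the copy, SM was not quiesced.
    if tree_digest(staging) != before or tree_digest(src) != before:
        raise RuntimeError("copy differs from source; was SM really stopped?")
    return before

# Hypothetical usage (the staging path is an assumption):
# copy_and_verify(Path("/var/lib/cloudera-service-monitor/ts"), Path("/mnt/staging/ts"))
```

Note that `shutil.copytree` does not preserve file ownership, so ownership on the new host would need to be set to whatever the SM role expects before starting it. Keeping the original directory untouched until SM has started cleanly on the new host gives a fallback.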
Created 04-13-2021 02:30 PM
Some more info:
I see WARNs like:
JvmPauseMonitor: Detected pause in JVM or host machine (e.g. a stop the world GC, or JVM not scheduled): paused approximately 28577ms
but gcutil is:
   S0     S1      E      O      M    CCS    YGC     YGCT  FGC   FGCT      GCT
 0.00  63.51  80.24   7.41  97.94  94.86   5073  347.717    6  1.950  349.668
which shows old gen is only 7.41% used, so it is not out of heap. That means "JVM not scheduled" must be the condition.
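As a rough sanity check on that reasoning, the gcutil counters above can be turned into average pause times and compared with the reported 28577 ms pause. This is only a sketch: `parse_gcutil` is a helper I made up for this post, and averages can hide a single long collection.

```python
def parse_gcutil(header: str, values: str) -> dict:
    """Pair jstat -gcutil column names with their numeric values."""
    return dict(zip(header.split(), map(float, values.split())))

header = "S0 S1 E O M CCS YGC YGCT FGC FGCT GCT"
values = "0.00 63.51 80.24 7.41 97.94 94.86 5073 347.717 6 1.950 349.668"
gc = parse_gcutil(header, values)

# YGCT and FGCT are cumulative seconds; YGC and FGC are event counts.
avg_young_ms = gc["YGCT"] / gc["YGC"] * 1000
avg_full_ms = gc["FGCT"] / gc["FGC"] * 1000
reported_pause_ms = 28577  # from the JvmPauseMonitor WARN above

print(f"average young GC: {avg_young_ms:.1f} ms")
print(f"average full GC:  {avg_full_ms:.1f} ms")
print(f"reported pause vs. total full GC time: {reported_pause_ms / (gc['FGCT'] * 1000):.1f}x")
```

The averages come out at about 69 ms per young collection and 325 ms per full collection. The cumulative full GC time (1.95 s) is far shorter than the single 28.6 s pause, so full GC cannot account for it, which is consistent with the "JVM not scheduled" explanation.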
Created 04-14-2021 06:05 PM
Update: The load went down to a reasonable level (24, about one runnable task per core), so CPU starvation should no longer be happening, but Service Monitor is still losing data from time to time, with gaps of 5-30 minutes. The disk it uses is striped RAID and is not used by YARN, so I don't think disk performance is the issue.
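To quantify the gaps rather than eyeball them in charts, something like the following could be run over sample timestamps, however they are exported. The function name, the 5-minute threshold, and the sample data are all made up for illustration.

```python
from datetime import datetime, timedelta

def find_gaps(timestamps, max_interval=timedelta(minutes=5)):
    """Return (start, end) pairs where consecutive samples are further apart than max_interval."""
    ordered = sorted(timestamps)
    return [(a, b) for a, b in zip(ordered, ordered[1:]) if b - a > max_interval]

# Hypothetical samples: one per minute, with a 15-minute hole after 12:02.
samples = [
    datetime(2021, 4, 14, 12, 0),
    datetime(2021, 4, 14, 12, 1),
    datetime(2021, 4, 14, 12, 2),
    datetime(2021, 4, 14, 12, 17),
    datetime(2021, 4, 14, 12, 18),
]

for start, end in find_gaps(samples):
    print(f"gap of {end - start} between {start} and {end}")
```

Comparing the gap start times against the JvmPauseMonitor WARN timestamps would show whether every gap lines up with a pause.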
Created 04-16-2021 06:19 PM
Update: I moved SM to a host that has a typical load of 7-8 instead of 24. After a day on the new machine, no alerts about SM being slow have been generated and there are no gaps in the charts.
Conclusion: SM needs a machine with low load. Even a load of about 24 on the old 24-core host was enough to cause pauses and data gaps.