We have two CDH clusters running the same version (CDH-5.5.2-1.cdh5.5.2.p0.4), and the ResourceManagers of both clusters have the same configuration.
One of the ResourceManagers runs well: its heap usage stays at a roughly constant value (e.g. 800 MB) over time.
The other one throws an OOM exception and exits after 15 days. When we use 'jmap -F -histo' to dump its JVM heap info, we see that the total size of the 'char' objects keeps growing over time until the RM finally throws OOM.
Below is the key info from the JVM dumps of both the good RM and the OOM RM:
dump cmd: jmap -F -histo pid (the heap size of both RMs is 1 GB)
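To see which classes grow between two dumps taken some time apart, the histograms can be compared with a small script. This is only a rough sketch: the function names parse_histo and growth are made up for illustration, and it assumes each histogram row has the shape "rank: instances bytes class-name", which is what jmap usually prints (the exact layout can differ between JVM versions and with -F).

```python
import re

# Matches rows like "   1:   400000   60000000  [C" or "1:  400000  60000000  char[]".
# Header and separator lines do not match and are skipped.
ROW = re.compile(r"^\s*\d+:\s+(\d+)\s+(\d+)\s+(.+?)\s*$")


def parse_histo(text):
    """Return {class_name: (instances, bytes)} from saved jmap -histo output."""
    table = {}
    for line in text.splitlines():
        m = ROW.match(line)
        if m:
            table[m.group(3)] = (int(m.group(1)), int(m.group(2)))
    return table


def growth(before, after, top=10):
    """Return (class_name, delta_instances, delta_bytes), largest byte growth first."""
    rows = []
    for name, (inst, size) in after.items():
        old_inst, old_size = before.get(name, (0, 0))
        rows.append((name, inst - old_inst, size - old_size))
    rows.sort(key=lambda r: r[2], reverse=True)
    return rows[:top]


if __name__ == "__main__":
    import sys

    # Usage (hypothetical file names): python histo_diff.py day1.txt day5.txt
    with open(sys.argv[1]) as f1, open(sys.argv[2]) as f2:
        for name, d_inst, d_bytes in growth(parse_histo(f1.read()), parse_histo(f2.read())):
            print(f"{d_bytes:>14} bytes  {d_inst:>10} instances  {name}")
```

Running it on two histograms of the same RM, saved a few days apart, should show whether the char arrays are the only thing growing or whether other classes grow with them.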
A) JVM dump of the good RM in cluster A
we see 400,000+ char instances using 60 MB+ of heap memory
B) JVM dump of the bad RM (OOM) in cluster B
we see 300,000+ char instances, but using 400 MB+ of heap memory
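The difference is easier to see as an average size per instance. The arithmetic below is a back-of-the-envelope sketch: it takes the figures above as roughly 400,000 instances / 60 MB and 300,000 instances / 400 MB, which are lower bounds rather than exact values, and the helper name avg_bytes is made up for illustration.

```python
MB = 1024 * 1024


def avg_bytes(instances, heap_mb):
    """Average heap bytes per instance, given an instance count and a size in MB."""
    return heap_mb * MB / instances


good = avg_bytes(400_000, 60)   # good RM in cluster A
bad = avg_bytes(300_000, 400)   # OOM RM in cluster B

print(f"good RM: ~{good:.0f} bytes per instance")
print(f"OOM RM:  ~{bad:.0f} bytes per instance")
print(f"ratio:   ~{bad / good:.1f}x")
```

On these rough numbers the OOM RM holds fewer char instances, but each one is on average about 9 times larger. That points at a smaller number of large char arrays being retained rather than at many small ones.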
Any help will be appreciated.
Today we took a full heap dump (jmap -F -dump:file=file.dump_result pid) and analysed the dump file with MAT (Memory Analyzer Tool). We found that the instance variable applications (a java.util.concurrent.ConcurrentHashMap) in org.apache.hadoop.yarn.server.resourcemanager.RMActiveServiceContext takes up a large share of the memory.
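One way to cross-check this from outside the JVM might be to count the applications each RM reports through its REST API (/ws/v1/cluster/apps) and compare the two clusters. The sketch below assumes the usual JSON shape {"apps": {"app": [...]}}; the function names count_by_state and fetch_counts are made up for illustration, and the RM address in the usage comment is a placeholder.

```python
import json
from collections import Counter
from urllib.request import Request, urlopen


def count_by_state(payload):
    """Count applications per state in a parsed /ws/v1/cluster/apps response."""
    # "apps" is null when the RM has no applications to report.
    apps = (payload.get("apps") or {}).get("app") or []
    return Counter(app.get("state", "UNKNOWN") for app in apps)


def fetch_counts(rm_url):
    """Fetch the application list from an RM web address and count apps per state."""
    req = Request(rm_url.rstrip("/") + "/ws/v1/cluster/apps",
                  headers={"Accept": "application/json"})
    with urlopen(req) as resp:
        return count_by_state(json.load(resp))


# Usage (placeholder address):
#   print(fetch_counts("http://rm-host:8088"))
```

If the two RMs report a similar number of applications while only one of them grows, the count alone does not explain the difference, and the size of the individual entries in the map would be the next thing to look at in MAT. If the bad RM retains far more completed applications, the yarn.resourcemanager.max-completed-applications setting may be worth checking; that is only a guess at this point, not a diagnosis.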