In my Cloudera Hadoop cluster, the Service Monitor has been failing frequently for the last 2 days, and every time I see logs like this:
Apr 2, 4:09:11.402 AM ERROR com.cloudera.cmon.kaiser.BaseTestRunner
Error running subject health tests
Caused by: java.lang.NullPointerException
... 1 more
I have increased the Service Monitor memory settings: "Java Heap Size of Service Monitor" from 5 GB to 8 GB, and "Maximum Non-Java Memory of Service Monitor" from 12 GB to 16 GB.
Can someone please provide a solution for these frequent failures?
Thanks in advance,
The stack traces you are seeing are the same ones that used to occur due to:
Essentially, that bug in the JMX response caused an exception in the agent when it attempted to retrieve the JMX JSON response. In turn, when the Service Monitor attempted to ingest the information uploaded by the agent, it found null where a non-null value was expected.
This issue was fixed in CDH 5.9.0 and the fix exists in all later releases.
Please check the agent log on the host where any role showing bad health is running, and see whether it contains exceptions similar to:
File "/usr/lib64/cmf/agent/build/env/lib/python2.6/site-packages/simplejson-2.1.2-py2.6-linux-x86_64.egg/simplejson/decoder.py", line 418, in raw_decode
obj, end = self.scan_once(s, idx)
JSONDecodeError: Expecting object: line 866 column 568 (char 121642)
That will help confirm you are hitting the bug.
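If the agent log is large, the check above can be scripted. The following is only a minimal sketch: the log path is an assumption (a commonly used default location that may differ on your hosts), and the helper names `count_json_decode_errors` and `scan_agent_log` are made up for illustration, not part of Cloudera Manager.

```python
import re
from collections import Counter

# Assumed agent log location; adjust it if your installation logs elsewhere.
DEFAULT_AGENT_LOG = "/var/log/cloudera-scm-agent/cloudera-scm-agent.log"

# Captures the reason text between "JSONDecodeError: " and the next colon,
# e.g. "Expecting object" or "Unterminated string starting at".
JSON_ERROR = re.compile(r"JSONDecodeError: (?P<reason>[^:]+)")


def count_json_decode_errors(lines):
    """Count JSONDecodeError lines, grouped by their reason text."""
    counts = Counter()
    for line in lines:
        match = JSON_ERROR.search(line)
        if match:
            counts[match.group("reason").strip()] += 1
    return counts


def scan_agent_log(path=DEFAULT_AGENT_LOG):
    """Scan one agent log file (hypothetical helper) and return the counts."""
    with open(path, errors="replace") as handle:
        return count_json_decode_errors(handle)
```

Running `scan_agent_log()` on an affected host and getting a non-empty result would suggest the agent is failing to parse JSON responses, which is consistent with the bug described above. It does not prove the cause on its own.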
Thanks for your valuable response. Yes, I can see logs like the ones you mentioned in the agent log on one of my servers, dated today:
Traceback (most recent call last):
File "/usr/lib64/cmf/agent/build/env/lib/python2.6/site-packages/cmf-5.8.1-py2.6.egg/cmf/monitor/generic/metric_collectors.py", line 203, in _collect_and_parse_and_return
File "/usr/lib64/cmf/agent/build/env/lib/python2.6/site-packages/simplejson-2.1.2-py2.6-linux-x86_64.egg/simplejson/__init__.py", line 328, in load
File "/usr/lib64/cmf/agent/build/env/lib/python2.6/site-packages/simplejson-2.1.2-py2.6-linux-x86_64.egg/simplejson/__init__.py", line 384, in loads
File "/usr/lib64/cmf/agent/build/env/lib/python2.6/site-packages/simplejson-2.1.2-py2.6-linux-x86_64.egg/simplejson/decoder.py", line 402, in decode
obj, end = self.raw_decode(s, idx=_w(s, 0).end())
File "/usr/lib64/cmf/agent/build/env/lib/python2.6/site-packages/simplejson-2.1.2-py2.6-linux-x86_64.egg/simplejson/decoder.py", line 418, in raw_decode
obj, end = self.scan_once(s, idx)
JSONDecodeError: Unterminated string starting at: line 528 column 19 (char 73958)
[02/Apr/2019 01:59:13 +0000] 9875 Audit-Plugin navigator_plugin INFO stopping Audit Plugin for hbase-REGIONSERVER with pipelines [HBaseAuditAppender,RegionAuditCoProcessor]
NOTE: We have been using this cluster for the last two years and recently migrated from RHEL to CentOS. Could that be a cause of this kind of issue?
So is this an issue with the CM version?
Do you recommend that we upgrade from CM 5.8 to CM 5.9.0?
A recommendation here is a bit difficult, as the latest release of CM and CDH is 6.2. CDH would need to be at 5.7 or later before you could then upgrade to CDH 6.2.
The CDH 5.x branch is now at 5.16.1.
It really depends on what you are willing to do regarding upgrading the cluster.
The issue is actually in CDH, not Cloudera Manager, so you would need to upgrade both, to 5.9.0 at the very least.
If you are not ready to make the leap to C6 yet, I would encourage you to consider upgrading to CM/CDH 5.16.1, as your versions are no longer maintained and will receive no more fixes.
Of course, upgrading can require some planning, so make sure to look through our upgrade documentation:
In general, you would upgrade Cloudera Manager first and then CDH.
Since the fix is in a later release of CDH than you have, upgrading would be the way to fix this particular bug.
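For anyone checking several clusters, the rule stated earlier (fixed in CDH 5.9.0 and present in all later releases) can be expressed as a simple version comparison. This is a sketch only: `parse_version` and `has_fix` are hypothetical helper names, and it assumes plain dotted numeric version strings such as "5.16.1".

```python
# First CDH release containing the fix, as stated in this thread.
FIXED_IN = (5, 9, 0)


def parse_version(text):
    """Turn a dotted version string like '5.16.1' into a 3-part tuple of ints."""
    parts = [int(part) for part in text.strip().split(".")]
    # Pad short versions such as '6.2' with zeros so comparisons are consistent.
    parts += [0] * (3 - len(parts))
    return tuple(parts)


def has_fix(cdh_version):
    """Return True if the given CDH version is 5.9.0 or later."""
    return parse_version(cdh_version) >= FIXED_IN
```

For example, `has_fix("5.16.1")` and `has_fix("6.2")` return True, while a 5.8.x version string returns False.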