
Service Monitor was failing frequently

Explorer

Hello ,

 

In my Cloudera Hadoop cluster, the Service Monitor has been failing frequently for the last 2 days, and every time I see log entries like this:

 

Apr 2, 4:09:11.402 AM ERROR com.cloudera.cmon.kaiser.BaseTestRunner
Error running subject health tests
java.util.concurrent.ExecutionException: java.lang.NullPointerException
at java.util.concurrent.FutureTask.report(FutureTask.java:122)
at java.util.concurrent.FutureTask.get(FutureTask.java:188)
at com.cloudera.cmon.kaiser.BaseTestRunner.submitTestsOnSubjectsByType(BaseTestRunner.java:232)
at com.cloudera.cmon.kaiser.SMONTestRunner.runRoleAndServiceTestsForSession(SMONTestRunner.java:166)
at com.cloudera.cmon.kaiser.SMONTestRunner.runTestsForSession(SMONTestRunner.java:137)
at com.cloudera.cmon.kaiser.BaseTestRunner.runTestsOnAllSubjects(BaseTestRunner.java:143)
at com.cloudera.cmon.kaiser.KaiserService$KaiserServiceRunner.run(KaiserService.java:138)
at java.lang.Thread.run(Thread.java:745)
Caused by: java.lang.NullPointerException
at com.cloudera.cmon.kaiser.KaiserSubjectRecordFactory.extendStatusForSpecialRoles(KaiserSubjectRecordFactory.java:732)
at com.cloudera.cmon.kaiser.KaiserSubjectRecordFactory.createForRole(KaiserSubjectRecordFactory.java:485)
at com.cloudera.cmon.kaiser.BaseTestRunner$2.run(BaseTestRunner.java:295)
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471)
at java.util.concurrent.FutureTask.run(FutureTask.java:262)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
... 1 more

 

 

I have increased the Service Monitor memory settings: "Java Heap Size of Service Monitor" from 5 GB to 8 GB, and "Maximum Non-Java Memory of Service Monitor" from 12 GB to 16 GB.

 

Can someone please provide a solution for these frequent failures?

 

 

Thanks in advance,

Vinod

6 REPLIES

Re: Service Monitor was failing frequently

Expert Contributor

Hi @kvinod,

 

What version of CM are you using?

 

 

Regards,

Manu.

Re: Service Monitor was failing frequently

Explorer

Hi Manuroman,

 

We are using CDH 5.4.10, and the Cloudera Manager version is 5.8.0.

 

Thanks,

Vinod

Re: Service Monitor was failing frequently

Explorer

Hello

 

Can anybody help me with this issue?

 

Thanks,

Vinod

Re: Service Monitor was failing frequently

Super Guru

@kvinod,

 

The stack traces you are seeing are the same that used to occur due to:

 

https://issues.apache.org/jira/browse/HADOOP-11361

 

Essentially, that bug in the JMX response caused an exception in the agent when it attempted to retrieve the JMX JSON response. In turn, when the Service Monitor attempted to ingest the information uploaded by the agent, it found null where a non-null value was expected.
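
To illustrate that chain of events, here is a minimal sketch, not the actual agent or Service Monitor code. The function names `parse_jmx_response` and `bean_count` and the sample payloads are hypothetical, and the standard-library `json` module stands in for the `simplejson` package the agent uses. It shows a truncated JSON body failing to decode, leaving the consumer with nothing unless it guards against the missing value.

```python
import json


def parse_jmx_response(body):
    """Return the decoded payload, or None if the body is not valid JSON."""
    try:
        return json.loads(body)
    except json.JSONDecodeError:
        # Comparable to the agent-side decode failure: nothing usable remains.
        return None


def bean_count(payload):
    """Hypothetical consumer that checks for a missing payload before using it."""
    if payload is None:
        return 0
    return len(payload.get("beans", []))


# A complete response decodes normally.
good = parse_jmx_response('{"beans": [{"name": "Hadoop:service=HBase"}]}')

# A response cut off mid-string raises a decode error, so we get None back.
bad = parse_jmx_response('{"beans": [{"name": "Hadoop:service=HBa')

print(bean_count(good), bean_count(bad))
```

A consumer without the `None` check would fail at the point of use, which is roughly the situation the NullPointerException in the Service Monitor log points to.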

 

This issue was fixed in CDH 5.9.0 and the fix exists in all later releases.
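
If you want to check a list of clusters against that version, a small sketch like the following may help. The helper names `parse_version` and `has_fix` are made up for illustration, and it assumes plain dotted numeric version strings. Versions are compared as integer tuples, because a plain string comparison would wrongly rank "5.16.1" below "5.9.0".

```python
def parse_version(text):
    """Turn a dotted version such as '5.4.10' into a tuple of integers."""
    return tuple(int(part) for part in text.split("."))


# First CDH release stated in this thread to contain the fix.
FIXED_IN = parse_version("5.9.0")


def has_fix(cdh_version):
    """Return True if the given CDH version is 5.9.0 or later."""
    return parse_version(cdh_version) >= FIXED_IN


for version in ("5.4.10", "5.9.0", "5.16.1"):
    print(version, has_fix(version))
```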

 

If you can, check the agent log on the host where any role showing bad health is running, and see whether it contains exceptions similar to:

File "/usr/lib64/cmf/agent/build/env/lib/python2.6/site-packages/simplejson-2.1.2-py2.6-linux-x86_64.egg/simplejson/decoder.py", line 418, in raw_decode
    obj, end = self.scan_once(s, idx)
JSONDecodeError: Expecting object: line 866 column 568 (char 121642)

That will help confirm you are hitting the bug. 
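
If the agent log is large, a short script can pull out the relevant lines. This is only a sketch: the log path below is the usual default but should be treated as an assumption for your install, and the names `find_decode_errors` and `scan_file` are hypothetical.

```python
import re

# Assumed default agent log location; adjust for your environment.
AGENT_LOG = "/var/log/cloudera-scm-agent/cloudera-scm-agent.log"

# Matches the final line of the traceback, e.g. "JSONDecodeError: Expecting object: ..."
DECODE_ERROR = re.compile(r"^\s*JSONDecodeError: (?P<detail>.*)$")


def find_decode_errors(lines):
    """Return (line_number, detail) for every JSONDecodeError line found."""
    hits = []
    for number, line in enumerate(lines, start=1):
        match = DECODE_ERROR.match(line.rstrip("\n"))
        if match:
            hits.append((number, match.group("detail")))
    return hits


def scan_file(path=AGENT_LOG):
    """Scan a log file on disk; undecodable bytes are replaced, not fatal."""
    with open(path, errors="replace") as handle:
        return find_decode_errors(handle)


# Example using two lines in the same shape as the traceback above.
sample = [
    "    obj, end = self.scan_once(s, idx)\n",
    "JSONDecodeError: Expecting object: line 866 column 568 (char 121642)\n",
]
print(find_decode_errors(sample))
```

On the affected host you would call `scan_file()` instead of passing sample lines.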

Re: Service Monitor was failing frequently

Explorer

Hello bgooley,

 

Thanks for your valuable response. Yes, I can see log entries like the ones you mentioned in the agent log on one of my servers for today's date:

 

 

 

Traceback (most recent call last):
  File "/usr/lib64/cmf/agent/build/env/lib/python2.6/site-packages/cmf-5.8.1-py2.6.egg/cmf/monitor/generic/metric_collectors.py", line 203, in _collect_and_parse_and_return
    simplejson.load(opened_url))
  File "/usr/lib64/cmf/agent/build/env/lib/python2.6/site-packages/simplejson-2.1.2-py2.6-linux-x86_64.egg/simplejson/__init__.py", line 328, in load
    use_decimal=use_decimal, **kw)
  File "/usr/lib64/cmf/agent/build/env/lib/python2.6/site-packages/simplejson-2.1.2-py2.6-linux-x86_64.egg/simplejson/__init__.py", line 384, in loads
    return _default_decoder.decode(s)
  File "/usr/lib64/cmf/agent/build/env/lib/python2.6/site-packages/simplejson-2.1.2-py2.6-linux-x86_64.egg/simplejson/decoder.py", line 402, in decode
    obj, end = self.raw_decode(s, idx=_w(s, 0).end())
  File "/usr/lib64/cmf/agent/build/env/lib/python2.6/site-packages/simplejson-2.1.2-py2.6-linux-x86_64.egg/simplejson/decoder.py", line 418, in raw_decode
    obj, end = self.scan_once(s, idx)
JSONDecodeError: Unterminated string starting at: line 528 column 19 (char 73958)
[02/Apr/2019 01:59:13 +0000] 9875 Audit-Plugin navigator_plugin INFO stopping Audit Plugin for hbase-REGIONSERVER with pipelines [HBaseAuditAppender,RegionAuditCoProcessor]

 

 

NOTE: We have been using this cluster for the last two years and recently migrated from RHEL to CentOS. Could that be a cause of these kinds of issues?

 

 

So is this an issue with the CM version?

Do you recommend that we upgrade from CM 5.8 to CM 5.9.0?

 

 

Thanks,

Vinod

Re: Service Monitor was failing frequently

Super Guru

@kvinod,

 

A recommendation here is a bit difficult, as the latest release of CM and CDH is 6.2. CDH would need to be upgraded to at least 5.7 before you could then upgrade to CDH 6.2.

 

The CDH 5.x branch is now at 5.16.1.

 

It really depends on what you are willing to do regarding upgrading the cluster.

 

The issue is actually in CDH, not Cloudera Manager, so you would need to upgrade both CM and CDH to 5.9.0 at the very least.

 

If you are not ready to make the leap to C6 yet, I would encourage you to consider upgrading to CM/CDH 5.16.1 as your versions are not being maintained any longer and will receive no more fixes.

 

Of course, upgrading can require some planning, so make sure to look through our upgrade documentation:

 

https://www.cloudera.com/documentation/enterprise/upgrade/topics/ug_overview.html

 

In general, you would upgrade Cloudera Manager first and then CDH.

 

Since the fix is in a later release of CDH than you have, upgrading would be the way to fix this particular bug.
