Explorer
Posts: 7
Registered: ‎06-05-2017

Oozie server is getting in bad health- OOZIE_SERVER_WEB_METRIC_COLLECTION

We have two Oozie instances running, and one of them goes into bad health once a day with the message below.

OOZIE_SERVER_WEB_METRIC_COLLECTION

Role health test bad

Critical

The health test result for OOZIE_SERVER_WEB_METRIC_COLLECTION has become bad: The Cloudera Manager Agent is not able to communicate with this role's web server.

 

 

2017-10-31 13:12:08,816 INFO com.cloudera.cmon.firehose.polling.oozie.OozieServerStateFetcher: Could not access Oozie Server oozie-OOZIE_SERVER-b28b7b48e807ce7c78f0ea0a52c0f67aMetricsInstrumentationService. Will attempt to access Instrumentation Service end-point.
java.net.SocketTimeoutException: Read timed out
at java.net.SocketInputStream.socketRead0(Native Method)
at java.net.SocketInputStream.socketRead(SocketInputStream.java:116)
at java.net.SocketInputStream.read(SocketInputStream.java:170)
at java.net.SocketInputStream.read(SocketInputStream.java:141)
at java.io.BufferedInputStream.fill(BufferedInputStream.java:246)
at java.io.BufferedInputStream.read1(BufferedInputStream.java:286)
at java.io.BufferedInputStream.read(BufferedInputStream.java:345)
at sun.net.www.http.ChunkedInputStream.readAheadBlocking(ChunkedInputStream.java:552)
at sun.net.www.http.ChunkedInputStream.readAhead(ChunkedInputStream.java:609)
at sun.net.www.http.ChunkedInputStream.read(ChunkedInputStream.java:696)
at java.io.FilterInputStream.read(FilterInputStream.java:133)
at sun.net.www.protocol.http.HttpURLConnection$HttpInputStream.read(HttpURLConnection.java:3335)
at com.fasterxml.jackson.core.json.UTF8StreamJsonParser.loadMore(UTF8StreamJsonParser.java:174)
at com.fasterxml.jackson.core.json.UTF8StreamJsonParser._skipWSOrEnd(UTF8StreamJsonParser.java:2489)
at com.fasterxml.jackson.core.json.UTF8StreamJsonParser.nextToken(UTF8StreamJsonParser.java:626)
at com.fasterxml.jackson.databind.deser.std.BaseNodeDeserializer.deserializeObject(JsonNodeDeserializer.java:192)
at com.fasterxml.jackson.databind.deser.std.BaseNodeDeserializer.deserializeObject(JsonNodeDeserializer.java:197)
at com.fasterxml.jackson.databind.deser.std.BaseNodeDeserializer.deserializeObject(JsonNodeDeserializer.java:197)
at com.fasterxml.jackson.databind.deser.std.JsonNodeDeserializer.deserialize(JsonNodeDeserializer.java:58)
at com.fasterxml.jackson.databind.deser.std.JsonNodeDeserializer.deserialize(JsonNodeDeserializer.java:15)
at com.fasterxml.jackson.databind.ObjectMapper._readMapAndClose(ObjectMapper.java:2796)
at com.fasterxml.jackson.databind.ObjectMapper.readTree(ObjectMapper.java:1627)
at com.cloudera.cmon.JsonMetricsExtractor.extractMetrics(JsonMetricsExtractor.java:227)
at com.cloudera.cmon.firehose.polling.oozie.OozieMetricsServiceFetcher.fetch(OozieMetricsServiceFetcher.java:259)
at com.cloudera.cmon.firehose.polling.oozie.OozieServerStateFetcher.tryFetchFromBothEndPoints(OozieServerStateFetcher.java:311)
at com.cloudera.cmon.firehose.polling.oozie.OozieServerStateFetcher.updateOozieMetrics(OozieServerStateFetcher.java:247)
at com.cloudera.cmon.firehose.polling.oozie.OozieServerStateFetcher.doWork(OozieServerStateFetcher.java:198)
at com.cloudera.cmon.firehose.polling.oozie.OozieServerStateFetcher.doWork(OozieServerStateFetcher.java:54)
at com.cloudera.cmon.firehose.polling.CdhTask$InstrumentedWork.doWork(CdhTask.java:230)
at com.cloudera.cmf.cdhclient.CdhExecutor$1.call(CdhExecutor.java:125)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
2017-10-31 13:12:08,818 WARN com.cloudera.cmon.firehose.polling.oozie.OozieServerStateFetcher: Could not retrieve oozie metrics for oozie-OOZIE_SERVER-b28b7b48e807ce7c78f0ea0a52c0f67a
java.io.IOException: Server returned HTTP response code: 503 for URL: http://localhost:11000/oozie/v2/admin/instrumentation
at sun.net.www.protocol.http.HttpURLConnection.getInputStream0(HttpURLConnection.java:1839)
at sun.net.www.protocol.http.HttpURLConnection.getInputStream(HttpURLConnection.java:1440)
at com.cloudera.enterprise.UrlUtil.readUrlWithTimeouts(UrlUtil.java:69)
at com.cloudera.cmon.firehose.polling.oozie.OozieInstrumentationServiceFetcher.getInputStream(OozieInstrumentation

 

 

Please let me know how to fix this.

Cloudera Employee
Posts: 219
Registered: ‎03-23-2015

Re: Oozie server is getting in bad health- OOZIE_SERVER_WEB_METRIC_COLLECTION

Have you checked what errors appear in the Oozie server log?

It looks like CM was not able to access Oozie for some reason. The Oozie server log might give you a clue.
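
One way to narrow this down, as a rough sketch rather than an official Cloudera procedure, is to request the same instrumentation URL that appears in the log from the Oozie host itself and record how long each call takes. The Python below is only an illustration: the `probe` helper and the 20-second timeout are assumptions, and the URL is the one from the stack trace.

```python
import time
import urllib.error
import urllib.request

# URL copied from the log in the first post.
URL = "http://localhost:11000/oozie/v2/admin/instrumentation"

# Ignore any proxy settings in the environment, since the target is local.
_opener = urllib.request.build_opener(urllib.request.ProxyHandler({}))


def probe(url, timeout_s=20.0):
    """Issue one GET and return (outcome, elapsed_seconds).

    outcome is "HTTP <code>" when the server answered, or "failed: <reason>"
    for timeouts and connection errors. timeout_s is an assumed value.
    """
    start = time.monotonic()
    try:
        with _opener.open(url, timeout=timeout_s) as resp:
            resp.read()  # read the whole body so a stalled response is noticed
            outcome = "HTTP %d" % resp.status
    except urllib.error.HTTPError as exc:
        # The server answered, but with an error status such as 503.
        outcome = "HTTP %d" % exc.code
    except OSError as exc:
        # Covers URLError, connection refused, and socket read timeouts.
        outcome = "failed: %s" % exc
    return outcome, time.monotonic() - start
```

Calling `probe(URL)` at regular intervals around the time the alert usually fires would show whether the endpoint answers normally, returns 503, or times out.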
Explorer
Posts: 7
Registered: ‎06-05-2017

Re: Oozie server is getting in bad health- OOZIE_SERVER_WEB_METRIC_COLLECTION

Hi Eric,

Thanks for your prompt response. I couldn't find any errors or warnings in the Oozie logs. Could you please suggest the next step to identify the root cause?

Cloudera Employee
Posts: 219
Registered: ‎03-23-2015

Re: Oozie server is getting in bad health- OOZIE_SERVER_WEB_METRIC_COLLECTION

How much heap does Oozie have? Have you noticed any GC pauses in the Oozie server? A long pause could hang the Oozie process and cause the timeout on the client connection from CM.

It's worth checking to rule this out.
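
If GC logging is enabled on the Oozie server (for example with the Java 8 HotSpot flags `-XX:+PrintGCDetails` and `-Xloggc:<file>`, which this thread does not confirm), a short script can flag long pauses. This is a minimal sketch: `long_pauses` and the 5-second threshold are hypothetical choices, and it only looks at the `real=... secs` field of `[Times: ...]` entries.

```python
import re

# Wall-clock field of a HotSpot "[Times: user=... sys=..., real=... secs]" entry.
REAL_SECS = re.compile(r"real=(\d+(?:\.\d+)?) secs")


def long_pauses(lines, threshold_s=5.0):
    """Yield (line_number, seconds) for GC entries lasting at least threshold_s.

    lines is any iterable of log lines, such as an open file.
    threshold_s is an assumed cutoff; adjust it to the CM poll timeout.
    """
    for number, line in enumerate(lines, start=1):
        match = REAL_SECS.search(line)
        if match is None:
            continue  # line carries no timing information
        seconds = float(match.group(1))
        if seconds >= threshold_s:
            yield number, seconds
```

Passing an open GC log file to `long_pauses` and comparing the flagged entries with the alert timestamps would show whether the pauses line up with the bad-health events.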
Explorer
Posts: 7
Registered: ‎06-05-2017

Re: Oozie server is getting in bad health- OOZIE_SERVER_WEB_METRIC_COLLECTION

We have given Oozie a 4 GB heap. During the alert, heap usage reaches 1.3 GB on both Oozie server instances.

As I said earlier, we have two Oozie instances, and only one of them is reporting WEB_SERVER_STATUS_BAD.
Explorer
Posts: 7
Registered: ‎06-05-2017

Re: Oozie server is getting in bad health- OOZIE_SERVER_WEB_METRIC_COLLECTION

FYI, we are getting alerts for both Oozie servers. Please let me know if there is anything else that needs to be checked.
Explorer
Posts: 7
Registered: ‎06-05-2017

Re: Oozie server is getting in bad health- OOZIE_SERVER_WEB_METRIC_COLLECTION

<property>
  <name>oozie.poller.timeout.millis</name>
  <value>20000</value>
</property>

Should I add the above configuration property to cmon.conf? Cloudera mentioned that this issue was fixed in CDH 5.4.5, and we are using CDH 5.8.3. Please advise.
