Created on 11-19-2014 08:40 AM - last edited on 11-05-2019 08:22 AM by Robert Justice
Hello,
today CDH5.1.3 suddenly stopped working. Health monitoring no longer works, but I can still access Cloudera Manager (web). First things first: I took a look at cloudera-scm-server.log, and here is the output:
2014-11-19 16:20:24,106 INFO [1310736637@agentServer-0:components.StalenessChecker@69] No staleness check scheduled, scheduling one in 30 seconds 2014-11-19 16:20:32,103 ERROR [WebServerImpl:cmf.TsqueryAutoCompleter@391] Error getting predicates org.apache.avro.AvroRemoteException: java.net.ConnectException: Connection refused at org.apache.avro.ipc.specific.SpecificRequestor.invoke(SpecificRequestor.java:88) at com.sun.proxy.$Proxy94.getImpalaFilterMetadata(Unknown Source) at com.cloudera.cmf.protocol.firehose.nozzle.TimeoutNozzleIPC.getImpalaFilterMetadata(TimeoutNozzleIPC.java:377) at com.cloudera.server.web.cmf.impala.components.ImpalaDao.fetchFilterMetadata(ImpalaDao.java:688) at com.cloudera.server.web.cmf.work.AbstractWorkDao.getAndUpdateAutoCompleter(AbstractWorkDao.java:117) at com.cloudera.server.web.cmf.TsqueryAutoCompleter.<init>(TsqueryAutoCompleter.java:181) at com.cloudera.server.web.cmf.charts.TimeSeriesQueryController.initialize(TimeSeriesQueryController.java:96) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:606) at org.springframework.beans.factory.annotation.InitDestroyAnnotationBeanPostProcessor$LifecycleElement.invoke(InitDestroyAnnotationBeanPostProcessor.java:340) at org.springframework.beans.factory.annotation.InitDestroyAnnotationBeanPostProcessor$LifecycleMetadata.invokeInitMethods(InitDestroyAnnotationBeanPostProcessor.java:293) at org.springframework.beans.factory.annotation.InitDestroyAnnotationBeanPostProcessor.postProcessBeforeInitialization(InitDestroyAnnotationBeanPostProcessor.java:130) at org.springframework.beans.factory.support.AbstractAutowireCapableBeanFactory.applyBeanPostProcessorsBeforeInitialization(AbstractAutowireCapableBeanFactory.java:394) at 
org.springframework.beans.factory.support.AbstractAutowireCapableBeanFactory.initializeBean(AbstractAutowireCapableBeanFactory.java:1413) at org.springframework.beans.factory.support.AbstractAutowireCapableBeanFactory.doCreateBean(AbstractAutowireCapableBeanFactory.java:519) at org.springframework.beans.factory.support.AbstractAutowireCapableBeanFactory.createBean(AbstractAutowireCapableBeanFactory.java:456) at org.springframework.beans.factory.support.AbstractBeanFactory$1.getObject(AbstractBeanFactory.java:293) at org.springframework.beans.factory.support.DefaultSingletonBeanRegistry.getSingleton(DefaultSingletonBeanRegistry.java:222) at org.springframework.beans.factory.support.AbstractBeanFactory.doGetBean(AbstractBeanFactory.java:290) at org.springframework.beans.factory.support.AbstractBeanFactory.getBean(AbstractBeanFactory.java:192) at org.springframework.beans.factory.support.DefaultListableBeanFactory.preInstantiateSingletons(DefaultListableBeanFactory.java:585) at org.springframework.context.support.AbstractApplicationContext.finishBeanFactoryInitialization(AbstractApplicationContext.java:895) at org.springframework.context.support.AbstractApplicationContext.refresh(AbstractApplicationContext.java:425) at org.springframework.web.servlet.FrameworkServlet.createWebApplicationContext(FrameworkServlet.java:467) at org.springframework.web.servlet.FrameworkServlet.createWebApplicationContext(FrameworkServlet.java:483) at org.springframework.web.servlet.FrameworkServlet.initWebApplicationContext(FrameworkServlet.java:358) at org.springframework.web.servlet.FrameworkServlet.initServletBean(FrameworkServlet.java:325) at org.springframework.web.servlet.HttpServletBean.init(HttpServletBean.java:127) at javax.servlet.GenericServlet.init(GenericServlet.java:241) at org.mortbay.jetty.servlet.ServletHolder.initServlet(ServletHolder.java:440) at org.mortbay.jetty.servlet.ServletHolder.doStart(ServletHolder.java:263) at 
org.mortbay.component.AbstractLifeCycle.start(AbstractLifeCycle.java:50) at org.mortbay.jetty.servlet.ServletHandler.initialize(ServletHandler.java:736) at org.mortbay.jetty.servlet.Context.startContext(Context.java:140) at org.mortbay.jetty.webapp.WebAppContext.startContext(WebAppContext.java:1282) at org.mortbay.jetty.handler.ContextHandler.doStart(ContextHandler.java:518) at org.mortbay.jetty.webapp.WebAppContext.doStart(WebAppContext.java:499) at org.mortbay.component.AbstractLifeCycle.start(AbstractLifeCycle.java:50) at com.cloudera.server.cmf.WebServerImpl.run(WebServerImpl.java:277) Caused by: java.net.ConnectException: Connection refused at java.net.PlainSocketImpl.socketConnect(Native Method) at java.net.AbstractPlainSocketImpl.doConnect(AbstractPlainSocketImpl.java:339) at java.net.AbstractPlainSocketImpl.connectToAddress(AbstractPlainSocketImpl.java:200) at java.net.AbstractPlainSocketImpl.connect(AbstractPlainSocketImpl.java:182) at java.net.SocksSocketImpl.connect(SocksSocketImpl.java:392) at java.net.Socket.connect(Socket.java:579) at sun.net.NetworkClient.doConnect(NetworkClient.java:175) at sun.net.www.http.HttpClient.openServer(HttpClient.java:432) at sun.net.www.http.HttpClient.openServer(HttpClient.java:527) at sun.net.www.http.HttpClient.<init>(HttpClient.java:211) at sun.net.www.http.HttpClient.New(HttpClient.java:308) at sun.net.www.http.HttpClient.New(HttpClient.java:326) at sun.net.www.protocol.http.HttpURLConnection.getNewHttpClient(HttpURLConnection.java:996) at sun.net.www.protocol.http.HttpURLConnection.plainConnect(HttpURLConnection.java:932) at sun.net.www.protocol.http.HttpURLConnection.connect(HttpURLConnection.java:850) at sun.net.www.protocol.http.HttpURLConnection.getOutputStream(HttpURLConnection.java:1091) at org.apache.avro.ipc.HttpTransceiver.writeBuffers(HttpTransceiver.java:71) at org.apache.avro.ipc.Transceiver.transceive(Transceiver.java:58) at org.apache.avro.ipc.Transceiver.transceive(Transceiver.java:72) at 
org.apache.avro.ipc.Requestor.request(Requestor.java:147) at org.apache.avro.ipc.Requestor.request(Requestor.java:101) at org.apache.avro.ipc.specific.SpecificRequestor.invoke(SpecificRequestor.java:72) ... 40 more 2014-11-19 16:20:36,014 INFO [JvmPauseMonitor:debug.JvmPauseMonitor@236] Detected pause in JVM or host machine (e.g. a stop the world GC, or JVM not scheduled): paused approximately 1146ms: GC pool 'PS MarkSweep' had collection(s): count=1 time=1418ms, GC pool 'PS Scavenge' had collection(s): count=1 time=209ms 2014-11-19 16:20:36,016 INFO [JvmPauseMonitor:debug.JvmPauseMonitor@236] Detected pause in JVM or host machine (e.g. a stop the world GC, or JVM not scheduled): paused approximately 1545ms: GC pool 'PS MarkSweep' had collection(s): count=1 time=1418ms, GC pool 'PS Scavenge' had collection(s): count=1 time=209ms
So my question is: do you have any clue what might have gone wrong, and where should I start?
Is there a verbosity/debug option?
Initially, the memory settings (heap / non-Java) were ~30% of the recommended ones. I have now set them to 100%, but the issue persists.
Your help is much appreciated.
Gin
Created 11-19-2014 11:46 AM
Created 11-19-2014 02:04 PM
Hi Darren,
thanks for the reply.
Well, there are two main errors: "connection refused" and "error while getting descriptor" (from web:7180):
####################AGENT: [19/Nov/2014 17:35:18 +0000] 3160 MonitorDaemon-Reporter throttling_logger ERROR (9 skipped) Error sending messages to firehose: mgmt-HOSTMONITOR-cd23de091d9f93f400336360b549bb6a Traceback (most recent call last): File "/usr/lib/cmf/agent/src/cmf/monitor/firehose.py", line 71, in _send self._port) File "/usr/lib/cmf/agent/build/env/lib/python2.7/site-packages/avro-1.6.3-py2.7.egg/avro/ipc.py", line 464, in __init__ self.conn.connect() File "/usr/lib/python2.7/httplib.py", line 757, in connect self.timeout, self.source_address) File "/usr/lib/python2.7/socket.py", line 571, in create_connection raise err error: [Errno 111] Connection refused ####################EVENTSERVER: 2014-11-19 16:03:24,265 WARN com.cloudera.cmf.BasicScmProxy: IOException while getting descriptor java.net.ConnectException: Connection refused at java.net.PlainSocketImpl.socketConnect(Native Method) at java.net.AbstractPlainSocketImpl.doConnect(AbstractPlainSocketImpl.java:339) at java.net.AbstractPlainSocketImpl.connectToAddress(AbstractPlainSocketImpl.java:200) at java.net.AbstractPlainSocketImpl.connect(AbstractPlainSocketImpl.java:182) at java.net.SocksSocketImpl.connect(SocksSocketImpl.java:392) at java.net.Socket.connect(Socket.java:579) at sun.net.NetworkClient.doConnect(NetworkClient.java:175) at sun.net.www.http.HttpClient.openServer(HttpClient.java:432) at sun.net.www.http.HttpClient.openServer(HttpClient.java:527) at sun.net.www.http.HttpClient.<init>(HttpClient.java:211) at sun.net.www.http.HttpClient.New(HttpClient.java:308) at sun.net.www.http.HttpClient.New(HttpClient.java:326) at sun.net.www.protocol.http.HttpURLConnection.getNewHttpClient(HttpURLConnection.java:996) at sun.net.www.protocol.http.HttpURLConnection.plainConnect(HttpURLConnection.java:932) at sun.net.www.protocol.http.HttpURLConnection.connect(HttpURLConnection.java:850) at sun.net.www.protocol.http.HttpURLConnection.getOutputStream(HttpURLConnection.java:1091) at 
com.cloudera.cmf.BasicScmProxy.authenticate(BasicScmProxy.java:188) at com.cloudera.cmf.BasicScmProxy.authenticateAndFetchScmDescriptor(BasicScmProxy.java:301) at com.cloudera.cmf.BasicScmProxy.getScmDescriptor(BasicScmProxy.java:346) at com.cloudera.cmf.BasicScmProxy.getScmDescriptor(BasicScmProxy.java:326) at com.cloudera.cmf.eventcatcher.server.EventCatcherService.main(EventCatcherService.java:100) 2014-11-19 16:03:24,286 WARN com.cloudera.cmf.eventcatcher.server.EventCatcherService: No descriptor fetched from http://ip-10-0-1-1.eu-west-1.compute.internal:7180 on after 1 tries, sleeping... 2014-11-19 16:03:24,421 WARN com.cloudera.cmf.event.publish.EventStorePublisherWithRetry: Failed to publish event: SimpleEvent{attributes={STACKTRACE=[java.net.ConnectException: Connection refused at java.net.PlainSocketImpl.socketConnect(Native Method) at java.net.AbstractPlainSocketImpl.doConnect(AbstractPlainSocketImpl.java:339) at java.net.AbstractPlainSocketImpl.connectToAddress(AbstractPlainSocketImpl.java:200) at java.net.AbstractPlainSocketImpl.connect(AbstractPlainSocketImpl.java:182) at java.net.SocksSocketImpl.connect(SocksSocketImpl.java:392) at java.net.Socket.connect(Socket.java:579) at sun.net.NetworkClient.doConnect(NetworkClient.java:175) at sun.net.www.http.HttpClient.openServer(HttpClient.java:432) at sun.net.www.http.HttpClient.openServer(HttpClient.java:527) at sun.net.www.http.HttpClient.<init>(HttpClient.java:211) at sun.net.www.http.HttpClient.New(HttpClient.java:308) at sun.net.www.http.HttpClient.New(HttpClient.java:326) at sun.net.www.protocol.http.HttpURLConnection.getNewHttpClient(HttpURLConnection.java:996) at sun.net.www.protocol.http.HttpURLConnection.plainConnect(HttpURLConnection.java:932) at sun.net.www.protocol.http.HttpURLConnection.connect(HttpURLConnection.java:850) at sun.net.www.protocol.http.HttpURLConnection.getOutputStream(HttpURLConnection.java:1091) at com.cloudera.cmf.BasicScmProxy.authenticate(BasicScmProxy.java:188) at 
com.cloudera.cmf.BasicScmProxy.authenticateAndFetchScmDescriptor(BasicScmProxy.java:301) at com.cloudera.cmf.BasicScmProxy.getScmDescriptor(BasicScmProxy.java:346) at com.cloudera.cmf.BasicScmProxy.getScmDescriptor(BasicScmProxy.java:326) at com.cloudera.cmf.eventcatcher.server.EventCatcherService.main(EventCatcherService.java:100) ####################FIREHOSE: 2014-11-19 16:03:32,084 WARN com.cloudera.cmon.firehose.Main: No descriptor fetched from http://ip-10-0-1-1.eu-west-1.compute.internal:7180 on after 5 tries, sleeping... 2014-11-19 16:03:34,085 ERROR com.cloudera.cmon.firehose.Main: Could not fetch descriptor after 5 tries, exiting. ####################Postgres: LOG: unexpected EOF on client connection
Another interesting point is the log file names. In CMS they contain the keyword localhost, e.g.:
/var/log/cloudera-scm-eventserver/mgmt-cmf-mgmt-EVENTSERVER-localhost.log.out
But on the filesystem, for some reason, they use the actual IP-based host name: ..."EVENTSERVER-ip-10-0-1-1.eu-west-1.compute.internal.log.out".
I don't know if this is how it should be, but it worked just fine two days ago.
In the worst case, is there a risk of losing data if I reinstall Cloudera Manager and add the existing cluster services to it?
Created 11-19-2014 04:48 PM
Created 11-20-2014 12:52 AM
"It sounds from the log like ServiceMonitor died. Can you answer my previous question about your management roles and whether each of them are running?"
-All roles are running.
"If not, what happens when you restart them?"
-Restart doesn't change anything. All roles report one of the two errors: "connection refused" and "error while getting descriptor" (from web:7180). The code snippet from the previous post includes error messages from:
All is ok in:
/var/log/cloudera-scm-firehose/mgmt-cmf-mgmt-SERVICEMONITOR-localhost.log.out
An error in:
/var/log/cloudera-scm-firehose/mgmt-cmf-mgmt-HOSTMONITOR-localhost.log.out
2014-11-20 07:01:21,528 WARN com.cloudera.cmon.firehose.HMONToSMONHostSubjectRecordPublisher: Failed to send messages to SMON.
java.lang.reflect.UndeclaredThrowableException
    at com.sun.proxy.$Proxy19.writeStatusRecords(Unknown Source)
    at com.cloudera.cmon.firehose.BasicFirehoseClient.writeStatusRecords(BasicFirehoseClient.java:74)
    at com.cloudera.cmon.firehose.HMONToSMONHostSubjectRecordPublisher.processRecords(HMONToSMONHostSubjectRecordPublisher.java:106)
    at com.cloudera.cmon.tstore.leveldb.LDBSubjectRecordStore.write(LDBSubjectRecordStore.java:400)
    at com.cloudera.cmon.kaiser.HMONTestRunner.runHostTestsForSession(HMONTestRunner.java:83)
    at com.cloudera.cmon.kaiser.HMONTestRunner.runTestsForSession(HMONTestRunner.java:65)
    at com.cloudera.cmon.kaiser.BaseTestRunner.runTestsOnAllSubjects(BaseTestRunner.java:148)
    at com.cloudera.cmon.kaiser.KaiserService$KaiserServiceRunner.run(KaiserService.java:179)
    at java.lang.Thread.run(Thread.java:745)
Caused by: org.apache.avro.AvroRemoteException: java.net.ConnectException: Connection refused
    at org.apache.avro.ipc.specific.SpecificRequestor.invoke(SpecificRequestor.java:88)
    ... 9 more
Caused by: java.net.ConnectException: Connection refused
EventServer problem: /var/log/cloudera-scm-eventserver/mgmt-cmf-mgmt-EVENTSERVER-localhost.log.out
2014-11-20 08:33:15,531 ERROR com.cloudera.cmf.eventcatcher.server.EventMetricsPublisher: Could not publish metrics to HMON:
java.lang.reflect.UndeclaredThrowableException
    at com.sun.proxy.$Proxy19.writeMetrics(Unknown Source)
    at com.cloudera.cmon.firehose.BasicFirehoseClient.writeMetrics(BasicFirehoseClient.java:86)
    at com.cloudera.cmf.eventcatcher.server.EventMetricsPublisher.publishToHMON(EventMetricsPublisher.java:173)
    at com.cloudera.cmf.eventcatcher.server.EventMetricsPublisher.run(EventMetricsPublisher.java:103)
    at com.cloudera.enterprise.PeriodicEnterpriseService$UnexceptionablePeriodicRunnable.run(PeriodicEnterpriseService.java:67)
    at java.lang.Thread.run(Thread.java:745)
Caused by: org.apache.avro.AvroRemoteException: java.net.ConnectException: Connection refused
    at org.apache.avro.ipc.specific.SpecificRequestor.invoke(SpecificRequestor.java:88)
    ... 6 more
Caused by: java.net.ConnectException: Connection refused
    at java.net.PlainSocketImpl.socketConnect(Native Method)
What are these SMON and HMON? Are these databases or some service names?
My db.properties has these databases: scm, amon, rman, nav (NO smon or hmon).
I can connect to all databases manually using psql.
"Is this the right CM URL when inside your cluster?"
-yep, I can access the web manager.
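Side note: opening the web UI in a browser does not by itself show that the same host and port are reachable from the node where the management roles run. A minimal sketch of a TCP check that could be run on that node follows. The function name can_connect and the 3-second timeout are my own choices for illustration; the host name and port are the ones from the log lines above.

```python
import socket


def can_connect(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within timeout.

    A refused connection, a timeout, or a name that does not resolve
    all count as "not reachable" and return False.
    """
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # ConnectionRefusedError, socket.timeout and socket.gaierror
        # are all subclasses of OSError.
        return False
```

For example, `can_connect("ip-10-0-1-1.eu-west-1.compute.internal", 7180)` tests the CM URL from the logs, and `can_connect("localhost", 7180)` shows what happens if a role ends up talking to the loopback address instead.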
The biggest issue is that error messages and exceptions are not helpful at all. They do not provide any debug information/traces, but merely say that an "error has occurred".
Created 11-20-2014 02:16 AM
OK, so here is what I did: I reinstalled the Cloudera Manager roles.
Now the monitor works, but gives health issues:
"WARNING hostname ip-10-0-1-171.compute.internal differs from the canonical name localhost".
/etc/hosts contents:
127.0.0.1 localhost
10.0.1.171 ip-10-0-1-171.compute.internal
Is there a problem with my hosts file?
Created 11-20-2014 03:04 AM
Another problem: when trying to download a full log file via Cloudera Manager I get:
HTTP ERROR 502
Problem accessing /cmf/process/all/logs/download. Reason:
Connection refused Could not connect to host.
I thought about debugging Jetty, but it overrides the logging settings:
"-Dlog4j.configuration=file:/etc/cloudera-scm-server/log4j.properties -Dcmf.root.logger=INFO,LOGFILE"
and I don't know where the startup settings are kept.
Created 11-20-2014 11:33 AM
Created 11-20-2014 03:49 PM
Finally, the issue has been solved!
Indeed, Darren, the problem was in the hostname resolution.
We looked at the HOSTS table in the database: instead of an FQDN, one entry was "localhost", and another node pointed to 127.0.0.1!
Changing the values solved the issue. Be aware that an agent restart updates the values again, so /etc/hosts must be set correctly.
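Before restarting an agent, it may help to look at what the host reports for its own name. Below is a minimal sketch using Python's standard socket module. Whether the agent uses exactly these calls is an assumption on my part, and the names identity_problems and local_identity are made up for this example.

```python
import ipaddress
import socket


def identity_problems(fqdn, ip):
    """Return a list of reasons why (fqdn, ip) is a poor host identity."""
    problems = []
    if fqdn.split(".")[0].lower() == "localhost":
        problems.append("FQDN is %r, not a real host name" % fqdn)
    elif "." not in fqdn:
        problems.append("name %r is not fully qualified" % fqdn)
    if ipaddress.ip_address(ip).is_loopback:
        problems.append("name resolves to loopback address %s" % ip)
    return problems


def local_identity():
    """Ask the OS resolver what this host calls itself and where that points."""
    fqdn = socket.getfqdn()
    ip = socket.gethostbyname(fqdn)  # may raise socket.gaierror if unresolvable
    return fqdn, ip
```

Running `identity_problems(*local_identity())` on each node should return an empty list; anything else points at the kind of "localhost" / 127.0.0.1 entries we found in the table.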
The majority of service configurations were corrupted, having "localhost" instead of a remote FQDN.
I don't know how this could have happened out of the blue.
Thanks, Darren 🙂
Case closed!
Gin.
Created 11-20-2014 09:15 PM
One follow-up comment:
The "host inspector" in the "hosts" section of the CM UI is a critical tool for validating cluster configurations; it does a good job of identifying DNS, hostname, and reverse-lookup issues on cluster nodes.
The "loopback" (127.0.0.1) entry in /etc/hosts should contain only "localhost" references and never the host name (or FQDN) value.
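As a rough illustration of the loopback rule, here is a small Python sketch that scans hosts-file text and reports any name on a 127.0.0.1 line that is not a localhost alias. It is not part of Cloudera Manager; the function name loopback_violations is hypothetical, and treating every name that starts with "localhost" (such as localhost.localdomain) as acceptable is an assumption.

```python
def loopback_violations(hosts_text):
    """Return (line_number, name) pairs for non-localhost names on 127.0.0.1 lines."""
    violations = []
    for lineno, raw in enumerate(hosts_text.splitlines(), start=1):
        # Drop trailing comments and surrounding whitespace.
        line = raw.split("#", 1)[0].strip()
        if not line:
            continue
        address, *names = line.split()
        if address != "127.0.0.1":
            continue
        for name in names:
            # Assumption: any alias beginning with "localhost" is fine.
            if not name.lower().startswith("localhost"):
                violations.append((lineno, name))
    return violations
```

On a node it could be called as `loopback_violations(open("/etc/hosts").read())`; an empty list means the loopback line carries only localhost names.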
For reference's sake, this is discussed here: