SCM fails to start: Error getting predicates
Labels: Cloudera Manager
Created on 11-19-2014 08:40 AM - last edited on 11-05-2019 08:22 AM by Robert Justice
Hello,
Today CDH5.1.3 suddenly stopped working. Health monitoring no longer works, but I can still access Cloudera Manager (web). First things first: I took a look at cloudera-scm-server.log, and here is the output:
2014-11-19 16:20:24,106 INFO [1310736637@agentServer-0:components.StalenessChecker@69] No staleness check scheduled, scheduling one in 30 seconds
2014-11-19 16:20:32,103 ERROR [WebServerImpl:cmf.TsqueryAutoCompleter@391] Error getting predicates
org.apache.avro.AvroRemoteException: java.net.ConnectException: Connection refused
    at org.apache.avro.ipc.specific.SpecificRequestor.invoke(SpecificRequestor.java:88)
    at com.sun.proxy.$Proxy94.getImpalaFilterMetadata(Unknown Source)
    at com.cloudera.cmf.protocol.firehose.nozzle.TimeoutNozzleIPC.getImpalaFilterMetadata(TimeoutNozzleIPC.java:377)
    at com.cloudera.server.web.cmf.impala.components.ImpalaDao.fetchFilterMetadata(ImpalaDao.java:688)
    at com.cloudera.server.web.cmf.work.AbstractWorkDao.getAndUpdateAutoCompleter(AbstractWorkDao.java:117)
    at com.cloudera.server.web.cmf.TsqueryAutoCompleter.<init>(TsqueryAutoCompleter.java:181)
    at com.cloudera.server.web.cmf.charts.TimeSeriesQueryController.initialize(TimeSeriesQueryController.java:96)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:606)
    at org.springframework.beans.factory.annotation.InitDestroyAnnotationBeanPostProcessor$LifecycleElement.invoke(InitDestroyAnnotationBeanPostProcessor.java:340)
    at org.springframework.beans.factory.annotation.InitDestroyAnnotationBeanPostProcessor$LifecycleMetadata.invokeInitMethods(InitDestroyAnnotationBeanPostProcessor.java:293)
    at org.springframework.beans.factory.annotation.InitDestroyAnnotationBeanPostProcessor.postProcessBeforeInitialization(InitDestroyAnnotationBeanPostProcessor.java:130)
    at org.springframework.beans.factory.support.AbstractAutowireCapableBeanFactory.applyBeanPostProcessorsBeforeInitialization(AbstractAutowireCapableBeanFactory.java:394)
    at org.springframework.beans.factory.support.AbstractAutowireCapableBeanFactory.initializeBean(AbstractAutowireCapableBeanFactory.java:1413)
    at org.springframework.beans.factory.support.AbstractAutowireCapableBeanFactory.doCreateBean(AbstractAutowireCapableBeanFactory.java:519)
    at org.springframework.beans.factory.support.AbstractAutowireCapableBeanFactory.createBean(AbstractAutowireCapableBeanFactory.java:456)
    at org.springframework.beans.factory.support.AbstractBeanFactory$1.getObject(AbstractBeanFactory.java:293)
    at org.springframework.beans.factory.support.DefaultSingletonBeanRegistry.getSingleton(DefaultSingletonBeanRegistry.java:222)
    at org.springframework.beans.factory.support.AbstractBeanFactory.doGetBean(AbstractBeanFactory.java:290)
    at org.springframework.beans.factory.support.AbstractBeanFactory.getBean(AbstractBeanFactory.java:192)
    at org.springframework.beans.factory.support.DefaultListableBeanFactory.preInstantiateSingletons(DefaultListableBeanFactory.java:585)
    at org.springframework.context.support.AbstractApplicationContext.finishBeanFactoryInitialization(AbstractApplicationContext.java:895)
    at org.springframework.context.support.AbstractApplicationContext.refresh(AbstractApplicationContext.java:425)
    at org.springframework.web.servlet.FrameworkServlet.createWebApplicationContext(FrameworkServlet.java:467)
    at org.springframework.web.servlet.FrameworkServlet.createWebApplicationContext(FrameworkServlet.java:483)
    at org.springframework.web.servlet.FrameworkServlet.initWebApplicationContext(FrameworkServlet.java:358)
    at org.springframework.web.servlet.FrameworkServlet.initServletBean(FrameworkServlet.java:325)
    at org.springframework.web.servlet.HttpServletBean.init(HttpServletBean.java:127)
    at javax.servlet.GenericServlet.init(GenericServlet.java:241)
    at org.mortbay.jetty.servlet.ServletHolder.initServlet(ServletHolder.java:440)
    at org.mortbay.jetty.servlet.ServletHolder.doStart(ServletHolder.java:263)
    at org.mortbay.component.AbstractLifeCycle.start(AbstractLifeCycle.java:50)
    at org.mortbay.jetty.servlet.ServletHandler.initialize(ServletHandler.java:736)
    at org.mortbay.jetty.servlet.Context.startContext(Context.java:140)
    at org.mortbay.jetty.webapp.WebAppContext.startContext(WebAppContext.java:1282)
    at org.mortbay.jetty.handler.ContextHandler.doStart(ContextHandler.java:518)
    at org.mortbay.jetty.webapp.WebAppContext.doStart(WebAppContext.java:499)
    at org.mortbay.component.AbstractLifeCycle.start(AbstractLifeCycle.java:50)
    at com.cloudera.server.cmf.WebServerImpl.run(WebServerImpl.java:277)
Caused by: java.net.ConnectException: Connection refused
    at java.net.PlainSocketImpl.socketConnect(Native Method)
    at java.net.AbstractPlainSocketImpl.doConnect(AbstractPlainSocketImpl.java:339)
    at java.net.AbstractPlainSocketImpl.connectToAddress(AbstractPlainSocketImpl.java:200)
    at java.net.AbstractPlainSocketImpl.connect(AbstractPlainSocketImpl.java:182)
    at java.net.SocksSocketImpl.connect(SocksSocketImpl.java:392)
    at java.net.Socket.connect(Socket.java:579)
    at sun.net.NetworkClient.doConnect(NetworkClient.java:175)
    at sun.net.www.http.HttpClient.openServer(HttpClient.java:432)
    at sun.net.www.http.HttpClient.openServer(HttpClient.java:527)
    at sun.net.www.http.HttpClient.<init>(HttpClient.java:211)
    at sun.net.www.http.HttpClient.New(HttpClient.java:308)
    at sun.net.www.http.HttpClient.New(HttpClient.java:326)
    at sun.net.www.protocol.http.HttpURLConnection.getNewHttpClient(HttpURLConnection.java:996)
    at sun.net.www.protocol.http.HttpURLConnection.plainConnect(HttpURLConnection.java:932)
    at sun.net.www.protocol.http.HttpURLConnection.connect(HttpURLConnection.java:850)
    at sun.net.www.protocol.http.HttpURLConnection.getOutputStream(HttpURLConnection.java:1091)
    at org.apache.avro.ipc.HttpTransceiver.writeBuffers(HttpTransceiver.java:71)
    at org.apache.avro.ipc.Transceiver.transceive(Transceiver.java:58)
    at org.apache.avro.ipc.Transceiver.transceive(Transceiver.java:72)
    at org.apache.avro.ipc.Requestor.request(Requestor.java:147)
    at org.apache.avro.ipc.Requestor.request(Requestor.java:101)
    at org.apache.avro.ipc.specific.SpecificRequestor.invoke(SpecificRequestor.java:72)
    ... 40 more
2014-11-19 16:20:36,014 INFO [JvmPauseMonitor:debug.JvmPauseMonitor@236] Detected pause in JVM or host machine (e.g. a stop the world GC, or JVM not scheduled): paused approximately 1146ms: GC pool 'PS MarkSweep' had collection(s): count=1 time=1418ms, GC pool 'PS Scavenge' had collection(s): count=1 time=209ms
2014-11-19 16:20:36,016 INFO [JvmPauseMonitor:debug.JvmPauseMonitor@236] Detected pause in JVM or host machine (e.g. a stop the world GC, or JVM not scheduled): paused approximately 1545ms: GC pool 'PS MarkSweep' had collection(s): count=1 time=1418ms, GC pool 'PS Scavenge' had collection(s): count=1 time=209ms
So my question is - do you have any clue what might have gone wrong and where do I start?
Is there a verbosity/debug option?
Initially, the memory settings (heap/non-Java) were ~30% of the recommended values. I have now set them to 100%, but the issue persists.
Your help is much appreciated.
Gin
Created 11-19-2014 11:46 AM
The error is in connecting to the monitoring service. Are all of your Management Service roles up and running? Any interesting errors in the logs for those roles? I believe ServiceMonitor is the likely culprit here, but doesn't hurt to check the others.
Thanks,
Darren
Created 11-19-2014 02:04 PM
Hi Darren,
thanks for the reply.
Well, there are two main errors: "connection refused" and "error while getting descriptor" (from web:7180):
####################AGENT:
[19/Nov/2014 17:35:18 +0000] 3160 MonitorDaemon-Reporter throttling_logger ERROR (9 skipped) Error sending messages to firehose: mgmt-HOSTMONITOR-cd23de091d9f93f400336360b549bb6a
Traceback (most recent call last):
  File "/usr/lib/cmf/agent/src/cmf/monitor/firehose.py", line 71, in _send
    self._port)
  File "/usr/lib/cmf/agent/build/env/lib/python2.7/site-packages/avro-1.6.3-py2.7.egg/avro/ipc.py", line 464, in __init__
    self.conn.connect()
  File "/usr/lib/python2.7/httplib.py", line 757, in connect
    self.timeout, self.source_address)
  File "/usr/lib/python2.7/socket.py", line 571, in create_connection
    raise err
error: [Errno 111] Connection refused
####################EVENTSERVER:
2014-11-19 16:03:24,265 WARN com.cloudera.cmf.BasicScmProxy: IOException while getting descriptor
java.net.ConnectException: Connection refused
    at java.net.PlainSocketImpl.socketConnect(Native Method)
    at java.net.AbstractPlainSocketImpl.doConnect(AbstractPlainSocketImpl.java:339)
    at java.net.AbstractPlainSocketImpl.connectToAddress(AbstractPlainSocketImpl.java:200)
    at java.net.AbstractPlainSocketImpl.connect(AbstractPlainSocketImpl.java:182)
    at java.net.SocksSocketImpl.connect(SocksSocketImpl.java:392)
    at java.net.Socket.connect(Socket.java:579)
    at sun.net.NetworkClient.doConnect(NetworkClient.java:175)
    at sun.net.www.http.HttpClient.openServer(HttpClient.java:432)
    at sun.net.www.http.HttpClient.openServer(HttpClient.java:527)
    at sun.net.www.http.HttpClient.<init>(HttpClient.java:211)
    at sun.net.www.http.HttpClient.New(HttpClient.java:308)
    at sun.net.www.http.HttpClient.New(HttpClient.java:326)
    at sun.net.www.protocol.http.HttpURLConnection.getNewHttpClient(HttpURLConnection.java:996)
    at sun.net.www.protocol.http.HttpURLConnection.plainConnect(HttpURLConnection.java:932)
    at sun.net.www.protocol.http.HttpURLConnection.connect(HttpURLConnection.java:850)
    at sun.net.www.protocol.http.HttpURLConnection.getOutputStream(HttpURLConnection.java:1091)
    at com.cloudera.cmf.BasicScmProxy.authenticate(BasicScmProxy.java:188)
    at com.cloudera.cmf.BasicScmProxy.authenticateAndFetchScmDescriptor(BasicScmProxy.java:301)
    at com.cloudera.cmf.BasicScmProxy.getScmDescriptor(BasicScmProxy.java:346)
    at com.cloudera.cmf.BasicScmProxy.getScmDescriptor(BasicScmProxy.java:326)
    at com.cloudera.cmf.eventcatcher.server.EventCatcherService.main(EventCatcherService.java:100)
2014-11-19 16:03:24,286 WARN com.cloudera.cmf.eventcatcher.server.EventCatcherService: No descriptor fetched from http://ip-10-0-1-1.eu-west-1.compute.internal:7180 on after 1 tries, sleeping...
2014-11-19 16:03:24,421 WARN com.cloudera.cmf.event.publish.EventStorePublisherWithRetry: Failed to publish event: SimpleEvent{attributes={STACKTRACE=[java.net.ConnectException: Connection refused
    at java.net.PlainSocketImpl.socketConnect(Native Method)
    at java.net.AbstractPlainSocketImpl.doConnect(AbstractPlainSocketImpl.java:339)
    at java.net.AbstractPlainSocketImpl.connectToAddress(AbstractPlainSocketImpl.java:200)
    at java.net.AbstractPlainSocketImpl.connect(AbstractPlainSocketImpl.java:182)
    at java.net.SocksSocketImpl.connect(SocksSocketImpl.java:392)
    at java.net.Socket.connect(Socket.java:579)
    at sun.net.NetworkClient.doConnect(NetworkClient.java:175)
    at sun.net.www.http.HttpClient.openServer(HttpClient.java:432)
    at sun.net.www.http.HttpClient.openServer(HttpClient.java:527)
    at sun.net.www.http.HttpClient.<init>(HttpClient.java:211)
    at sun.net.www.http.HttpClient.New(HttpClient.java:308)
    at sun.net.www.http.HttpClient.New(HttpClient.java:326)
    at sun.net.www.protocol.http.HttpURLConnection.getNewHttpClient(HttpURLConnection.java:996)
    at sun.net.www.protocol.http.HttpURLConnection.plainConnect(HttpURLConnection.java:932)
    at sun.net.www.protocol.http.HttpURLConnection.connect(HttpURLConnection.java:850)
    at sun.net.www.protocol.http.HttpURLConnection.getOutputStream(HttpURLConnection.java:1091)
    at com.cloudera.cmf.BasicScmProxy.authenticate(BasicScmProxy.java:188)
    at com.cloudera.cmf.BasicScmProxy.authenticateAndFetchScmDescriptor(BasicScmProxy.java:301)
    at com.cloudera.cmf.BasicScmProxy.getScmDescriptor(BasicScmProxy.java:346)
    at com.cloudera.cmf.BasicScmProxy.getScmDescriptor(BasicScmProxy.java:326)
    at com.cloudera.cmf.eventcatcher.server.EventCatcherService.main(EventCatcherService.java:100)
####################FIREHOSE:
2014-11-19 16:03:32,084 WARN com.cloudera.cmon.firehose.Main: No descriptor fetched from http://ip-10-0-1-1.eu-west-1.compute.internal:7180 on after 5 tries, sleeping...
2014-11-19 16:03:34,085 ERROR com.cloudera.cmon.firehose.Main: Could not fetch descriptor after 5 tries, exiting.
####################Postgres:
LOG: unexpected EOF on client connection
Another interesting point is the log file names. In CMS they contain the keyword localhost, e.g.:
/var/log/cloudera-scm-eventserver/mgmt-cmf-mgmt-EVENTSERVER-localhost.log.out
But on the filesystem, for some reason, they use the actual IP-based hostname: ..."EVENTSERVER-ip-10-0-1-1.eu-west-1.compute.internal.log.out".
Don't know if this is how it should be, but it worked just fine two days ago.
In the worst case, is there a risk of losing data if I reinstall Cloudera Manager and add the existing cluster services to it?
Created 11-19-2014 04:48 PM
It sounds from the log like ServiceMonitor died. Can you answer my previous question about your management roles and whether each of them are running? If not, what happens when you restart them?
I wonder if something is weird with your networking, where SMon can't talk to CM server, so it can't do its work and connections to SMon therefore fail. Or maybe you just need to restart them.
Is this the right CM URL when inside your cluster?
http://ip-10-0-1-1.eu-west-1.compute.internal:7180
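A quick way to tell "connection refused" apart from a name-resolution or routing problem is to try a plain TCP connection to that host and port from each node running a management role. The following is only a minimal sketch in Python; the helper names `endpoint` and `can_connect` are made up for illustration and are not part of Cloudera Manager.

```python
import socket
from urllib.parse import urlparse


def endpoint(url):
    """Split a URL into (host, port), defaulting the port from the scheme."""
    parsed = urlparse(url)
    port = parsed.port or (443 if parsed.scheme == "https" else 80)
    return parsed.hostname, port


def can_connect(url, timeout=3.0):
    """Return True if a TCP connection to the URL's host and port succeeds."""
    try:
        with socket.create_connection(endpoint(url), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and failed name lookups.
        return False


if __name__ == "__main__":
    # URL taken from the log lines quoted in this thread.
    url = "http://ip-10-0-1-1.eu-west-1.compute.internal:7180"
    print(url, "reachable" if can_connect(url) else "NOT reachable")
```

If this reports the URL as not reachable on a node where the browser can still open the web UI from elsewhere, the hostname is probably resolving to a different address on that node.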
Thanks,
Darren
Created 11-20-2014 12:52 AM
"It sounds from the log like ServiceMonitor died. Can you answer my previous question about your management roles and whether each of them are running?"
-All roles are running.
"If not, what happens when you restart them?"
-Restart doesn't change anything. All roles report one of the two errors: "connection refused" and "error while getting descriptor" (from web:7180). The code snippet from the previous post includes error messages from:
All is ok in:
/var/log/cloudera-scm-firehose/mgmt-cmf-mgmt-SERVICEMONITOR-localhost.log.out
An error in:
/var/log/cloudera-scm-firehose/mgmt-cmf-mgmt-HOSTMONITOR-localhost.log.out
2014-11-20 07:01:21,528 WARN com.cloudera.cmon.firehose.HMONToSMONHostSubjectRecordPublisher: Failed to send messages to SMON.
java.lang.reflect.UndeclaredThrowableException
    at com.sun.proxy.$Proxy19.writeStatusRecords(Unknown Source)
    at com.cloudera.cmon.firehose.BasicFirehoseClient.writeStatusRecords(BasicFirehoseClient.java:74)
    at com.cloudera.cmon.firehose.HMONToSMONHostSubjectRecordPublisher.processRecords(HMONToSMONHostSubjectRecordPublisher.java:106)
    at com.cloudera.cmon.tstore.leveldb.LDBSubjectRecordStore.write(LDBSubjectRecordStore.java:400)
    at com.cloudera.cmon.kaiser.HMONTestRunner.runHostTestsForSession(HMONTestRunner.java:83)
    at com.cloudera.cmon.kaiser.HMONTestRunner.runTestsForSession(HMONTestRunner.java:65)
    at com.cloudera.cmon.kaiser.BaseTestRunner.runTestsOnAllSubjects(BaseTestRunner.java:148)
    at com.cloudera.cmon.kaiser.KaiserService$KaiserServiceRunner.run(KaiserService.java:179)
    at java.lang.Thread.run(Thread.java:745)
Caused by: org.apache.avro.AvroRemoteException: java.net.ConnectException: Connection refused
    at org.apache.avro.ipc.specific.SpecificRequestor.invoke(SpecificRequestor.java:88)
    ... 9 more
Caused by: java.net.ConnectException: Connection refused
EventServer problem: /var/log/cloudera-scm-eventserver/mgmt-cmf-mgmt-EVENTSERVER-localhost.log.out
2014-11-20 08:33:15,531 ERROR com.cloudera.cmf.eventcatcher.server.EventMetricsPublisher: Could not publish metrics to HMON:
java.lang.reflect.UndeclaredThrowableException
    at com.sun.proxy.$Proxy19.writeMetrics(Unknown Source)
    at com.cloudera.cmon.firehose.BasicFirehoseClient.writeMetrics(BasicFirehoseClient.java:86)
    at com.cloudera.cmf.eventcatcher.server.EventMetricsPublisher.publishToHMON(EventMetricsPublisher.java:173)
    at com.cloudera.cmf.eventcatcher.server.EventMetricsPublisher.run(EventMetricsPublisher.java:103)
    at com.cloudera.enterprise.PeriodicEnterpriseService$UnexceptionablePeriodicRunnable.run(PeriodicEnterpriseService.java:67)
    at java.lang.Thread.run(Thread.java:745)
Caused by: org.apache.avro.AvroRemoteException: java.net.ConnectException: Connection refused
    at org.apache.avro.ipc.specific.SpecificRequestor.invoke(SpecificRequestor.java:88)
    ... 6 more
Caused by: java.net.ConnectException: Connection refused
    at java.net.PlainSocketImpl.socketConnect(Native Method)
What are SMON and HMON? Are these databases or some service names?
My db.properties has these databases: scm, amon, rman, nav (NO smon or hmon).
I can connect to all databases manually using psql.
"Is this the right CM URL when inside your cluster?"
-yep, I can access the web manager.
The biggest issue is that error messages and exceptions are not helpful at all. They do not provide any debug information/traces, but merely say that an "error has occurred".
Created 11-20-2014 02:16 AM
OK, so what I have done is reinstall the Cloudera Manager roles.
Now the monitoring works, but it reports health issues:
"WARNING hostname ip-10-0-1-171.compute.internal differs from the canonical name localhost".
/etc/hosts contents:
127.0.0.1 localhost
10.0.1.171 ip-10-0-1-171.compute.internal
Is there a problem with my hosts file?
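One way to see what the machine itself thinks its canonical name is, independent of Cloudera Manager, is to ask the resolver directly. The following is only a rough sketch in Python; `check_hostname` is a hypothetical helper, and the rule "canonical name must not be localhost and the name must not resolve to 127.x" is an assumption drawn from the warning text above, not from the health check's actual implementation.

```python
import socket


def check_hostname(hostname=None, getfqdn=socket.getfqdn,
                   resolve=socket.gethostbyname):
    """Return a list of problems found in this host's name resolution.

    The getfqdn and resolve parameters exist only so the check can be
    exercised with fake resolvers; by default the system resolver is used.
    """
    name = hostname or socket.gethostname()
    problems = []
    fqdn = getfqdn(name)
    if fqdn.split(".")[0] == "localhost":
        problems.append(f"canonical name of {name!r} is {fqdn!r}")
    try:
        addr = resolve(name)
    except OSError as exc:  # socket.gaierror is a subclass of OSError
        problems.append(f"{name!r} does not resolve: {exc}")
    else:
        if addr.startswith("127."):
            problems.append(f"{name!r} resolves to loopback address {addr}")
    return problems


if __name__ == "__main__":
    for line in check_hostname() or ["hostname resolution looks consistent"]:
        print(line)
```

Running it on each node shows quickly whether the node's own hostname maps back to localhost or to a loopback address.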
Created 11-20-2014 03:04 AM
Another problem: when trying to download a full log file via Cloudera Manager I get:
HTTP ERROR 502
Problem accessing /cmf/process/all/logs/download. Reason:
Connection refused Could not connect to host.
I thought of debugging Jetty, but it overrides the logging settings:
"-Dlog4j.configuration=file:/etc/cloudera-scm-server/log4j.properties -Dcmf.root.logger=INFO,LOGFILE"
and I don't know where the startup settings are kept.
Created 11-20-2014 11:33 AM
That warning about the hostname is likely what is causing these issues. You could try removing the 2nd line of your /etc/hosts file, but frankly I'm not an expert on how that file works.
Created 11-20-2014 03:49 PM
Finally, the issue has been solved!
Indeed, Darren, the problem was in the hostname resolution.
We looked at the HOSTS table in the database: instead of an FQDN, one entry was "localhost", and another node pointed to 127.0.0.1!
Changing the values solved the issue. Be aware that an agent restart updates the values again, so /etc/hosts must be set correctly.
The majority of service configurations were corrupted, having "localhost" instead of a remote FQDN.
I don't know how this could have happened out of the blue.
Thanks, Darren 🙂
Case closed!
Gin.
Created 11-20-2014 09:15 PM
One follow-up comment:
The "host inspector" in the "hosts" section of the CM UI is a critical tool for validating cluster configurations. It does a good job of identifying DNS/hostname/reverse-lookup issues on cluster nodes.
The "loopback" (127.0.0.1) entry in /etc/hosts should only contain "localhost" references and never the host name (or fqdn) value.
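The loopback rule can be checked mechanically. The following is only a minimal sketch in Python that scans hosts-file text and reports any name on a loopback line that is not a localhost alias; the function name `loopback_violations` and the set of accepted aliases are assumptions made for illustration.

```python
# Names that are normally fine on a loopback line (an assumed list).
LOOPBACK_ALIASES = {
    "localhost", "localhost.localdomain",
    "localhost4", "localhost4.localdomain4",
    "localhost6", "localhost6.localdomain6",
    "ip6-localhost", "ip6-loopback",
}


def loopback_violations(hosts_text):
    """Return (line_number, name) pairs for real hostnames on loopback lines."""
    found = []
    for lineno, raw in enumerate(hosts_text.splitlines(), start=1):
        line = raw.split("#", 1)[0].strip()  # drop comments and blank lines
        if not line:
            continue
        address, *names = line.split()
        if address.startswith("127.") or address == "::1":
            found.extend((lineno, n) for n in names
                         if n not in LOOPBACK_ALIASES)
    return found


if __name__ == "__main__":
    with open("/etc/hosts") as fh:
        for lineno, name in loopback_violations(fh.read()):
            print(f"/etc/hosts line {lineno}: {name} is mapped to loopback")
```

The hosts file quoted earlier in this thread passes this check, which fits the finding that the bad "localhost" values were sitting in the database rather than in that file.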
For reference's sake, this is discussed here: