SCM fails to start: Error getting predicates

avatar
Explorer

Hello,

 

Today CDH 5.1.3 suddenly stopped working. Health monitoring no longer works, but I can still access Cloudera Manager (web). First things first: I took a look at cloudera-scm-server.log, and here is the output:

 

2014-11-19 16:20:24,106  INFO [1310736637@agentServer-0:components.StalenessChecker@69] No staleness check scheduled, scheduling one in 30 seconds
2014-11-19 16:20:32,103 ERROR [WebServerImpl:cmf.TsqueryAutoCompleter@391] Error getting predicates
org.apache.avro.AvroRemoteException: java.net.ConnectException: Connection refused
	at org.apache.avro.ipc.specific.SpecificRequestor.invoke(SpecificRequestor.java:88)
	at com.sun.proxy.$Proxy94.getImpalaFilterMetadata(Unknown Source)
	at com.cloudera.cmf.protocol.firehose.nozzle.TimeoutNozzleIPC.getImpalaFilterMetadata(TimeoutNozzleIPC.java:377)
	at com.cloudera.server.web.cmf.impala.components.ImpalaDao.fetchFilterMetadata(ImpalaDao.java:688)
	at com.cloudera.server.web.cmf.work.AbstractWorkDao.getAndUpdateAutoCompleter(AbstractWorkDao.java:117)
	at com.cloudera.server.web.cmf.TsqueryAutoCompleter.<init>(TsqueryAutoCompleter.java:181)
	at com.cloudera.server.web.cmf.charts.TimeSeriesQueryController.initialize(TimeSeriesQueryController.java:96)
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.lang.reflect.Method.invoke(Method.java:606)
	at org.springframework.beans.factory.annotation.InitDestroyAnnotationBeanPostProcessor$LifecycleElement.invoke(InitDestroyAnnotationBeanPostProcessor.java:340)
	at org.springframework.beans.factory.annotation.InitDestroyAnnotationBeanPostProcessor$LifecycleMetadata.invokeInitMethods(InitDestroyAnnotationBeanPostProcessor.java:293)
	at org.springframework.beans.factory.annotation.InitDestroyAnnotationBeanPostProcessor.postProcessBeforeInitialization(InitDestroyAnnotationBeanPostProcessor.java:130)
	at org.springframework.beans.factory.support.AbstractAutowireCapableBeanFactory.applyBeanPostProcessorsBeforeInitialization(AbstractAutowireCapableBeanFactory.java:394)
	at org.springframework.beans.factory.support.AbstractAutowireCapableBeanFactory.initializeBean(AbstractAutowireCapableBeanFactory.java:1413)
	at org.springframework.beans.factory.support.AbstractAutowireCapableBeanFactory.doCreateBean(AbstractAutowireCapableBeanFactory.java:519)
	at org.springframework.beans.factory.support.AbstractAutowireCapableBeanFactory.createBean(AbstractAutowireCapableBeanFactory.java:456)
	at org.springframework.beans.factory.support.AbstractBeanFactory$1.getObject(AbstractBeanFactory.java:293)
	at org.springframework.beans.factory.support.DefaultSingletonBeanRegistry.getSingleton(DefaultSingletonBeanRegistry.java:222)
	at org.springframework.beans.factory.support.AbstractBeanFactory.doGetBean(AbstractBeanFactory.java:290)
	at org.springframework.beans.factory.support.AbstractBeanFactory.getBean(AbstractBeanFactory.java:192)
	at org.springframework.beans.factory.support.DefaultListableBeanFactory.preInstantiateSingletons(DefaultListableBeanFactory.java:585)
	at org.springframework.context.support.AbstractApplicationContext.finishBeanFactoryInitialization(AbstractApplicationContext.java:895)
	at org.springframework.context.support.AbstractApplicationContext.refresh(AbstractApplicationContext.java:425)
	at org.springframework.web.servlet.FrameworkServlet.createWebApplicationContext(FrameworkServlet.java:467)
	at org.springframework.web.servlet.FrameworkServlet.createWebApplicationContext(FrameworkServlet.java:483)
	at org.springframework.web.servlet.FrameworkServlet.initWebApplicationContext(FrameworkServlet.java:358)
	at org.springframework.web.servlet.FrameworkServlet.initServletBean(FrameworkServlet.java:325)
	at org.springframework.web.servlet.HttpServletBean.init(HttpServletBean.java:127)
	at javax.servlet.GenericServlet.init(GenericServlet.java:241)
	at org.mortbay.jetty.servlet.ServletHolder.initServlet(ServletHolder.java:440)
	at org.mortbay.jetty.servlet.ServletHolder.doStart(ServletHolder.java:263)
	at org.mortbay.component.AbstractLifeCycle.start(AbstractLifeCycle.java:50)
	at org.mortbay.jetty.servlet.ServletHandler.initialize(ServletHandler.java:736)
	at org.mortbay.jetty.servlet.Context.startContext(Context.java:140)
	at org.mortbay.jetty.webapp.WebAppContext.startContext(WebAppContext.java:1282)
	at org.mortbay.jetty.handler.ContextHandler.doStart(ContextHandler.java:518)
	at org.mortbay.jetty.webapp.WebAppContext.doStart(WebAppContext.java:499)
	at org.mortbay.component.AbstractLifeCycle.start(AbstractLifeCycle.java:50)
	at com.cloudera.server.cmf.WebServerImpl.run(WebServerImpl.java:277)
Caused by: java.net.ConnectException: Connection refused
	at java.net.PlainSocketImpl.socketConnect(Native Method)
	at java.net.AbstractPlainSocketImpl.doConnect(AbstractPlainSocketImpl.java:339)
	at java.net.AbstractPlainSocketImpl.connectToAddress(AbstractPlainSocketImpl.java:200)
	at java.net.AbstractPlainSocketImpl.connect(AbstractPlainSocketImpl.java:182)
	at java.net.SocksSocketImpl.connect(SocksSocketImpl.java:392)
	at java.net.Socket.connect(Socket.java:579)
	at sun.net.NetworkClient.doConnect(NetworkClient.java:175)
	at sun.net.www.http.HttpClient.openServer(HttpClient.java:432)
	at sun.net.www.http.HttpClient.openServer(HttpClient.java:527)
	at sun.net.www.http.HttpClient.<init>(HttpClient.java:211)
	at sun.net.www.http.HttpClient.New(HttpClient.java:308)
	at sun.net.www.http.HttpClient.New(HttpClient.java:326)
	at sun.net.www.protocol.http.HttpURLConnection.getNewHttpClient(HttpURLConnection.java:996)
	at sun.net.www.protocol.http.HttpURLConnection.plainConnect(HttpURLConnection.java:932)
	at sun.net.www.protocol.http.HttpURLConnection.connect(HttpURLConnection.java:850)
	at sun.net.www.protocol.http.HttpURLConnection.getOutputStream(HttpURLConnection.java:1091)
	at org.apache.avro.ipc.HttpTransceiver.writeBuffers(HttpTransceiver.java:71)
	at org.apache.avro.ipc.Transceiver.transceive(Transceiver.java:58)
	at org.apache.avro.ipc.Transceiver.transceive(Transceiver.java:72)
	at org.apache.avro.ipc.Requestor.request(Requestor.java:147)
	at org.apache.avro.ipc.Requestor.request(Requestor.java:101)
	at org.apache.avro.ipc.specific.SpecificRequestor.invoke(SpecificRequestor.java:72)
	... 40 more
2014-11-19 16:20:36,014  INFO [JvmPauseMonitor:debug.JvmPauseMonitor@236] Detected pause in JVM or host machine (e.g. a stop the world GC, or JVM not scheduled): paused approximately 1146ms: GC pool 'PS MarkSweep' had collection(s): count=1 time=1418ms, GC pool 'PS Scavenge' had collection(s): count=1 time=209ms
2014-11-19 16:20:36,016  INFO [JvmPauseMonitor:debug.JvmPauseMonitor@236] Detected pause in JVM or host machine (e.g. a stop the world GC, or JVM not scheduled): paused approximately 1545ms: GC pool 'PS MarkSweep' had collection(s): count=1 time=1418ms, GC pool 'PS Scavenge' had collection(s): count=1 time=209ms

So my question is: do you have any clue what might have gone wrong, and where should I start?
Is there a verbosity/debug option?

Initially, the memory settings (heap/non-Java) were ~30% of the recommended values. I have now set them to 100%, but the issue persists.

Your help is much appreciated.
Gin

10 REPLIES

avatar
Hi Gin,

The error is in connecting to the monitoring service. Are all of your Management Service roles up and running? Are there any interesting errors in the logs for those roles? I believe ServiceMonitor is the likely culprit here, but it doesn't hurt to check the others.

Thanks,
Darren

avatar
Explorer

Hi Darren,

 

thanks for the reply.

 

 

Well, there are two main errors: "connection refused" and "error while getting descriptor" (from web:7180):

 

####################AGENT:
[19/Nov/2014 17:35:18 +0000] 3160 MonitorDaemon-Reporter throttling_logger ERROR    (9 skipped) Error sending messages to firehose: mgmt-HOSTMONITOR-cd23de091d9f93f400336360b549bb6a
Traceback (most recent call last):
  File "/usr/lib/cmf/agent/src/cmf/monitor/firehose.py", line 71, in _send
    self._port)
  File "/usr/lib/cmf/agent/build/env/lib/python2.7/site-packages/avro-1.6.3-py2.7.egg/avro/ipc.py", line 464, in __init__
    self.conn.connect()
  File "/usr/lib/python2.7/httplib.py", line 757, in connect
    self.timeout, self.source_address)
  File "/usr/lib/python2.7/socket.py", line 571, in create_connection
    raise err
error: [Errno 111] Connection refused


####################EVENTSERVER:
2014-11-19 16:03:24,265 WARN com.cloudera.cmf.BasicScmProxy: IOException while getting descriptor
java.net.ConnectException: Connection refused
	at java.net.PlainSocketImpl.socketConnect(Native Method)
	at java.net.AbstractPlainSocketImpl.doConnect(AbstractPlainSocketImpl.java:339)
	at java.net.AbstractPlainSocketImpl.connectToAddress(AbstractPlainSocketImpl.java:200)
	at java.net.AbstractPlainSocketImpl.connect(AbstractPlainSocketImpl.java:182)
	at java.net.SocksSocketImpl.connect(SocksSocketImpl.java:392)
	at java.net.Socket.connect(Socket.java:579)
	at sun.net.NetworkClient.doConnect(NetworkClient.java:175)
	at sun.net.www.http.HttpClient.openServer(HttpClient.java:432)
	at sun.net.www.http.HttpClient.openServer(HttpClient.java:527)
	at sun.net.www.http.HttpClient.<init>(HttpClient.java:211)
	at sun.net.www.http.HttpClient.New(HttpClient.java:308)
	at sun.net.www.http.HttpClient.New(HttpClient.java:326)
	at sun.net.www.protocol.http.HttpURLConnection.getNewHttpClient(HttpURLConnection.java:996)
	at sun.net.www.protocol.http.HttpURLConnection.plainConnect(HttpURLConnection.java:932)
	at sun.net.www.protocol.http.HttpURLConnection.connect(HttpURLConnection.java:850)
	at sun.net.www.protocol.http.HttpURLConnection.getOutputStream(HttpURLConnection.java:1091)
	at com.cloudera.cmf.BasicScmProxy.authenticate(BasicScmProxy.java:188)
	at com.cloudera.cmf.BasicScmProxy.authenticateAndFetchScmDescriptor(BasicScmProxy.java:301)
	at com.cloudera.cmf.BasicScmProxy.getScmDescriptor(BasicScmProxy.java:346)
	at com.cloudera.cmf.BasicScmProxy.getScmDescriptor(BasicScmProxy.java:326)
	at com.cloudera.cmf.eventcatcher.server.EventCatcherService.main(EventCatcherService.java:100)
2014-11-19 16:03:24,286 WARN com.cloudera.cmf.eventcatcher.server.EventCatcherService: No descriptor fetched from http://ip-10-0-1-1.eu-west-1.compute.internal:7180 on after 1 tries, sleeping...
2014-11-19 16:03:24,421 WARN com.cloudera.cmf.event.publish.EventStorePublisherWithRetry: Failed to publish event: SimpleEvent{attributes={STACKTRACE=[java.net.ConnectException: Connection refused
	at java.net.PlainSocketImpl.socketConnect(Native Method)
	at java.net.AbstractPlainSocketImpl.doConnect(AbstractPlainSocketImpl.java:339)
	at java.net.AbstractPlainSocketImpl.connectToAddress(AbstractPlainSocketImpl.java:200)
	at java.net.AbstractPlainSocketImpl.connect(AbstractPlainSocketImpl.java:182)
	at java.net.SocksSocketImpl.connect(SocksSocketImpl.java:392)
	at java.net.Socket.connect(Socket.java:579)
	at sun.net.NetworkClient.doConnect(NetworkClient.java:175)
	at sun.net.www.http.HttpClient.openServer(HttpClient.java:432)
	at sun.net.www.http.HttpClient.openServer(HttpClient.java:527)
	at sun.net.www.http.HttpClient.<init>(HttpClient.java:211)
	at sun.net.www.http.HttpClient.New(HttpClient.java:308)
	at sun.net.www.http.HttpClient.New(HttpClient.java:326)
	at sun.net.www.protocol.http.HttpURLConnection.getNewHttpClient(HttpURLConnection.java:996)
	at sun.net.www.protocol.http.HttpURLConnection.plainConnect(HttpURLConnection.java:932)
	at sun.net.www.protocol.http.HttpURLConnection.connect(HttpURLConnection.java:850)
	at sun.net.www.protocol.http.HttpURLConnection.getOutputStream(HttpURLConnection.java:1091)
	at com.cloudera.cmf.BasicScmProxy.authenticate(BasicScmProxy.java:188)
	at com.cloudera.cmf.BasicScmProxy.authenticateAndFetchScmDescriptor(BasicScmProxy.java:301)
	at com.cloudera.cmf.BasicScmProxy.getScmDescriptor(BasicScmProxy.java:346)
	at com.cloudera.cmf.BasicScmProxy.getScmDescriptor(BasicScmProxy.java:326)
	at com.cloudera.cmf.eventcatcher.server.EventCatcherService.main(EventCatcherService.java:100)
	
####################FIREHOSE:
2014-11-19 16:03:32,084 WARN com.cloudera.cmon.firehose.Main: No descriptor fetched from http://ip-10-0-1-1.eu-west-1.compute.internal:7180 on after 5 tries, sleeping...
2014-11-19 16:03:34,085 ERROR com.cloudera.cmon.firehose.Main: Could not fetch descriptor after 5 tries, exiting.

####################Postgres:
LOG:  unexpected EOF on client connection	

 

Another interesting point is the log file names. In CMS they contain the keyword localhost, e.g.:

/var/log/cloudera-scm-eventserver/mgmt-cmf-mgmt-EVENTSERVER-localhost.log.out

But on the filesystem, for some reason, they use the actual IP-based hostname: ..."EVENTSERVER-ip-10-0-1-1.eu-west-1.compute.internal.log.out".

I don't know if this is how it should be, but it worked just fine two days ago.

 

 

 

In the worst case, is there a risk of losing data if I reinstall Cloudera Manager and add the existing cluster services to it?

avatar
Reinstallation is unlikely to fix your issue.

It sounds from the log like ServiceMonitor died. Can you answer my previous question about your management roles and whether each of them is running? If not, what happens when you restart them?

I wonder if something is weird with your networking, where SMon can't talk to CM server, so it can't do its work and connections to SMon therefore fail. Or maybe you just need to restart them.

Is this the right CM URL when inside your cluster?
http://ip-10-0-1-1.eu-west-1.compute.internal:7180

Thanks,
Darren
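A "Connection refused" error usually means that nothing is listening on the target port, which is different from a timeout or a name that does not resolve. The following is a minimal sketch, not part of Cloudera Manager, for telling these cases apart from any cluster node. The function name `probe` and the 3-second timeout are illustrative assumptions.

```python
import socket


def probe(host, port, timeout=3.0):
    """Classify a TCP endpoint as 'open', 'refused', or 'unreachable: <reason>'."""
    try:
        # create_connection resolves the name and performs the TCP handshake.
        with socket.create_connection((host, port), timeout=timeout):
            return "open"
    except ConnectionRefusedError:
        # The host answered, but no process is listening on that port.
        return "refused"
    except OSError as exc:
        # Covers DNS failures, timeouts, and routing problems.
        return f"unreachable: {exc}"
```

For example, `probe("ip-10-0-1-1.eu-west-1.compute.internal", 7180)` checks the CM URL quoted above. The ports of the monitoring roles are not given in this thread, so they would have to be looked up in the role configuration first.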

avatar
Explorer

"It sounds from the log like ServiceMonitor died. Can you answer my previous question about your management roles and whether each of them are running?"

-All roles are running.

 

"If not, what happens when you restart them?"

-Restart doesn't change anything. All roles report one of the two errors: "connection refused" and "error while getting descriptor" (from web:7180). The code snippet from the previous post includes error messages from the agent, EventServer, Firehose, and Postgres logs.

 

All is ok in:

/var/log/cloudera-scm-firehose/mgmt-cmf-mgmt-SERVICEMONITOR-localhost.log.out 

 

An error in: 

/var/log/cloudera-scm-firehose/mgmt-cmf-mgmt-HOSTMONITOR-localhost.log.out

2014-11-20 07:01:21,528 WARN com.cloudera.cmon.firehose.HMONToSMONHostSubjectRecordPublisher: Failed to send messages to SMON.
java.lang.reflect.UndeclaredThrowableException
	at com.sun.proxy.$Proxy19.writeStatusRecords(Unknown Source)
	at com.cloudera.cmon.firehose.BasicFirehoseClient.writeStatusRecords(BasicFirehoseClient.java:74)
	at com.cloudera.cmon.firehose.HMONToSMONHostSubjectRecordPublisher.processRecords(HMONToSMONHostSubjectRecordPublisher.java:106)
	at com.cloudera.cmon.tstore.leveldb.LDBSubjectRecordStore.write(LDBSubjectRecordStore.java:400)
	at com.cloudera.cmon.kaiser.HMONTestRunner.runHostTestsForSession(HMONTestRunner.java:83)
	at com.cloudera.cmon.kaiser.HMONTestRunner.runTestsForSession(HMONTestRunner.java:65)
	at com.cloudera.cmon.kaiser.BaseTestRunner.runTestsOnAllSubjects(BaseTestRunner.java:148)
	at com.cloudera.cmon.kaiser.KaiserService$KaiserServiceRunner.run(KaiserService.java:179)
	at java.lang.Thread.run(Thread.java:745)
Caused by: org.apache.avro.AvroRemoteException: java.net.ConnectException: Connection refused
	at org.apache.avro.ipc.specific.SpecificRequestor.invoke(SpecificRequestor.java:88)
	... 9 more
Caused by: java.net.ConnectException: Connection refused

EventServer problem: /var/log/cloudera-scm-eventserver/mgmt-cmf-mgmt-EVENTSERVER-localhost.log.out

 

2014-11-20 08:33:15,531 ERROR com.cloudera.cmf.eventcatcher.server.EventMetricsPublisher: Could not publish metrics to HMON:
java.lang.reflect.UndeclaredThrowableException
	at com.sun.proxy.$Proxy19.writeMetrics(Unknown Source)
	at com.cloudera.cmon.firehose.BasicFirehoseClient.writeMetrics(BasicFirehoseClient.java:86)
	at com.cloudera.cmf.eventcatcher.server.EventMetricsPublisher.publishToHMON(EventMetricsPublisher.java:173)
	at com.cloudera.cmf.eventcatcher.server.EventMetricsPublisher.run(EventMetricsPublisher.java:103)
	at com.cloudera.enterprise.PeriodicEnterpriseService$UnexceptionablePeriodicRunnable.run(PeriodicEnterpriseService.java:67)
	at java.lang.Thread.run(Thread.java:745)
Caused by: org.apache.avro.AvroRemoteException: java.net.ConnectException: Connection refused
	at org.apache.avro.ipc.specific.SpecificRequestor.invoke(SpecificRequestor.java:88)
	... 6 more
Caused by: java.net.ConnectException: Connection refused
	at java.net.PlainSocketImpl.socketConnect(Native Method)

 

What are SMON and HMON? Are these databases or service names?

My db.properties has these databases: scm, amon, rman, nav (NO smon or hmon).

I can connect to all databases manually using psql.

 

"Is this the right CM URL when inside your cluster?"

-yep, I can access the web manager.

 

The biggest issue is that error messages and exceptions are not helpful at all. They do not provide any debug information/traces, but merely say that an "error has occurred".

avatar
Explorer

OK, so what I have done is reinstall the Cloudera Manager roles.

Now the monitor works, but it reports health issues:

"WARNING  hostname ip-10-0-1-171.compute.internal differs from the canonical name localhost".

 

/etc/hosts contents:

127.0.0.1 localhost

10.0.1.171 ip-10-0-1-171.compute.internal

 

Is there a problem with my hosts file?
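One way to see what the health check is complaining about is to ask the operating system how the machine resolves its own name. The sketch below is an illustration only, not the check Cloudera Manager itself runs. The function names `resolution_report` and `problems` are hypothetical, and the warning texts are my own.

```python
import ipaddress
import socket


def resolution_report(hostname=None):
    """Collect how this machine resolves a hostname (its own by default)."""
    hostname = hostname or socket.gethostname()
    fqdn = socket.getfqdn(hostname)  # canonical name as the resolver sees it
    try:
        address = socket.gethostbyname(hostname)
    except OSError:
        address = None  # the name does not resolve at all
    return {"hostname": hostname, "fqdn": fqdn, "address": address}


def problems(report):
    """Return human-readable warnings for a report built by resolution_report."""
    found = []
    if report["fqdn"].startswith("localhost"):
        found.append("canonical name is a localhost alias")
    address = report["address"]
    if address is None:
        found.append("hostname does not resolve")
    elif ipaddress.ip_address(address).is_loopback:
        found.append("hostname resolves to a loopback address")
    return found
```

Running `problems(resolution_report())` on each node should return an empty list when the hostname maps to a routable address and a proper FQDN.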

avatar
Explorer

Another problem: when trying to download a full log file via Cloudera Manager I get:

HTTP ERROR 502

Problem accessing /cmf/process/all/logs/download. Reason:

 

    Connection refused
Could not connect to host.

I thought about debugging Jetty, but it overrides the logging settings:

"-Dlog4j.configuration=file:/etc/cloudera-scm-server/log4j.properties -Dcmf.root.logger=INFO,LOGFILE"

and I don't know where the startup settings are kept.

avatar
HMON is the Host Monitor role in the Management service. SMON is the Service Monitor role.

That warning about the hostname is likely what is causing these issues. You could try removing the 2nd line of your /etc/hosts file, but frankly I'm not an expert on how that file works.

avatar
Explorer

Finally, the issue has been solved!

 

Indeed, Darren, the problem was in the hostname resolution.

We looked at the HOSTS table in the database: instead of an FQDN, one entry was "localhost", and another node pointed to 127.0.0.1!

Changing the values solved the issue. Be aware that an agent restart updates the values again, so /etc/hosts must be set correctly.

 

The majority of service configurations were corrupted, having "localhost" instead of a remote FQDN.

I don't know how this could have happened out of the blue.

 

Thanks, Darren 🙂

 

Case closed!

Gin.

avatar
Master Collaborator

One follow-up comment:


The "host inspector" in the "hosts" section of the CM UI is a critical tool for validating cluster configurations. It does a good job of identifying DNS/hostname/reverse lookup issues present on cluster nodes.

 

 

The "loopback" (127.0.0.1) entry in /etc/hosts should only contain "localhost" references and never the host name (or fqdn) value.
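The loopback rule above can be checked mechanically. This is a rough sketch under stated assumptions, not a replacement for the host inspector. The function name `loopback_violations` is hypothetical, and the allow-list (names starting with "localhost", plus the common `ip6-localhost` and `ip6-loopback` aliases) is my own guess at what counts as a "localhost" reference.

```python
import ipaddress

# Assumed allow-list of extra loopback aliases; adjust for your distribution.
EXTRA_LOCAL_NAMES = {"ip6-localhost", "ip6-loopback"}


def loopback_violations(hosts_text):
    """Return (line_number, name) pairs where a loopback address maps to a real hostname."""
    violations = []
    for number, line in enumerate(hosts_text.splitlines(), start=1):
        fields = line.split("#", 1)[0].split()  # drop comments, split on whitespace
        if len(fields) < 2:
            continue  # blank line, comment, or address without names
        try:
            address = ipaddress.ip_address(fields[0])
        except ValueError:
            continue  # not a plain IP address; ignore the line
        if not address.is_loopback:
            continue
        for name in fields[1:]:
            lowered = name.lower()
            if not (lowered.startswith("localhost") or lowered in EXTRA_LOCAL_NAMES):
                violations.append((number, name))
    return violations
```

Calling `loopback_violations(open("/etc/hosts").read())` on a node returns an empty list when no hostname or FQDN is attached to a loopback address.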

 

For reference's sake, this is discussed here:  

 

http://www.cloudera.com/content/cloudera/en/documentation/core/latest/topics/cm_ig_cm_requirements.h...