After system reboot, Cloudera Management Service roles (Event Server, Host Monitor, Service Monitor) fail to start

Rising Star

Hi

 

I have installed CDH 5.3.3 successfully on Ubuntu 14.04, but when I reboot my Ubuntu system, the Cloudera Management Service roles (Event Server, Host Monitor, Service Monitor) fail to start.

 

They show the following errors:

 

Event Server :

 

12:37:40.957 PM WARN com.cloudera.cmf.event.publish.EventStorePublisherWithRetry
Failed to publish event: SimpleEvent{attributes={STACKTRACE=[java.net.ConnectException: Connection refused
at java.net.PlainSocketImpl.socketConnect(Native Method)
at java.net.AbstractPlainSocketImpl.doConnect(AbstractPlainSocketImpl.java:339)
at java.net.AbstractPlainSocketImpl.connectToAddress(AbstractPlainSocketImpl.java:200)
at java.net.AbstractPlainSocketImpl.connect(AbstractPlainSocketImpl.java:182)
at java.net.SocksSocketImpl.connect(SocksSocketImpl.java:392)
at java.net.Socket.connect(Socket.java:579)
at sun.net.NetworkClient.doConnect(NetworkClient.java:175)
at sun.net.www.http.HttpClient.openServer(HttpClient.java:432)
at sun.net.www.http.HttpClient.openServer(HttpClient.java:527)
at sun.net.www.http.HttpClient.<init>(HttpClient.java:211)
at sun.net.www.http.HttpClient.New(HttpClient.java:308)
at sun.net.www.http.HttpClient.New(HttpClient.java:326)
at sun.net.www.protocol.http.HttpURLConnection.getNewHttpClient(HttpURLConnection.java:996)
at sun.net.www.protocol.http.HttpURLConnection.plainConnect(HttpURLConnection.java:932)
at sun.net.www.protocol.http.HttpURLConnection.connect(HttpURLConnection.java:850)
at sun.net.www.protocol.http.HttpURLConnection.getOutputStream(HttpURLConnection.java:1091)
at com.cloudera.cmf.BasicScmProxy.authenticate(BasicScmProxy.java:188)
at com.cloudera.cmf.BasicScmProxy.authenticateAndFetchScmDescriptor(BasicScmProxy.java:301)
at com.cloudera.cmf.BasicScmProxy.getScmDescriptor(BasicScmProxy.java:346)
at com.cloudera.cmf.BasicScmProxy.getScmDescriptor(BasicScmProxy.java:326)
at com.cloudera.cmf.eventcatcher.server.EventCatcherService.main(EventCatcherService.java:100)
], EXCEPTION_TYPES=[java.net.ConnectException], ROLE=[mgmt-EVENTSERVER-e8c92ccecb4376455a55563353303d3f], SEVERITY=[IMPORTANT], SERVICE=[mgmt], HOST_IDS=[0001a5a1-846a-4022-b4c6-a204abd12813], LOG_LEVEL=[WARN], ROLE_TYPE=[EVENTSERVER], CATEGORY=[LOG_MESSAGE], SERVICE_TYPE=[MGMT], HOSTS=[master.novalocal], EVENTCODE=[EV_LOG_EVENT]}, content=IOException while getting descriptor, timestamp=1429645060761}
12:37:42.831 PM WARN com.cloudera.cmf.eventcatcher.server.EventCatcherService
No descriptor fetched from http://master.novalocal:7180 on after 2 tries, sleeping...
12:37:44.833 PM WARN com.cloudera.cmf.eventcatcher.server.EventCatcherService
No descriptor fetched from http://master.novalocal:7180 on after 3 tries, sleeping...
12:37:46.837 PM WARN com.cloudera.cmf.eventcatcher.server.EventCatcherService
No descriptor fetched from http://master.novalocal:7180 on after 4 tries, sleeping...
12:37:48.838 PM WARN com.cloudera.cmf.eventcatcher.server.EventCatcherService
No descriptor fetched from http://master.novalocal:7180 on after 5 tries, sleeping...
12:37:50.838 PM ERROR com.cloudera.cmf.eventcatcher.server.EventCatcherService
Could not fetch descriptor after 5 tries, exiting.

 

 

Service Monitor :

 

12:37:39.548 PM WARN com.cloudera.cmf.event.publish.EventStorePublisherWithRetry
Failed to publish event: SimpleEvent{attributes={STACKTRACE=[java.net.ConnectException: Connection refused
at java.net.PlainSocketImpl.socketConnect(Native Method)
at java.net.AbstractPlainSocketImpl.doConnect(AbstractPlainSocketImpl.java:339)
at java.net.AbstractPlainSocketImpl.connectToAddress(AbstractPlainSocketImpl.java:200)
at java.net.AbstractPlainSocketImpl.connect(AbstractPlainSocketImpl.java:182)
at java.net.SocksSocketImpl.connect(SocksSocketImpl.java:392)
at java.net.Socket.connect(Socket.java:579)
at sun.net.NetworkClient.doConnect(NetworkClient.java:175)
at sun.net.www.http.HttpClient.openServer(HttpClient.java:432)
at sun.net.www.http.HttpClient.openServer(HttpClient.java:527)
at sun.net.www.http.HttpClient.<init>(HttpClient.java:211)
at sun.net.www.http.HttpClient.New(HttpClient.java:308)
at sun.net.www.http.HttpClient.New(HttpClient.java:326)
at sun.net.www.protocol.http.HttpURLConnection.getNewHttpClient(HttpURLConnection.java:996)
at sun.net.www.protocol.http.HttpURLConnection.plainConnect(HttpURLConnection.java:932)
at sun.net.www.protocol.http.HttpURLConnection.connect(HttpURLConnection.java:850)
at sun.net.www.protocol.http.HttpURLConnection.getOutputStream(HttpURLConnection.java:1091)
at com.cloudera.cmf.BasicScmProxy.authenticate(BasicScmProxy.java:188)
at com.cloudera.cmf.BasicScmProxy.authenticateAndFetchScmDescriptor(BasicScmProxy.java:301)
at com.cloudera.cmf.BasicScmProxy.getScmDescriptor(BasicScmProxy.java:346)
at com.cloudera.cmon.firehose.Main.main(Main.java:374)
], EXCEPTION_TYPES=[java.net.ConnectException], ROLE=[mgmt-SERVICEMONITOR-e8c92ccecb4376455a55563353303d3f], SEVERITY=[IMPORTANT], SERVICE=[mgmt], HOST_IDS=[0001a5a1-846a-4022-b4c6-a204abd12813], LOG_LEVEL=[WARN], ROLE_TYPE=[SERVICEMONITOR], CATEGORY=[LOG_MESSAGE], SERVICE_TYPE=[MGMT], HOSTS=[master.novalocal], EVENTCODE=[EV_LOG_EVENT]}, content=IOException while getting descriptor, timestamp=1429645059382}
12:37:41.451 PM WARN com.cloudera.cmon.firehose.Main
No descriptor fetched from http://master.novalocal:7180 on after 2 tries, sleeping...
12:37:43.452 PM WARN com.cloudera.cmon.firehose.Main
No descriptor fetched from http://master.novalocal:7180 on after 3 tries, sleeping...
12:37:45.454 PM WARN com.cloudera.cmon.firehose.Main
No descriptor fetched from http://master.novalocal:7180 on after 4 tries, sleeping...
12:37:47.456 PM WARN com.cloudera.cmon.firehose.Main
No descriptor fetched from http://master.novalocal:7180 on after 5 tries, sleeping...
12:37:49.456 PM ERROR com.cloudera.cmon.firehose.Main
Could not fetch descriptor after 5 tries, exiting.

 

 

 

 

Host Monitor :

 


12:37:40.676 PM WARN com.cloudera.cmf.event.publish.EventStorePublisherWithRetry
Failed to publish event: SimpleEvent{attributes={STACKTRACE=[java.net.ConnectException: Connection refused
at java.net.PlainSocketImpl.socketConnect(Native Method)
at java.net.AbstractPlainSocketImpl.doConnect(AbstractPlainSocketImpl.java:339)
at java.net.AbstractPlainSocketImpl.connectToAddress(AbstractPlainSocketImpl.java:200)
at java.net.AbstractPlainSocketImpl.connect(AbstractPlainSocketImpl.java:182)
at java.net.SocksSocketImpl.connect(SocksSocketImpl.java:392)
at java.net.Socket.connect(Socket.java:579)
at sun.net.NetworkClient.doConnect(NetworkClient.java:175)
at sun.net.www.http.HttpClient.openServer(HttpClient.java:432)
at sun.net.www.http.HttpClient.openServer(HttpClient.java:527)
at sun.net.www.http.HttpClient.<init>(HttpClient.java:211)
at sun.net.www.http.HttpClient.New(HttpClient.java:308)
at sun.net.www.http.HttpClient.New(HttpClient.java:326)
at sun.net.www.protocol.http.HttpURLConnection.getNewHttpClient(HttpURLConnection.java:996)
at sun.net.www.protocol.http.HttpURLConnection.plainConnect(HttpURLConnection.java:932)
at sun.net.www.protocol.http.HttpURLConnection.connect(HttpURLConnection.java:850)
at sun.net.www.protocol.http.HttpURLConnection.getOutputStream(HttpURLConnection.java:1091)
at com.cloudera.cmf.BasicScmProxy.authenticate(BasicScmProxy.java:188)
at com.cloudera.cmf.BasicScmProxy.authenticateAndFetchScmDescriptor(BasicScmProxy.java:301)
at com.cloudera.cmf.BasicScmProxy.getScmDescriptor(BasicScmProxy.java:346)
at com.cloudera.cmon.firehose.Main.main(Main.java:374)
], EXCEPTION_TYPES=[java.net.ConnectException], ROLE=[mgmt-HOSTMONITOR-e8c92ccecb4376455a55563353303d3f], SEVERITY=[IMPORTANT], SERVICE=[mgmt], HOST_IDS=[0001a5a1-846a-4022-b4c6-a204abd12813], LOG_LEVEL=[WARN], ROLE_TYPE=[HOSTMONITOR], CATEGORY=[LOG_MESSAGE], SERVICE_TYPE=[MGMT], HOSTS=[master.novalocal], EVENTCODE=[EV_LOG_EVENT]}, content=IOException while getting descriptor, timestamp=1429645060526}
12:37:42.599 PM WARN com.cloudera.cmon.firehose.Main
No descriptor fetched from http://master.novalocal:7180 on after 2 tries, sleeping...
12:37:44.601 PM WARN com.cloudera.cmon.firehose.Main
No descriptor fetched from http://master.novalocal:7180 on after 3 tries, sleeping...
12:37:46.602 PM WARN com.cloudera.cmon.firehose.Main
No descriptor fetched from http://master.novalocal:7180 on after 4 tries, sleeping...
12:37:48.603 PM WARN com.cloudera.cmon.firehose.Main
No descriptor fetched from http://master.novalocal:7180 on after 5 tries, sleeping...
12:37:50.604 PM ERROR com.cloudera.cmon.firehose.Main
Could not fetch descriptor after 5 tries, exiting.

 

 

 

Your help is much appreciated.

 

 

Regards

Prateek

 

 

8 REPLIES

New Contributor

Hi Prateek,

 

Were you able to resolve this?

 

I'm facing the same problem.

 

Thanks,

shr1k

Master Guru

Hello,

 

If you are getting the same errors as in the original post, where each management service cannot fetch the descriptor, that indicates a problem with the management services contacting Cloudera Manager.  In order to know what hosts, services, and roles participate in the cluster (among other things), the management services must be able to retrieve a descriptor for the cluster from CM upon startup.  If it cannot be retrieved, then the management service will fail to start all the way.

 

The original post shows that the root cause of this is that no connection can be made.  If SSL is not involved, then typically there is a firewall or some other configuration issue that is preventing the management services from resolving/connecting.  If you have exactly the same stack traces, I would try shutting off any firewalls on the hosts where the management services run and where Cloudera Manager runs.  Try using telnet or ncat to ensure that you can make a connection from the management service host to the CM host.
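
As a minimal sketch of that connectivity test, the helper below tries to open a TCP connection using bash's built-in /dev/tcp redirection instead of telnet or ncat. The function name check_port and the 5-second default timeout are assumptions made for illustration; the host and port in the commented example are the ones shown in the logs of the original post. It assumes bash and the coreutils timeout command are available.

```shell
#!/usr/bin/env bash
# check_port HOST PORT [TIMEOUT_SECONDS]   (hypothetical helper name)
# Returns 0 if a TCP connection to HOST:PORT can be opened, non-zero otherwise.
check_port() {
  local host=$1 port=$2 secs=${3:-5}
  # /dev/tcp/HOST/PORT is handled by bash itself; timeout bounds the attempt
  # so a silently dropped packet (e.g. a firewall) does not hang forever.
  timeout "$secs" bash -c 'exec 3<>"/dev/tcp/$1/$2"' _ "$host" "$port" 2>/dev/null
}

# Example, run from the management service host:
# check_port master.novalocal 7180 && echo "reachable" || echo "connection failed"
```

A quick "connection failed" usually means nothing is listening on that port yet or the connection was refused, while a failure only after the full timeout points more towards a firewall dropping packets.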

 

If that doesn't help, you might post the exceptions you are seeing as there may be something different about the cause in your case.

 

-Ben

Explorer

Fresh simple default install of latest 5.7.x.p0.76. Cloudera Manager and HMS, HS2, HS, NM, SM and OS roles are on one node, data nodes and other roles are elsewhere.

Cluster shows good health, charts are updated.

Restarting this node results in no metrics and Host Monitor "connection refused" messages. Restarting the Cloudera Management Service solves the problem.

 

Is there a way to be able to restart the node without manually restarting the Cloudera Management Service later?

Is there a race condition?

Contributor

Hi aroraprateek

 

I had the same issue and I have resolved it.

 

Contributor

Hi,

 

I work with CDH 5.4.7 and I had the same issue, and I resolved it.

 

When the Cloudera Manager server is restarted after an upgrade or maintenance tasks, cloudera-scm-server and cloudera-scm-agent start, but the Cloudera Management Service (mgmt) does not.

 

The reason is that cloudera-scm-server and cloudera-scm-agent are configured to start at the same time:

[ cloudera_server ]: grep chkconfig /etc/init.d/cloudera-scm-*
/etc/init.d/cloudera-scm-agent:# chkconfig: 2345 90 10
/etc/init.d/cloudera-scm-server:# chkconfig: 2345 90 10

 

The Cloudera agent starts the Cloudera Management Service (mgmt), which needs to connect to the Cloudera server, and the Cloudera server takes longer to start than the Cloudera agent. mgmt tries to connect 5 times with only 2 seconds between retries, and in the end it cannot start (in the mgmt role logs I can see "connection refused" errors):

2017-01-02 15:06:44,673 WARN com.cloudera.cmon.firehose.Main: No descriptor fetched from https://cloudera_server:7183 on after 1 tries, sleeping...
2017-01-02 15:06:44,798 WARN com.cloudera.cmf.event.publish.EventStorePublisherWithRetry: Failed to publish event: SimpleEvent{attributes={ROLE_TYPE=[SERVICEMONITOR], EXCEPTION_TYPES=[java.net.ConnectException], HOST_IDS=[..], STACKTRACE=[java.net.ConnectException: Connection refused
[..]
2017-01-02 15:06:46,708 WARN com.cloudera.cmon.firehose.Main: No descriptor fetched from https://cloudera_server:7183 on after 2 tries, sleeping...
[..]
2017-01-02 15:06:52,724 WARN com.cloudera.cmon.firehose.Main: No descriptor fetched from https://cloudera_server:7183 on after 5 tries, sleeping...


To temporarily fix this issue, I did the following:

 

1. Change start order:
I changed it to this (server from 90 to 89):
[ cloudera_server ]: grep chkconfig /etc/init.d/cloudera-scm-*
/etc/init.d/cloudera-scm-agent:# chkconfig: 2345 90 10
/etc/init.d/cloudera-scm-server:# chkconfig: 2345 89 10

 

2. Add a Cloudera server check to the agent init start script.
/etc/init.d/cloudera-scm-agent (the added line is marked with +):
---
[..]
start() {
[..]
+ for i in $(seq 1 30); do curl -k -s -I $(facter cdh_url | awk -F\/api '{print $1}') | grep -q '200 OK' &>/tmp/init_cloudera_agent.out && break; sleep 10; done
$CMF_SUDO_CMD /bin/bash -c "nohup $AGENT_SCRIPT $CMF_AGENT_ARGS" >> $AGENT_OUT 2>&1 </dev/null &
[..]
}
[..]
---


* cdh_url is a custom Facter fact that returns https://cloudera_server:7183/api/v10

 

Changing only the start order doesn't work, because "/etc/init.d/cloudera-scm-server start" does not wait for the server to be completely started; it returns OK immediately while the server is still starting in the background. When I reboot this server, it starts cloudera-scm-server and immediately starts cloudera-scm-agent. cloudera-scm-agent starts faster than cloudera-scm-server, so mgmt cannot connect to the Cloudera server web interface. After 5 tries it is still down, and I need to start mgmt manually.

 

With these changes it works fine, but I think that I should not have to change these configurations…

 

Another valid solution would be for cloudera-scm-server to return OK only once it has started successfully and completely, so that the server starts first and then the agent with the mgmt services. For the moment, though, this works for me.
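
The "wait for the server before starting the agent" idea can also be sketched as a small generic helper, independent of facter. This is only an illustration under stated assumptions: the function name wait_until is hypothetical, the attempt count and delay mirror the 30 x 10 seconds used in the loop above, and the URL in the commented example is the one shown in this post.

```shell
#!/usr/bin/env bash
# wait_until ATTEMPTS DELAY_SECONDS COMMAND [ARGS...]   (hypothetical helper name)
# Runs COMMAND repeatedly until it succeeds.
# Returns 0 as soon as COMMAND succeeds, 1 if every attempt failed.
wait_until() {
  local attempts=$1 delay=$2
  shift 2
  local i
  for (( i = 1; i <= attempts; i++ )); do
    if "$@"; then
      return 0
    fi
    # Do not sleep after the final failed attempt.
    if (( i < attempts )); then
      sleep "$delay"
    fi
  done
  return 1
}

# Example: poll the Cloudera Manager web port for up to about 5 minutes,
# then start the agent only if the server answered.
# wait_until 30 10 curl -k -s -f -o /dev/null "https://cloudera_server:7183" \
#   && service cloudera-scm-agent start
```

Passing the probe as a command keeps the loop separate from the check itself, so the same helper works with curl, nc, or any other test that returns a meaningful exit status.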

 


Marc.

Contributor

Hi,

I encountered the same issue, and resolved it.  

cdh5.9.1

Cloudera Management Service > Configuration > Search "Descriptor"

Set "Descriptor Fetch Max Tries" to a larger value - 60 (default: 5)

I left "Descriptor Fetch Tries Interval" as default - 2 seconds.
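
As a rough check of why the larger value helps, the time a role keeps retrying is approximately the number of tries multiplied by the interval. The sketch below only does that arithmetic; the function name retry_window is hypothetical, and the estimate ignores the time each connection attempt itself takes.

```shell
#!/usr/bin/env bash
# retry_window TRIES INTERVAL_SECONDS   (hypothetical helper name)
# Prints the approximate number of seconds a role keeps retrying before giving up.
retry_window() {
  local tries=$1 interval=$2
  echo $(( tries * interval ))
}

retry_window 5 2     # defaults: prints 10
retry_window 60 2    # raised value: prints 120
```

With the defaults, Cloudera Manager has only about 10 seconds to come up after a reboot. With 60 tries the window grows to about 2 minutes, which covers the case in the log excerpt that follows, where the descriptor is fetched after 29 tries.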


result (HOSTMONITOR log) 


2017-05-23 12:09:56,804 WARN com.cloudera.cmon.firehose.Main: No descriptor fetched from http://cloudera-manager-server:7180 on after 27 tries, sleeping...
2017-05-23 12:09:58,805 WARN com.cloudera.cmon.firehose.Main: No descriptor fetched from http://cloudera-manager-server:7180 on after 28 tries, sleeping...
2017-05-23 12:10:00,806 WARN com.cloudera.cmon.firehose.Main: No descriptor fetched from http://cloudera-manager-server:7180 on after 29 tries, sleeping...
2017-05-23 12:10:04,029 INFO com.cloudera.cmf.BasicScmProxy: Using encrypted credentials for SCM
2017-05-23 12:10:04,182 INFO com.cloudera.cmf.BasicScmProxy: Authenticated to SCM.
2017-05-23 12:10:07,595 INFO com.cloudera.cmon.firehose.Main: SCM descriptor fragments fetched successfully
 
 
The other mgmt services behave the same way.

 

Nob.

Explorer
Thanks a lot. I encountered the same problem while upgrading from CDH 5.10.0 to CDH 5.11.0. The management services (including Navigator) were not able to start.
I followed your instructions, and after restarting the Cloudera agent, the mgmt services were able to start.

Explorer

Hi, could you paste the contents of your /etc/hosts here, along with the output of service --status-all from the command prompt?

BR

Sam