Created 08-12-2014 09:38 AM
I have an issue with all of our services. We did a firmware upgrade on our machine, and now all of the services say:
"No host heartbeat; CDH versions cannot be verified."
I also cannot restart the services through the manager; they either time out or report that they cannot communicate with the name node:
"Command aborted because of exception: Command timed-out after 150 seconds"
Here is the log for our cluster. Just a heads-up: the cluster is one machine, and the data nodes are VMs.
The log file is too big to post, so please see it here:
http://pastebin.com/CznqykHF
Created 08-12-2014 09:41 AM
Created 08-12-2014 09:47 AM
Hi Charles, thanks for the cloudera-scm-server.log, it's most helpful. It reads like the other side of the heartbeat communication, the cloudera-scm-agent service, may not be running yet or failed to start for some reason on 'usorla7hp1106x'. Could you pastebin /var/log/cloudera-scm-agent/cloudera-scm-agent.log and /var/log/cloudera-scm-agent/supervisord.log? Redact any IPs, hostnames, or IDs as you feel necessary; the last 500-1000 lines of each may be sufficient.
Also, what's the output of
# service cloudera-scm-agent status
# ps -ef | grep supervisord
Thanks
--
Created on 08-12-2014 09:57 AM - edited 08-12-2014 10:00 AM
Thanks for the quick reply
(/var/log/hadoop-yarn)-1179> service cloudera-scm-agent status
cloudera-scm-agent (pid 59385) is running...
(/var/log/hadoop-yarn)-1180> ps -ef | grep supervisord
root 59419 1 0 12:00 ? 00:00:00 /usr/lib64/cmf/agent/src/cmf/../../build/env/bin/python /usr/lib64/cmf/agent/src/cmf/../../build/env/bin/supervisord
root 65096 37187 0 12:48 pts/3 00:00:00 grep supervisord
Names have been replaced by CORRECT-FQDN-HERE.net
Here is the log file you requested for cloudera-scm-agent.log
Here is the log file for cloudera supervisord.log
Created 08-12-2014 10:06 AM
Great.
This appears to be the crux:
12/Aug/2014 12:52:03 +0000 59385 MainThread agent ERROR Heartbeating to CORRECT-FQDN-HERE.net:7182 failed.
Something is preventing the agent from heartbeating to port 7182 on the node where cloudera-scm-server runs. You said this is a single-node cluster though, right? As in, this cloudera-scm-agent is running locally alongside cloudera-scm-server on CORRECT-FQDN-HERE.net?
You've already spoken of iptables and SELinux. Is there anything else that may have changed with the reboot, such as the FQDN? Have you made any changes to /etc/cloudera-scm-agent/config.ini before, during, or after this reboot?
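To see which address the agent is actually trying to reach, the failures can be pulled out of the agent log. This is only a rough sketch: it assumes the error lines keep the "Heartbeating to HOST:PORT failed." shape quoted above, and `heartbeat_targets` is a made-up helper name, not something that ships with Cloudera Manager.

```shell
# Hypothetical helper: list each distinct HOST:PORT the agent failed to
# heartbeat to, based on "Heartbeating to HOST:PORT failed." log lines.
heartbeat_targets() {
    grep -o 'Heartbeating to [^ ]* failed' "$1" | awk '{ print $3 }' | sort -u
}

# Example:
#   heartbeat_targets /var/log/cloudera-scm-agent/cloudera-scm-agent.log
```

If the host printed there is not the machine where cloudera-scm-server runs, or does not resolve from the node running the agent, that would be consistent with the missing heartbeats.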
Created on 08-12-2014 10:15 AM - edited 08-12-2014 10:20 AM
This is a single machine; however, I am running three VMs (KVM) that act as datanodes, and the "name node" is just sitting on the main machine. Question: am I supposed to run the agent separately from the cloudera-scm-server?
I have received little information from my department regarding the firmware upgrade (let alone a notice, lol!)
hostname, /etc/hosts, ifconfig, and host -v -t A hostname
all match up, so nothing has changed...
The file you mentioned still looks the same as when I left it.
I basically went home on a Friday with everything operational and came back on Monday to find the manager FUBAR'd. Is there anything I can do to start these services manually? Perhaps Cloudera Manager is not communicating correctly with these services?
EDIT: Also, is there something the agent needs in order to run successfully? Maybe another service that has to be running?
Created 08-17-2014 03:50 AM
Can you double check the following please:
- /etc/cloudera-scm-agent/config.ini should have the hostname or IP address of the machine where Cloudera Manager runs
- if you see a host name above, ensure you can resolve it correctly from the slaves
# ping CORRECT-FQDN-HERE.net
# telnet CORRECT-FQDN-HERE.net 7182
The "host" command won't consult /etc/hosts, so you need to use ping or something simple that just calls gethostbyname().
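The checks above can be sketched as a few shell helpers. This is a minimal sketch under some assumptions: that config.ini uses `server_host` and `server_port` keys, that `getent` and coreutils `timeout` are installed, and that bash is available for its /dev/tcp pseudo-device. The function names are made up for illustration. `getent hosts` goes through the system resolver (NSS), so unlike `host` it does take /etc/hosts into account.

```shell
# Hypothetical helpers; the names are illustrative, not part of Cloudera Manager.

# Print the value of KEY from an INI-style FILE (first match wins).
cm_setting() {
    sed -n "s/^[[:space:]]*$1[[:space:]]*=[[:space:]]*//p" "$2" | head -n 1
}

# Resolve a name through NSS, so /etc/hosts is consulted; prints the address.
resolve_host() {
    getent hosts "$1" | awk '{ print $1; exit }'
}

# Return 0 if a TCP connection to HOST PORT succeeds within 5 seconds.
port_open() {
    timeout 5 bash -c 'exec 3<>"/dev/tcp/$0/$1"' "$1" "$2" 2>/dev/null
}

# Example (run on each VM):
#   conf=/etc/cloudera-scm-agent/config.ini
#   host=$(cm_setting server_host "$conf")
#   port=$(cm_setting server_port "$conf")
#   [ -n "$(resolve_host "$host")" ] || echo "cannot resolve $host"
#   port_open "$host" "$port" || echo "cannot reach $host:$port"
```

Running the example on each VM as well as on the main machine should show whether the problem is name resolution or the connection to port 7182 itself.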