I have an issue with all of our services. We did a firmware upgrade on our machine, and now all of the services say:
"No host heartbeat; CDH versions cannot be verified."
I also cannot restart the services through the manager; they either time out or report that they cannot communicate with the name node:
"Command aborted because of exception: Command timed-out after 150 seconds"
Here is the log for our cluster. Just a heads-up: the cluster is one machine and the data nodes are VMs.
The log file is too big to post, so please see it here:
My first steps for troubleshooting something like this would be to test network connectivity.
Was there a reboot associated with the firmware update? Maybe some firewall rules reverted after the reboot?
Sorry, so my steps included running ifconfig on each node and testing to see if they
Then I checked the hosts file to ensure each node was resolving to the correct FQDN and IP address.
regarding the firewalls
What would you recommend after these steps?
I think the only other thing I would test is to use something that connects over TCP, to verify that none of the ports are being blocked by something other than iptables (though I'm not sure what that would be).
In RHEL, for example:
nc -z <host> <port>
ping uses ICMP, which is filtered separately from TCP, so a successful ping does not prove that the TCP ports are reachable.
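To check several ports in one go, the nc test above could be wrapped in a small loop. This is only a sketch: the function names are made up for illustration, it assumes the installed nc supports -z and -w, and the ports in the usage note are the usual Cloudera Manager and NameNode defaults, which may not match your setup.

```shell
# Probe one TCP port. -z only checks that the connection opens,
# -w 3 gives up after 3 seconds (assumes your nc build supports both).
check_port() {
    local host="$1" port="$2"
    if nc -z -w 3 "$host" "$port" >/dev/null 2>&1; then
        echo "$host:$port open"
    else
        echo "$host:$port unreachable"
        return 1
    fi
}

# Probe a list of ports on one host; returns non-zero if any port failed.
check_ports() {
    local host="$1" port status=0
    shift
    for port in "$@"; do
        check_port "$host" "$port" || status=1
    done
    return "$status"
}
```

For example, `check_ports cm-host.example.com 7180 7182 8020` (hypothetical host name) would try the CM web UI, the agent heartbeat port and the NameNode RPC port, assuming the defaults were not changed. Running it from each data node towards the CM host, and the other way around, should show whether a port is blocked in only one direction.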
If it's not the network: it looks like you are using Cloudera Manager? If so, I'd defer to a Cloudera employee, as they have access to the code and can better determine what could cause your issues.
If network connectivity is working:
- Please look at the Hosts tab in Cloudera Manager. Do you see all your slave nodes listed correctly?
- Are you using TLS between the nodes and CM?
- What do the agent logs under /var/log/cloudera-scm-agent/ say?
- Check /etc/cloudera-scm-agent/config.ini for the CM host the agents are meant to heartbeat to.
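The last two checks could be scripted on each node along these lines. It's a rough sketch: cm_setting and agent_summary are made-up helper names, and it assumes config.ini uses plain key=value lines (server_host, server_port) and that the agent log is named cloudera-scm-agent.log, so adjust the paths if yours differ.

```shell
# Print the value of one key from an ini-style file (first uncommented match).
cm_setting() {
    local key="$1" file="${2:-/etc/cloudera-scm-agent/config.ini}"
    sed -n "s/^[[:space:]]*${key}[[:space:]]*=[[:space:]]*//p" "$file" | head -n 1
}

# Show which CM server this agent targets, then the most recent error lines
# from the agent log (log file name is an assumption).
agent_summary() {
    local conf="${1:-/etc/cloudera-scm-agent/config.ini}"
    local log="${2:-/var/log/cloudera-scm-agent/cloudera-scm-agent.log}"
    echo "server_host=$(cm_setting server_host "$conf")"
    echo "server_port=$(cm_setting server_port "$conf")"
    grep -i 'error' "$log" | tail -n 20
}
```

Running `agent_summary` on every node should make it obvious if one agent points at a stale host name or IP for the CM server, which is worth ruling out after a firmware upgrade and reboot.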
Since you are using Cloudera Manager, we should abandon this thread and continue in the other one.