From the agent log on one of the workers I see this:
ERROR 2017-11-23 16:11:07,601 Controller.py:456 - Unable to reconnect to https://master02:8441/agent/v1/heartbeat/worker06.sys674.com (attempts=5, details=Request to https://master02:8441/agent/v1/heartbeat/worker06.sys674.com failed due to [Errno 111] Connection refused)
Ambari Server fetches some information, such as the DataNode status, from the NameNode. Since the NameNode reports only 3 live DataNodes, the other 2 DataNodes are not communicating properly with the NameNode. Those DataNodes may well be running (and may have a valid PID file), but because they are not reaching the NameNode, Ambari simply shows the information it gets from the NameNode.
So at this point we can say that there is no issue on the Ambari side: it is showing the live DataNode information it gets from the NameNode.
To investigate why those DataNodes are not communicating with the NameNode (why the NameNode is not showing all 5 nodes as live), we will have to look at the NameNode log as well as the DataNode logs of the problematic DataNodes.
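Before going through the logs, it can help to confirm what the NameNode itself reports. The sketch below is only illustrative: it assumes the output of "hdfs dfsadmin -report" contains header lines of the form "Live datanodes (N):" and "Dead datanodes (N):", which may differ between Hadoop versions, and datanode_count is a made-up helper name.

```shell
# Extract N from a "Live datanodes (N):" or "Dead datanodes (N):" header line.
# $1 is "Live" or "Dead"; the report text is read from stdin.
# Assumption: the report uses exactly this header format.
datanode_count() {
    sed -n "s/^$1 datanodes (\([0-9][0-9]*\)):.*/\1/p" | head -n 1
}

# Usage on a cluster node, as the hdfs user:
#   hdfs dfsadmin -report | datanode_count Live
#   hdfs dfsadmin -report | datanode_count Dead
```

If the live count here matches what Ambari shows, the problem is between the DataNodes and the NameNode and not in Ambari.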
Regarding the Agent communication with Ambari Server:
Unable to reconnect to https://master02:8441/agent/v1/heartbeat/worker06.sys674.com
Please check whether those hosts resolve the Ambari Server hostname and IP address properly. Check the "/etc/hosts" entries on those hosts to verify that the Ambari host resolves correctly.
Also please check whether a blocked port or a firewall is preventing those hosts from reaching Ambari Server port 8441.
# cat /etc/hosts
# nc -v master02 8441
(OR)
# telnet master02 8441
Please confirm that "master02" is actually your Ambari Server host. If it is not, check the "/etc/ambari-agent/conf/ambari-agent.ini" file to verify that the Ambari Server hostname is set correctly there.
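The checks above can also be scripted. This is a minimal sketch, assuming the agent config keeps the server name under "hostname" in its [server] section, as in a default ambari-agent.ini. The function names agent_server_hostname and can_connect are made up for illustration, and can_connect relies on bash's /dev/tcp redirection plus the coreutils "timeout" command.

```shell
# Print the "hostname" value from the [server] section of an agent ini file.
# Defaults to the standard agent config path when no file is given.
agent_server_hostname() {
    awk -F= '
        /^\[/ { in_server = ($0 ~ /^\[server\]/) }
        in_server && $1 ~ /^[ \t]*hostname[ \t]*$/ {
            gsub(/^[ \t]+|[ \t]+$/, "", $2); print $2; exit
        }
    ' "${1:-/etc/ambari-agent/conf/ambari-agent.ini}"
}

# Return 0 if a TCP connection to host ($1) and port ($2) opens within 5 seconds.
can_connect() {
    timeout 5 bash -c 'exec 3<>"/dev/tcp/$0/$1"' "$1" "$2" 2>/dev/null
}

# Usage on a worker host:
#   server=$(agent_server_hostname)
#   can_connect "$server" 8441 && echo "8441 reachable" || echo "8441 NOT reachable"
```

A failure here on a worker while the same check succeeds from another host would point at name resolution or a firewall on that worker.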
We see two issues here:
Issue-1). DataNode live status issue. This is on the HDFS side, because the NameNode is showing only 3 live nodes instead of 5 out of 5.
>>> So in order to investigate that issue we will need the following:
a. NameNode logs (complete log)
b. DataNode logs from the problematic hosts (complete log)
Issue-2). The agent is showing Connection refused for https://master02:8441. This can also be related to an OpenSSL / Python issue, because the Ambari agent communicates with the Ambari Server over HTTPS on ports 8441 and 8440 using the Python and OpenSSL libraries.
>>> So we will need to see the ambari-server.log as well as the complete ambari-agent.log to get more details about this issue. We also need to check whether the OpenSSL, Python and OS versions are the same on all hosts.
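One way to compare versions is to run the same small report on every host and diff the results. The following is a rough sketch, not an official Ambari tool: report_versions is a hypothetical helper, and it assumes the agent uses whichever "python" (or "python3") is first on the PATH, which may not be the case on every installation.

```shell
# Print the versions relevant to the agent's HTTPS connection, one per line.
# Tools that are missing are reported as "unavailable" instead of aborting.
report_versions() {
    local py
    py=$(command -v python || command -v python3 || true)
    echo "host=$(uname -n)"
    echo "openssl=$(openssl version 2>/dev/null || echo unavailable)"
    if [ -n "$py" ]; then
        # Python 2 prints its version on stderr, so capture both streams.
        echo "python=$("$py" --version 2>&1)"
        # The OpenSSL build that Python's ssl module is linked against.
        echo "python_ssl=$("$py" -c 'import ssl; print(ssl.OPENSSL_VERSION)' 2>/dev/null || echo unavailable)"
    else
        echo "python=unavailable"
        echo "python_ssl=unavailable"
    fi
    echo "kernel=$(uname -sr)"
}

# Usage: run on each host and compare the output, for example
#   report_versions > /tmp/versions.$(uname -n).txt
```

The OS release itself can be read from /etc/os-release (or /etc/redhat-release on older RHEL/CentOS hosts) and compared in the same way.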
I checked the logs on ambari-server. There are a lot of details, but I see this --> Unable to propagate version for ServiceHostComponent on component: SPARK2_CLIENT, host: worker06.sys674.com. Error:
This version is the same on the ambari-server and the worker machines.