Since the connection refused error stopped me every time, I decided to install ambari.repo and run yum install ambari-agent on each node in the cluster. When I then installed the cluster, 3 hosts failed at the first step (Confirm Hosts); only the ambari-server host passed it.
The error was the same as above: failed to reconnect to the local server. So it seems the IPs on the nodes have changed. I may have to use a VPC to get static IPs, right? If so, the limit for VPCs is 5, and I have over 10 nodes here.
Also, the host check has no warnings...
The key ports for the Ambari-Agent and Server communication are the following:
Ambari Server Port 8440 => Handshake port for Ambari Agents to Ambari Server
Ambari Server Port 8441 => Registration and heartbeat port for Ambari Agents to Ambari Server
The Ambari server hostname can be found in "/etc/ambari-agent/conf/ambari-agent.ini".
For server and agent communication, the agents must be able to reach the FQDN (hostname -f) given in the ambari-agent.ini file on the ports mentioned above.
So you will first need to ensure that the agent machines can communicate with the Ambari server on that hostname and those ports.
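As a quick way to confirm what each agent is actually configured to use, here is a minimal Python sketch that reads the server hostname and ports back from the ini file. It assumes the file has a [server] section with hostname, url_port and secured_url_port keys, as in a default agent install; the function name read_server_settings is just an illustrative helper, not part of Ambari.

```python
import configparser


def read_server_settings(path="/etc/ambari-agent/conf/ambari-agent.ini"):
    """Return (hostname, url_port, secured_url_port) from the [server] section.

    Assumption: the ini layout matches a default ambari-agent install.
    """
    # interpolation=None so that any '%' in other values is not misread;
    # strict=False tolerates duplicate keys in hand-edited files.
    parser = configparser.ConfigParser(interpolation=None, strict=False)
    if not parser.read(path):
        raise FileNotFoundError(path)
    server = parser["server"]
    return (
        server.get("hostname"),
        server.getint("url_port", fallback=8440),
        server.getint("secured_url_port", fallback=8441),
    )
```

Calling read_server_settings() on an agent node should return the FQDN you put in the file; if it still returns localhost, that agent will try to register with itself rather than with the Ambari server.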
The VPC limitation is beyond Ambari's control. It is more of an infrastructure issue.
Thank you so much, Jay. This is the file I modified on all nodes, replacing localhost with the hostname -f output.
I also started the agent on all nodes with no problem. So how do I know whether the agents can communicate with the master?
The agent communicates with the master (Ambari Server) on those ports over HTTPS, so you can verify connectivity with any of the following approaches. Run these commands from the agent machine to see whether the agent can connect to the Ambari server:
# telnet $AMBARI_HOSTNAME 8440
# telnet $AMBARI_HOSTNAME 8441
# openssl s_client -connect $AMBARI_HOSTNAME:8440
# openssl s_client -connect $AMBARI_HOSTNAME:8441
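If telnet is not installed on the agent nodes, the same reachability check can be sketched in Python. This is only a rough equivalent of the telnet test: it confirms that a TCP connection to the port can be opened, not that the TLS handshake succeeds (openssl s_client covers that part). The function names can_connect and check_ambari_ports are illustrative helpers, not Ambari tools.

```python
import socket


def can_connect(host, port, timeout=5.0):
    """Return True if a TCP connection to host:port opens within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and unresolvable hostnames.
        return False


def check_ambari_ports(host, ports=(8440, 8441)):
    """Return a {port: reachable} mapping for the agent-to-server ports."""
    return {port: can_connect(host, port) for port in ports}
```

Run check_ambari_ports with the Ambari server FQDN from an agent machine. Note that the host and the port are passed as separate arguments, just as telnet expects them separated by a space.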
# hostname -f
ip-172-31-13-143.us-west-2.compute.internal
# telnet ip-172-31-13-143.us-west-2.compute.internal8440
telnet: ip-172-31-13-143.us-west-2.compute.internal8440: Name or service not known
ip-172-31-13-143.us-west-2.compute.internal8440: Unknown host
# telnet ip-172-31-13-143.us-west-2.compute.internal8441
telnet: ip-172-31-13-143.us-west-2.compute.internal8441: Name or service not known
Both # openssl s_client -connect ip-172-31-13-143.us-west-2.compute.internal:8440 and the same command for port 8441 produce no output.