Created on 10-05-2017 07:37 AM - edited 09-16-2022 05:21 AM
Yesterday I installed HDP 2.6.2 using the Ambari method, on a 3 node Ubuntu 16.10 x86_64 environment - or rather tried to. I'm using JDK 8. I installed all components, pretty much with defaults.
All the required ports should be open between nodes (I'm using Cloud Foundry).
The install failed towards the end, when starting some services. Checking Ambari, things seem to be deployed, though I see that:
* ZooKeeper is starting OK
* Ambari Infra is failing to start:
java.util.concurrent.TimeoutException: Could not connect to ZooKeeper hdp262-1.novalocal:2181,hdp262-2.novalocal:2181,hdp262-3.novalocal:2181 within 15000 ms
java.util.concurrent.TimeoutException: Could not connect to ZooKeeper hdp262-1.novalocal:2181,hdp262-2.novalocal:2181,hdp262-3.novalocal:2181 within 15000 ms
Return code: 1. Sleeping for 5 sec(s)
2017-10-05 07:27:28,881 - Execute['ambari-sudo.sh JAVA_HOME=/usr/jdk64/jdk1.8.0_112 /usr/lib/ambari-infra-solr-client/solrCloudCli.sh --zookeeper-connect-string hdp262-1.novalocal:2181,hdp262-2.novalocal:2181,hdp262-3.novalocal:2181 --znode /infra-solr --create-znode --retry 30 --interval 5'] {}
I AM able to connect via TCP between those ports on the nodes - for example
cloudusr@hdp262-1:/var/log/ambari-agent$ telnet hdp262-1.novalocal 2181
Trying 127.0.1.1...
Connected to hdp262-1.novalocal.
Escape character is '^]'.
^]quit
telnet> quit
Connection closed.
Meanwhile in the full ambari log I see:
Waiting for client to connect to ZooKeeper
Opening socket connection to server hdp262-1.novalocal/9.20.65.115:2181. Will not attempt to authenticate using SASL (unknown error)
Socket connection established to hdp262-1.novalocal/9.20.65.115:2181, initiating session
Unable to read additional data from server sessionid 0x0, likely server has closed socket, closing socket connection and attempting reconnect
I'm not familiar enough with the stack (HDP/Ambari/ZooKeeper). Any tips?
Created 10-05-2017 08:25 AM
I suspect that the problem is here:
cloudusr@hdp262-1:/var/log/ambari-agent$ telnet hdp262-1.novalocal 2181
Trying 127.0.1.1...
Connected to hdp262-1.novalocal.
Escape character is '^]'.
^]quit
When you run telnet, the hostname "hdp262-1.novalocal" resolves to the loopback address 127.0.1.1 rather than the host's real IP address. Are you sure that the hostname-to-IP mapping is correct on your end?
Please check the following outputs:
# cat /etc/hosts
# hostname -f
# python -c 'import socket;print socket.getfqdn()'
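If it helps, the same check can be scripted. This is only a minimal sketch, assuming Python 3 is available on the node (the one-liner above is Python 2 syntax); the helper names resolve_ipv4, is_loopback and describe are made up for illustration and are not part of Ambari or HDP. It asks the local resolver (which consults /etc/hosts) what a name maps to and flags loopback answers such as 127.0.1.1.

```python
import ipaddress
import socket


def resolve_ipv4(hostname):
    """Return the sorted, de-duplicated IPv4 addresses the local resolver gives for hostname."""
    infos = socket.getaddrinfo(hostname, None, socket.AF_INET, socket.SOCK_STREAM)
    return sorted({info[4][0] for info in infos})


def is_loopback(address):
    """True for any address in 127.0.0.0/8 (this includes 127.0.1.1) or ::1."""
    return ipaddress.ip_address(address).is_loopback


def describe(hostname):
    """Return one human-readable line per resolved address."""
    lines = []
    for address in resolve_ipv4(hostname):
        kind = "loopback" if is_loopback(address) else "non-loopback"
        lines.append(f"{hostname} -> {address} ({kind})")
    return lines
```

Calling describe(socket.getfqdn()) on each node and printing the result should show whether the node's own FQDN comes back as a loopback address.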
To verify whether ZooKeeper is listening only on the loopback address or is bound to all addresses, please run the following command on the host where ZooKeeper is running:
# netstat -tnlpa | grep 2181
If you are getting the correct IP address mapping for the Hostname "hdp262-1.novalocal"
Created 10-05-2017 08:31 AM
🙂 In this particular case, hdp262-1.novalocal is indeed the system where the script is running, so it happens to be localhost too. I see the same errors when it tries to reach the other systems.
My /etc/hosts contains:
127.0.1.1 hdp262-1.novalocal hdp262-1
127.0.0.1 localhost
9.20.65.115 hdp262-1.novalocal hdp262-1
9.20.65.135 hdp262-2.novalocal hdp262-2
9.20.65.175 hdp262-3.novalocal hdp262-3
and this is common to every machine (well, the last 3 lines are). This is a developer cf environment and those hostnames don't resolve over DNS. Not ideal, but I figured this was a quick workaround (I'd need to do a little more work to set something up for a local domain).
cloudusr@hdp262-1:~$ hostname -f
hdp262-1.novalocal
cloudusr@hdp262-1:~$ python -c 'import socket;print socket.getfqdn()'
hdp262-1.novalocal
Created 10-05-2017 08:35 AM
Can you remove or comment out the first entry of your /etc/hosts so that it looks like the following?
# 127.0.1.1 hdp262-1.novalocal hdp262-1
127.0.0.1 localhost
9.20.65.115 hdp262-1.novalocal hdp262-1
9.20.65.135 hdp262-2.novalocal hdp262-2
9.20.65.175 hdp262-3.novalocal hdp262-3
Then retry; it should be okay.
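To spot this kind of entry on every node without reading the files by eye, something like the following could be used. It is a rough sketch, assuming Python 3; parse_hosts and loopback_conflicts are illustrative names, and SAMPLE is simply the hosts content posted earlier in this thread. It reports any name that is mapped to both a loopback and a non-loopback address.

```python
import ipaddress
from collections import defaultdict

# The hosts entries as originally posted in this thread.
SAMPLE = """\
127.0.1.1 hdp262-1.novalocal hdp262-1
127.0.0.1 localhost
9.20.65.115 hdp262-1.novalocal hdp262-1
9.20.65.135 hdp262-2.novalocal hdp262-2
9.20.65.175 hdp262-3.novalocal hdp262-3
"""


def parse_hosts(text):
    """Map each hostname or alias to the list of addresses it is listed under."""
    names = defaultdict(list)
    for line in text.splitlines():
        line = line.split("#", 1)[0].strip()  # ignore comments and blank lines
        if not line:
            continue
        address, *aliases = line.split()
        try:
            ipaddress.ip_address(address)  # skip lines that do not start with an IP
        except ValueError:
            continue
        for alias in aliases:
            names[alias].append(address)
    return dict(names)


def loopback_conflicts(text):
    """Return the names mapped to both a loopback and a non-loopback address."""
    conflicts = {}
    for name, addresses in parse_hosts(text).items():
        kinds = {ipaddress.ip_address(a).is_loopback for a in addresses}
        if kinds == {True, False}:
            conflicts[name] = addresses
    return conflicts
```

On a real node the text would come from reading /etc/hosts; with SAMPLE, loopback_conflicts flags hdp262-1.novalocal and hdp262-1, and the flag goes away once the 127.0.1.1 line is commented out.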
Created 10-05-2017 08:36 AM
cloudusr@hdp262-1:~$ netstat -tnlpa | grep 2181
(Not all processes could be identified, non-owned process info
will not be shown, you would have to be root to see it all.)
tcp6 0 0 :::2181 :::* LISTEN -
tcp6 0 0 127.0.1.1:2181 127.0.0.1:53344 TIME_WAIT
Created 10-05-2017 08:37 AM
We should not edit the first two lines of the "/etc/hosts" file. So it should ideally look like the following:
127.0.0.1 localhost localhost.localdomain localhost4 localhost4.localdomain4
::1 localhost localhost.localdomain localhost6 localhost6.localdomain6
9.20.65.115 hdp262-1.novalocal hdp262-1
9.20.65.135 hdp262-2.novalocal hdp262-2
9.20.65.175 hdp262-3.novalocal hdp262-3
Please see the Note in the following Doc: https://docs.hortonworks.com/HDPDocuments/Ambari-2.5.2.0/bk_ambari-installation/content/edit_the_hos...
Which says:
Do not remove the following two lines from your hosts file. Removing or editing the following lines may cause various programs that require network functionality to fail.
127.0.0.1 localhost.localdomain localhost
::1 localhost6.localdomain6 localhost6
Created 10-09-2017 04:09 PM
Created 10-05-2017 08:38 AM
Trying the /etc/hosts change...
Created 10-09-2017 04:15 PM
Sadly not, so I need to dig deeper, but I ran out of time before having to prioritize something else for a few days. I'll return to the cluster later this week. In the interim I'm trying the 2.6.1 docker image. Thanks for the tips. I'll update when I can clarify more (or retry the install from a clean base, with the appropriate hosts entries in place or replaced by DNS).
Created 11-06-2017 04:07 PM
Can you telnet to the other ZooKeeper nodes on port 2181? It looks like you tested telnet only against the local node, but you need to check the other hosts too. I faced exactly the same error message, and I noticed that I was missing an entry in the hosts file for one of the ZooKeeper nodes.
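A telnet connect only proves that the TCP port accepts connections. One possible way to go a step further is ZooKeeper's four-letter "ruok" command, which a healthy server answers with "imok". The sketch below assumes Python 3 and that four-letter commands are enabled on the server (I believe they are by default on ZooKeeper 3.4.x, while newer releases require them to be whitelisted); zk_ruok and probe_all are illustrative names only.

```python
import socket


def zk_ruok(host, port=2181, timeout=5.0):
    """Send 'ruok' to a ZooKeeper server and return its reply as text.

    Returns None if the name cannot be resolved, the connection fails,
    or the server does not answer within the timeout.
    """
    try:
        with socket.create_connection((host, port), timeout=timeout) as sock:
            sock.sendall(b"ruok")
            chunks = []
            while True:
                data = sock.recv(64)
                if not data:  # server closed the connection after replying
                    break
                chunks.append(data)
            return b"".join(chunks).decode("ascii", "replace")
    except OSError:
        return None


def probe_all(hosts, port=2181):
    """Probe every host and return a {host: reply-or-None} mapping."""
    return {host: zk_ruok(host, port) for host in hosts}
```

Running probe_all over the three hostnames from the connect string, from each node in turn, should show which node-to-node pairs get "imok" and which get nothing back.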
Created 08-02-2018 05:03 PM
Hi @Nigel Jones, have you found a solution to this problem yet? I am experiencing the exact same problems starting up HDP 3.0.0