Created 04-24-2017 04:54 PM
I want to set up an HDP 2.6 cluster with Ambari 2.5 on AWS. I tried this before on RedHat 7.3, but found out that version is not in the support matrix. So I started over on Ubuntu 16.04, because it's in all the support matrices.
I have Ambari Server running on an edge node and I have six Ubuntu Server 16.04 nodes ready as masters and workers. I believe I've done every step in the documentation (https://docs.hortonworks.com/HDPDocuments/Ambari-2.5.0.3/bk_ambari-installation/content/ch_Getting_Ready.html).
The nodes are accessible as root over passwordless SSH. One place where I might have strayed a bit from the beaten track: I installed Ambari Server under a different account than root, namely ambari.
I've used the Cluster Install Wizard to create my cluster. I've chosen version 2.6.0.3, tried both public and local repositories, and entered the list of nodes on separate lines. As the SSH private key I've used the id_rsa file from /root/.ssh on the edge node. I hope that's the correct one.
When I start the install, it says it's installing, but that takes forever and never finishes. The nodes never get the ambari-agent.
So let's check the logs to see what's actually happening. In /var/log/ambari-server/ambari-server.log it shows these messages:
24 Apr 2017 06:54:08,179 INFO [pool-19-thread-1] BSHostStatusCollector:55 - Request directory /var/run/ambari-server/bootstrap/1
24 Apr 2017 06:54:08,179 INFO [pool-19-thread-1] BSHostStatusCollector:62 - HostList for polling on [ip-172-16-100-169.eu-west-1.compute.internal]
24 Apr 2017 06:54:18,179 INFO [pool-19-thread-1] BSHostStatusCollector:55 - Request directory /var/run/ambari-server/bootstrap/1
24 Apr 2017 06:54:18,180 INFO [pool-19-thread-1] BSHostStatusCollector:62 - HostList for polling on [ip-172-16-100-169.eu-west-1.compute.internal]
(This is the log where I try to add just one node to the cluster. With all six I get just about the same, but with more hosts mentioned.)
In /var/log/auth.log on the Ambari server node:
Apr 24 06:53:31 ip-172-16-100-12 sshd[2351]: Accepted publickey for ubuntu from 83.128.84.60 port 52221 ssh2: RSA<not sure it's safe to show that online>
Apr 24 06:53:31 ip-172-16-100-12 sshd[2351]: pam_unix(sshd:session): session opened for user ubuntu by (uid=0)
Apr 24 06:53:31 ip-172-16-100-12 systemd-logind[1170]: New session 3 of user ubuntu.
Apr 24 06:53:37 ip-172-16-100-12 sudo: ubuntu : TTY=pts/1 ; PWD=/home/ubuntu ; USER=root ; COMMAND=/bin/su -
Apr 24 06:53:37 ip-172-16-100-12 sudo: pam_unix(sudo:session): session opened for user root by ubuntu(uid=0)
Apr 24 06:53:37 ip-172-16-100-12 su[2398]: Successful su for root by root
Apr 24 06:53:37 ip-172-16-100-12 su[2398]: + /dev/pts/1 root:root
Apr 24 06:53:37 ip-172-16-100-12 su[2398]: pam_unix(su:session): session opened for user root by ubuntu(uid=0)
Apr 24 06:53:37 ip-172-16-100-12 su[2398]: pam_systemd(su:session): Cannot create session: Already running in a session
Not sure what to make of the last error.
On the node I want to make my first master node, in /var/log/auth.log:
Apr 24 06:53:45 ip-172-16-100-169 sshd[1398]: Accepted publickey for root from 172.16.100.12 port 40572 ssh2: RSA SHA256:<not sure it's safe to show that online>
Apr 24 06:53:45 ip-172-16-100-169 sshd[1398]: pam_unix(sshd:session): session opened for user root by (uid=0)
Apr 24 06:53:45 ip-172-16-100-169 systemd-logind[1166]: New session 1 of user root.
Apr 24 06:53:45 ip-172-16-100-169 systemd: pam_unix(systemd-user:session): session opened for user root by (uid=0)
So it looks like the connection to the proposed master node is (mostly) successful?
I have already read other discussions from people with the same type of errors, and I've tried just about every piece of advice I've seen given.
One suggestion was to fix the fully qualified domain name. So let's check that:
root@ip-172-16-100-169:/var/log# hostname
ip-172-16-100-169.eu-west-1.compute.internal
root@ip-172-16-100-169:/var/log# hostname -f
ip-172-16-100-169.eu-west-1.compute.internal
root@ip-172-16-100-169:/var/log# cat /etc/hostname
ip-172-16-100-169.eu-west-1.compute.internal
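The same check can be scripted so it is easy to repeat on every node. This is only a rough sketch: the function name `is_fqdn` is made up for this example, and "looks fully qualified" is simplified here to "contains at least one dot between non-empty labels".

```shell
# Succeeds if the given name looks fully qualified: non-empty, with at
# least one dot, and no empty labels (leading, trailing or double dots).
is_fqdn() {
    case "$1" in
        ""|.*|*.|*..*) return 1 ;;
        *.*)           return 0 ;;
        *)             return 1 ;;
    esac
}

# Usage on a node (not run here):
#   is_fqdn "$(hostname -f)" || echo "hostname -f is not fully qualified"
```

The name printed by `hostname -f` should also be the name entered in the wizard's host list.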
How about NTP?
root@ip-172-16-100-169:/var/log# sudo ntpq -p
     remote           refid      st t when poll reach   delay   offset  jitter
==============================================================================
 0.ubuntu.pool.n .POOL.          16 p    -   64    0    0.000    0.000   0.000
 1.ubuntu.pool.n .POOL.          16 p    -   64    0    0.000    0.000   0.000
 2.ubuntu.pool.n .POOL.          16 p    -   64    0    0.000    0.000   0.000
 3.ubuntu.pool.n .POOL.          16 p    -   64    0    0.000    0.000   0.000
 ntp.ubuntu.com  .POOL.          16 p    -   64    0    0.000    0.000   0.000
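A reach value of 0 on every line means that no time server has answered a poll yet, which can be a hint that the node cannot reach the outside world. As a rough sketch, this check can be scripted. The function name `ntp_has_reachable_peer` is made up for this example, and it assumes the usual `ntpq -p` layout in which reach is the seventh column.

```shell
# Succeeds if at least one peer in `ntpq -p` output (read from stdin)
# has a non-zero reach value, i.e. answered at least one recent poll.
ntp_has_reachable_peer() {
    awk '
        /^=+$/ { body = 1; next }          # peers start after the ==== line
        body && $7 + 0 > 0 { found = 1 }   # column 7 is "reach" (octal)
        END { exit !found }
    '
}

# Usage on a node (not run here):
#   ntpq -p | ntp_has_reachable_peer || echo "no NTP peer reachable yet"
```

On the output shown above this check would fail, because only pool placeholder entries with reach 0 are listed.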
Firewalls? SELinux?
root@ip-172-16-100-169:/var/log# sudo ufw status
Status: inactive
root@ip-172-16-100-169:/var/log# apparmor_status
apparmor module is loaded.
0 profiles are loaded.
0 profiles are in enforce mode.
0 profiles are in complain mode.
0 processes have profiles defined.
0 processes are in enforce mode.
0 processes are in complain mode.
0 processes are unconfined but have a profile defined.
How about required software? Well, I won't go through the whole list, but I checked it and double-checked it.
Any ideas are very welcome. I would really like to see this cluster run.
Created 04-28-2017 09:42 AM
I've progressed a little further. I had overlooked the advice to create the /var/run/ambari-server/bootstrap and /var/run/ambari-server/stack-recommendations directories on the Ambari server node:
cd /var/run/ambari-server/
mkdir bootstrap
mkdir stack-recommendations
For good measure I also created the /var/run/ambari-server/bootstrap/1 directory.
cd /var/run/ambari-server/bootstrap/
mkdir 1
Now the cluster creation reports that it failed. In the log:
Command start time 2017-04-28 09:07:03
chmod: cannot access '/var/lib/ambari-agent/data': No such file or directory
So I created /var/lib/ambari-agent/data on all the nodes. Now the cluster installation gets as far as "Running create-python-wrap script". Then there's this error:
Cannot detect python for ambari to use. Please manually set link to point to correct python binary
When I do "which python" it doesn't find Python. I thought Ubuntu Server 16.04 came with Python installed. Guess I was wrong. To be continued...
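As far as I know, the Ubuntu 16.04 server images ship Python 3 as `python3` but no plain `python` binary, which would explain the error. A small sketch for checking which interpreter names exist on a node. The function name `first_available` and the candidate names in the usage line are my own choice, not something taken from the Ambari script.

```shell
# Prints the path of the first command name that exists on this machine
# and succeeds; fails without output if none of the names is found.
first_available() {
    for name in "$@"; do
        if command -v "$name" >/dev/null 2>&1; then
            command -v "$name"
            return 0
        fi
    done
    return 1
}

# Usage on a node (not run here):
#   first_available python python2.7 python2 || echo "no Python 2 found"
```

On Ubuntu 16.04 the `python` package (`sudo apt-get install python`) provides Python 2.7 together with the `python` command.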
Created 04-28-2017 10:37 AM
Okay, installed Python on all nodes. Time for the next error 🙂 . And the next error is...
==========================
Update apt cache of repository...
==========================
Command start time 2017-04-28 10:23:02
Reading package lists... 0%
Reading package lists... 0%
Reading package lists... 16%
Reading package lists... Done
E: Could not get lock /var/lib/apt/lists/lock - open (11: Resource temporarily unavailable)
E: Unable to lock directory /var/lib/apt/lists/
I'll have to check that out.
Created 04-28-2017 01:40 PM
Lock problem solved by running:
sudo rm /var/lib/apt/lists/* -vf
sudo apt-get update
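A gentler alternative to deleting the lists might be to wait until nothing holds the lock any more. This is a sketch under an assumption that is not confirmed anywhere in this thread: that the lock was held by another apt process that was still running (for example an automatic update shortly after boot). The helper name `wait_while` is made up, and `fuser` comes from the psmisc package.

```shell
# wait_while MAX_TRIES DELAY_SECONDS COMMAND [ARGS...]
# Re-runs COMMAND while it keeps succeeding. Succeeds as soon as COMMAND
# fails; gives up with a failure after MAX_TRIES successful runs.
wait_while() {
    tries=$1
    delay=$2
    shift 2
    while "$@" >/dev/null 2>&1; do
        tries=$((tries - 1))
        if [ "$tries" -le 0 ]; then
            return 1
        fi
        sleep "$delay"
    done
    return 0
}

# Usage on a node (not run here). fuser succeeds while some process
# has the lock file open:
#   wait_while 60 5 sudo fuser /var/lib/apt/lists/lock && sudo apt-get update
```

If the wait times out, `sudo fuser -v /var/lib/apt/lists/lock` shows which process is holding the file.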
And now the installation runs for quite some time until... it fails without error at this step:
Update apt cache of repository
In /var/run/ambari-server/bootstrap/11/bootstrap.err I find this:
INFO:root:BootStrapping hosts ['ip-172-16-100-169.eu-west-1.compute.internal', 'ip-172-16-100-164.eu-west-1.compute.internal', 'ip-172-16-100-26.eu-west-1.compute.internal', 'ip-172-16-100-139.eu-west-1.compute.internal', 'ip-172-16-100-128.eu-west-1.compute.internal', 'ip-172-16-100-116.eu-west-1.compute.internal'] using /usr/lib/python2.6/site-packages/ambari_server cluster primary OS: ubuntu16 with user 'root'with ssh Port '22' sshKey File /var/run/ambari-server/bootstrap/11/sshKey password File null using tmp dir /var/run/ambari-server/bootstrap/11 ambari: ip-172-16-100-12.eu-west-1.compute.internal; server_port: 8080; ambari version: 2.5.0.3; user_run_as: root
INFO:root:Executing parallel bootstrap
However, I find it hard to believe that connection- or key-related problems would show up now, after so much work has already been done on the other nodes.
Created 05-15-2017 07:56 AM
I have discussed these issues with a colleague and he told me the master and worker nodes must be able to connect to the Internet to do updates. For example, you must be able to ping hortonworks.com. In my cluster the master/worker nodes could not.
On my colleague's advice I installed a NAT gateway in the public subnet and connected it to the private subnet. I still couldn't ping the outside world. It turned out my route tables weren't set up correctly. After fixing them, my master/worker nodes could ping hortonworks.com.
The first installation ended in errors because my bootstrap directory was gone, and later I also needed to create the stack-recommendations directory. Keep these commands ready when installing HDP 2.6 (on Ubuntu anyway):
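Since apt fetches packages over HTTP rather than ICMP, an HTTP request is arguably a closer test of what the install needs than ping. A minimal sketch, assuming `curl` is installed on the nodes. The function name `can_reach` is hypothetical, and hortonworks.com in the usage line is simply the host mentioned above.

```shell
# Succeeds if an HTTP(S) request to the given URL gets a successful
# response within 10 seconds; fails on timeouts, refused connections,
# DNS failures and HTTP error codes (curl -f).
can_reach() {
    curl -fsSL --max-time 10 -o /dev/null "$1" 2>/dev/null
}

# Usage on a master/worker node (not run here):
#   can_reach http://hortonworks.com/ && echo "outbound HTTP works"
```

If this fails on a node in the private subnet, the route table of that subnet is the first thing to look at: its default route (0.0.0.0/0) should point at the NAT gateway.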
mkdir /var/run/ambari-server/bootstrap
chmod 777 /var/run/ambari-server/bootstrap
mkdir /var/run/ambari-server/stack-recommendations
chmod 777 /var/run/ambari-server/stack-recommendations
After all this is done, you can finally get your master/worker nodes registered.
But you're not done yet, because next comes the install of the HDP software. That's for another time.
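On Ubuntu 16.04, /var/run is a symlink to /run, which is a tmpfs, so anything created there disappears on reboot. That may be why the bootstrap directory was gone. Under that assumption, a small idempotent helper that can be re-run after every reboot is handy. This is only a sketch: the function name `ensure_ambari_run_dirs` is made up, and the 777 mode simply mirrors the commands above.

```shell
# Creates the two run-time directories under the given base directory
# (default: /var/run/ambari-server). Safe to run repeatedly: existing
# directories are left in place and only their mode is reset.
ensure_ambari_run_dirs() {
    base=${1:-/var/run/ambari-server}
    for dir in bootstrap stack-recommendations; do
        if ! mkdir -p "$base/$dir"; then
            return 1
        fi
        if ! chmod 777 "$base/$dir"; then
            return 1
        fi
    done
    return 0
}

# Usage on the Ambari server node, as root (not run here):
#   ensure_ambari_run_dirs
```

A mode tighter than 777 would probably also work, as long as the account running Ambari Server (ambari in my case) can write to both directories.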