
Ambari 2.5 cluster install hangs on BSHostStatusCollector:55 and BSHostStatusCollector:62 on Ubuntu Server 16.04

Contributor

I want to set up an HDP 2.6 cluster with Ambari 2.5 on AWS. I tried this before on RedHat 7.3, but found out that OS version is not in the support matrix. So I started over on Ubuntu 16.04, which is in all the support matrices.

I have Ambari Server running on an edge node and I have six Ubuntu Server 16.04 nodes ready as masters and workers. I believe I've done every step in the documentation (https://docs.hortonworks.com/HDPDocuments/Ambari-2.5.0.3/bk_ambari-installation/content/ch_Getting_Ready.html).

The nodes are accessible as root without a password. One place where I may have gone a bit off the beaten track: I installed Ambari Server under a different account than root, namely ambari.

I've used the Cluster Install Wizard to create my cluster. I've chosen version 2.6.0.3, tried both public and local repositories, and entered the list of nodes on separate lines. As the SSH private key I've used the id_rsa file from /root/.ssh on the edge node. Hope that's the correct one.
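
A quick way to verify it's the right key is to use it by hand from the edge node against one of the cluster nodes (BatchMode makes ssh fail instead of falling back to a password prompt):

ssh -i /root/.ssh/id_rsa -o BatchMode=yes root@ip-172-16-100-169.eu-west-1.compute.internal 'hostname -f'
# Should print the node's FQDN without any password prompt;
# "Permission denied (publickey)" would mean the wizard will fail the same way.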

When I start the install, the wizard says it's installing, but it never finishes. The nodes never get the ambari-agent package.

So let's check the logs to see what's actually happening. In /var/log/ambari-server/ambari-server.log I see these messages, repeating every ten seconds:

24 Apr 2017 06:54:08,179  INFO [pool-19-thread-1] BSHostStatusCollector:55 - Request directory /var/run/ambari-server/bootstrap/1
24 Apr 2017 06:54:08,179  INFO [pool-19-thread-1] BSHostStatusCollector:62 - HostList for polling on [ip-172-16-100-169.eu-west-1.compute.internal]
24 Apr 2017 06:54:18,179  INFO [pool-19-thread-1] BSHostStatusCollector:55 - Request directory /var/run/ambari-server/bootstrap/1
24 Apr 2017 06:54:18,180  INFO [pool-19-thread-1] BSHostStatusCollector:62 - HostList for polling on [ip-172-16-100-169.eu-west-1.compute.internal]

(This is the log where I try to add just one node to the cluster. With all six I get just about the same, but with more hosts mentioned.)
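
Those two INFO lines just mean the server is polling the request directory for results, so the directory itself seems worth a look (request id 1, matching the log above; I believe the bootstrap writes its output there):

ls -l /var/run/ambari-server/bootstrap/1/          # bootstrap output for this request
tail -f /var/run/ambari-server/bootstrap/1/bootstrap.err
# If the directory is missing or stays empty, the bootstrap never actually started.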

In /var/log/auth.log:

Apr 24 06:53:31 ip-172-16-100-12 sshd[2351]: Accepted publickey for ubuntu from 83.128.84.60 port 52221 ssh2: RSA<not sure it's safe to show that online>
Apr 24 06:53:31 ip-172-16-100-12 sshd[2351]: pam_unix(sshd:session): session opened for user ubuntu by (uid=0)
Apr 24 06:53:31 ip-172-16-100-12 systemd-logind[1170]: New session 3 of user ubuntu.
Apr 24 06:53:37 ip-172-16-100-12 sudo:  ubuntu : TTY=pts/1 ; PWD=/home/ubuntu ; USER=root ; COMMAND=/bin/su -
Apr 24 06:53:37 ip-172-16-100-12 sudo: pam_unix(sudo:session): session opened for user root by ubuntu(uid=0)
Apr 24 06:53:37 ip-172-16-100-12 su[2398]: Successful su for root by root
Apr 24 06:53:37 ip-172-16-100-12 su[2398]: + /dev/pts/1 root:root
Apr 24 06:53:37 ip-172-16-100-12 su[2398]: pam_unix(su:session): session opened for user root by ubuntu(uid=0)
Apr 24 06:53:37 ip-172-16-100-12 su[2398]: pam_systemd(su:session): Cannot create session: Already running in a session

Not sure what to make of that last message, although I've read that the pam_systemd complaint can be harmless when you su inside an existing SSH session.

At the node I want to make my first master node, in /var/log/auth.log:

Apr 24 06:53:45 ip-172-16-100-169 sshd[1398]: Accepted publickey for root from 172.16.100.12 port 40572 ssh2: RSA SHA256:<not sure it's safe to show that online>
Apr 24 06:53:45 ip-172-16-100-169 sshd[1398]: pam_unix(sshd:session): session opened for user root by (uid=0)
Apr 24 06:53:45 ip-172-16-100-169 systemd-logind[1166]: New session 1 of user root.
Apr 24 06:53:45 ip-172-16-100-169 systemd: pam_unix(systemd-user:session): session opened for user root by (uid=0)

So it looks like the connection to the intended master node is (mostly) successful?

I've already read other discussions from people with the same type of errors, and I've tried just about every piece of advice given there.

Adjusting the fully qualified domain name was one of them. So let's check that:

root@ip-172-16-100-169:/var/log# hostname
ip-172-16-100-169.eu-west-1.compute.internal
root@ip-172-16-100-169:/var/log# hostname -f
ip-172-16-100-169.eu-west-1.compute.internal
root@ip-172-16-100-169:/var/log# cat /etc/hostname
ip-172-16-100-169.eu-west-1.compute.internal
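
A related check I picked up elsewhere: as far as I know, the Ambari agent resolves its own hostname with Python's socket.getfqdn(), so that should agree with hostname -f as well (using python3 here, since that's what 16.04 ships with):

python3 -c 'import socket; print(socket.getfqdn())'
# Should print ip-172-16-100-169.eu-west-1.compute.internal, same as hostname -f.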

How about NTP?

root@ip-172-16-100-169:/var/log# sudo ntpq -p
     remote           refid      st t when poll reach   delay   offset  jitter
==============================================================================
 0.ubuntu.pool.n .POOL.          16 p    -   64    0    0.000    0.000   0.000
 1.ubuntu.pool.n .POOL.          16 p    -   64    0    0.000    0.000   0.000
 2.ubuntu.pool.n .POOL.          16 p    -   64    0    0.000    0.000   0.000
 3.ubuntu.pool.n .POOL.          16 p    -   64    0    0.000    0.000   0.000
 ntp.ubuntu.com  .POOL.          16 p    -   64    0    0.000    0.000   0.000
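
Note that reach is 0 for every server, so ntpd hasn't actually received any responses yet. That may just mean the daemon started recently, but if it stays at 0 the service needs a kick (service name on Ubuntu 16.04):

sudo systemctl enable ntp      # make sure it starts at boot
sudo systemctl restart ntp
ntpq -p                        # reach should start climbing: 1, 3, 7, 17, ...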

Firewalls? SELinux?

root@ip-172-16-100-169:/var/log# sudo ufw status
Status: inactive
root@ip-172-16-100-169:/var/log# apparmor_status
apparmor module is loaded.
0 profiles are loaded.
0 profiles are in enforce mode.
0 profiles are in complain mode.
0 processes have profiles defined.
0 processes are in enforce mode.
0 processes are in complain mode.
0 processes are unconfined but have a profile defined.

How about the required software? I won't go through the whole list here, but I've checked it and double-checked it.

Any ideas are very welcome. I would really like to see this cluster run.

4 REPLIES

Contributor

I've progressed a little further. I had overlooked the advice to create the /var/run/ambari-server/bootstrap and /var/run/ambari-server/stack-recommendations directories on the Ambari Server node:

cd /var/run/ambari-server/
mkdir bootstrap
mkdir stack-recommendations

For good measure I also created the /var/run/ambari-server/bootstrap/1 directory:

cd /var/run/ambari-server/bootstrap/
mkdir 1
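
One caveat with creating these as root: since I run ambari-server under the ambari account (see the original question), the server process must be able to write there, so the directories need to be writable for that user. Something like:

# Assuming ambari-server runs as the 'ambari' user (my non-standard setup):
sudo chown -R ambari: /var/run/ambari-server

Also worth knowing: /var/run is a symlink to /run, which is a tmpfs on Ubuntu 16.04, so these directories vanish on reboot and have to be recreated.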

Now the cluster creation reports that it failed. In the log:

Command start time 2017-04-28 09:07:03

chmod: cannot access '/var/lib/ambari-agent/data': No such file or directory

So I created /var/lib/ambari-agent/data on all the nodes. Now the cluster installation gets further, up to "Running create-python-wrap script". Then there's this error:

Cannot detect python for ambari to use. Please manually set  link to point to correct python binary

When I do "which python" it doesn't find Python. I thought Ubuntu Server 16.04 came with Python installed. Guess I was wrong. To be continued...

Contributor

Okay, installed Python on all nodes. Time for the next error 🙂. And the next error is...
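
For the record, this is all it took on each node; the python package on 16.04 pulls in Python 2.7 and provides the /usr/bin/python link that Ambari is looking for:

sudo apt-get install -y python   # installs python2.7 and the /usr/bin/python link
which python                     # should now print /usr/bin/python
python -V                        # should report Python 2.7.x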

==========================
Update apt cache of repository...
==========================

Command start time 2017-04-28 10:23:02

Reading package lists... 0%

Reading package lists... 0%

Reading package lists... 16%

Reading package lists... Done

E: Could not get lock /var/lib/apt/lists/lock - open (11: Resource temporarily unavailable)
E: Unable to lock directory /var/lib/apt/lists/

I'll have to check that out.
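
Before blowing anything away it seems worth checking what actually holds the lock; on a freshly booted 16.04 box it's often the apt-daily / unattended-upgrades job running in the background:

sudo fuser -v /var/lib/apt/lists/lock   # shows the process holding the lock
ps aux | grep -iE 'apt|unattended'      # look for apt.systemd.daily or unattended-upgrade
# Letting that run finish (or a reboot) also releases the lock.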

Contributor

Lock problem solved by running:

sudo rm -vf /var/lib/apt/lists/*
sudo apt-get update

And now the installation runs for quite some time until... it fails, without an error message, at this step:

Update apt cache of repository

In /var/run/ambari-server/bootstrap/11/bootstrap.err I find only these INFO lines:

INFO:root:BootStrapping hosts ['ip-172-16-100-169.eu-west-1.compute.internal', 'ip-172-16-100-164.eu-west-1.compute.internal', 'ip-172-16-100-26.eu-west-1.compute.internal', 'ip-172-16-100-139.eu-west-1.compute.internal', 'ip-172-16-100-128.eu-west-1.compute.internal', 'ip-172-16-100-116.eu-west-1.compute.internal'] using /usr/lib/python2.6/site-packages/ambari_server cluster primary OS: ubuntu16 with user 'root'with ssh Port '22' sshKey File /var/run/ambari-server/bootstrap/11/sshKey password File null using tmp dir /var/run/ambari-server/bootstrap/11 ambari: ip-172-16-100-12.eu-west-1.compute.internal; server_port: 8080; ambari version: 2.5.0.3; user_run_as: root

INFO:root:Executing parallel bootstrap

However, I find it hard to believe that connection or key problems would surface only now, after so much has already worked on the other nodes.
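
To rule out the key, one thing that can be tested by hand is the exact connection the bootstrap makes, using the key file named in bootstrap.err above (run on the Ambari Server node):

sudo ssh -i /var/run/ambari-server/bootstrap/11/sshKey -o BatchMode=yes root@ip-172-16-100-169.eu-west-1.compute.internal 'apt-get update -qq && echo OK'
# Prints OK if both the key and the node's access to the repositories work;
# hanging or failing here points at connectivity rather than SSH.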

Contributor

I discussed these issues with a colleague, who told me the master and worker nodes must be able to connect to the Internet to download packages. For example, you must be able to ping hortonworks.com. In my cluster the master/worker nodes could not.

On my colleague's advice I installed a NAT gateway in the public subnet and connected it to the private subnet. At first I still couldn't reach the outside world; it turned out my route tables weren't set up correctly. After fixing those, my master/worker nodes could ping hortonworks.com.
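
For anyone hitting the same thing on AWS: the missing piece was a default route in the private subnet's route table pointing at the NAT gateway. With the AWS CLI that's roughly the following (the IDs are placeholders for your own route table and NAT gateway):

aws ec2 create-route \
    --route-table-id rtb-xxxxxxxx \
    --destination-cidr-block 0.0.0.0/0 \
    --nat-gateway-id nat-xxxxxxxxxxxxxxxxx
# The NAT gateway itself sits in the public subnet (with an Elastic IP);
# the public subnet keeps its 0.0.0.0/0 route via the Internet gateway.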

The first installation still ended in errors, because my bootstrap directory was gone again (/var/run does not survive a reboot on Ubuntu 16.04), and later I also needed to recreate the stack-recommendations directory. Keep these commands ready when installing HDP 2.6 (on Ubuntu, anyway):

mkdir /var/run/ambari-server/bootstrap
chmod 777 /var/run/ambari-server/bootstrap
mkdir /var/run/ambari-server/stack-recommendations
chmod 777 /var/run/ambari-server/stack-recommendations

After all this is done, you can finally get your master/worker nodes registered.

But you're not done yet: next comes the installation of the HDP software itself. That's for another time.