
Ambari agent registration of HDF cluster fails in spite of exitcode 0. Setup of RHEL on MS Azure.

Explorer

==========================

Creating target directory...
==========================

Command start time 2018-05-16 06:08:52
chmod: cannot access ‘/var/lib/ambari-agent/data’: No such file or directory

Warning: Permanently added 'mtvm6.eastus.cloudapp.azure.com,40.117.251.23' (ECDSA) to the list of known hosts.
Connection to mtvm6.eastus.cloudapp.azure.com closed.
SSH command execution finished
host=mtvm6.eastus.cloudapp.azure.com, exitcode=0
Command end time 2018-05-16 06:08:52

==========================
Copying ambari sudo script...
==========================

Command start time 2018-05-16 06:08:52

scp /var/lib/ambari-server/ambari-sudo.sh
host=mtvm6.eastus.cloudapp.azure.com, exitcode=0
Command end time 2018-05-16 06:08:53

==========================
Copying common functions script...
==========================

Command start time 2018-05-16 06:08:53

scp /usr/lib/python2.6/site-packages/ambari_commons
host=mtvm6.eastus.cloudapp.azure.com, exitcode=0
Command end time 2018-05-16 06:08:53

==========================
Copying create-python-wrap script...
==========================

Command start time 2018-05-16 06:08:53

scp /var/lib/ambari-server/create-python-wrap.sh
host=mtvm6.eastus.cloudapp.azure.com, exitcode=0
Command end time 2018-05-16 06:08:54

==========================
Copying OS type check script...
==========================

Command start time 2018-05-16 06:08:54

scp /usr/lib/python2.6/site-packages/ambari_server/os_check_type.py
host=mtvm6.eastus.cloudapp.azure.com, exitcode=0
Command end time 2018-05-16 06:08:54

==========================
Running create-python-wrap script...
==========================

Command start time 2018-05-16 06:08:54

Connection to mtvm6.eastus.cloudapp.azure.com closed.
SSH command execution finished
host=mtvm6.eastus.cloudapp.azure.com, exitcode=0
Command end time 2018-05-16 06:08:55

==========================
Running OS type check...
==========================

Command start time 2018-05-16 06:08:55
Cluster primary/cluster OS family is redhat7 and local/current OS family is redhat7

Connection to mtvm6.eastus.cloudapp.azure.com closed.
SSH command execution finished
host=mtvm6.eastus.cloudapp.azure.com, exitcode=0
Command end time 2018-05-16 06:08:55

==========================
Checking 'sudo' package on remote host...
==========================

Command start time 2018-05-16 06:08:55
sudo-1.8.19p2-11.el7_4.x86_64

Connection to mtvm6.eastus.cloudapp.azure.com closed.
SSH command execution finished
host=mtvm6.eastus.cloudapp.azure.com, exitcode=0
Command end time 2018-05-16 06:08:56

==========================
Copying repo file to 'tmp' folder...
==========================

Command start time 2018-05-16 06:08:56

scp /etc/yum.repos.d/ambari.repo
host=mtvm6.eastus.cloudapp.azure.com, exitcode=0
Command end time 2018-05-16 06:08:57

==========================
Moving file to repo dir...
==========================

Command start time 2018-05-16 06:08:57

Connection to mtvm6.eastus.cloudapp.azure.com closed.
SSH command execution finished
host=mtvm6.eastus.cloudapp.azure.com, exitcode=0
Command end time 2018-05-16 06:08:57

==========================
Changing permissions for ambari.repo...
==========================

Command start time 2018-05-16 06:08:57

Connection to mtvm6.eastus.cloudapp.azure.com closed.
SSH command execution finished
host=mtvm6.eastus.cloudapp.azure.com, exitcode=0
Command end time 2018-05-16 06:08:57

==========================
Copying setup script file...
==========================

Command start time 2018-05-16 06:08:57

scp /usr/lib/python2.6/site-packages/ambari_server/setupAgent.py
host=mtvm6.eastus.cloudapp.azure.com, exitcode=0
Command end time 2018-05-16 06:08:58

==========================
Running setup agent script...
==========================

Command start time 2018-05-16 06:08:58
("INFO 2018-05-16 06:09:18,024 main.py:145 - loglevel=logging.INFO
INFO 2018-05-16 06:09:18,024 main.py:145 - loglevel=logging.INFO
INFO 2018-05-16 06:09:18,024 main.py:145 - loglevel=logging.INFO
INFO 2018-05-16 06:09:18,025 DataCleaner.py:39 - Data cleanup thread started
INFO 2018-05-16 06:09:18,027 DataCleaner.py:120 - Data cleanup started
INFO 2018-05-16 06:09:18,027 DataCleaner.py:122 - Data cleanup finished
INFO 2018-05-16 06:09:18,028 hostname.py:67 - agent:hostname_script configuration not defined thus read hostname 'mtvm6.eastus.cloudapp.azure.com' using socket.getfqdn().
INFO 2018-05-16 06:09:18,035 PingPortListener.py:50 - Ping port listener started on port: 8670
INFO 2018-05-16 06:09:18,038 main.py:437 - Connecting to Ambari server at https://myhdf.eastus.cloudapp.azure.com:8440 (104.211.60.99)
INFO 2018-05-16 06:09:18,038 NetUtil.py:70 - Connecting to https://myhdf.eastus.cloudapp.azure.com:8440/ca
", None)
("INFO 2018-05-16 06:09:18,024 main.py:145 - loglevel=logging.INFO
INFO 2018-05-16 06:09:18,024 main.py:145 - loglevel=logging.INFO
INFO 2018-05-16 06:09:18,024 main.py:145 - loglevel=logging.INFO
INFO 2018-05-16 06:09:18,025 DataCleaner.py:39 - Data cleanup thread started
INFO 2018-05-16 06:09:18,027 DataCleaner.py:120 - Data cleanup started
INFO 2018-05-16 06:09:18,027 DataCleaner.py:122 - Data cleanup finished
INFO 2018-05-16 06:09:18,028 hostname.py:67 - agent:hostname_script configuration not defined thus read hostname 'mtvm6.eastus.cloudapp.azure.com' using socket.getfqdn().
INFO 2018-05-16 06:09:18,035 PingPortListener.py:50 - Ping port listener started on port: 8670
INFO 2018-05-16 06:09:18,038 main.py:437 - Connecting to Ambari server at https://myhdf.eastus.cloudapp.azure.com:8440 (104.211.60.99)
INFO 2018-05-16 06:09:18,038 NetUtil.py:70 - Connecting to https://myhdf.eastus.cloudapp.azure.com:8440/ca
", None)

Connection to mtvm6.eastus.cloudapp.azure.com closed.
SSH command execution finished
host=mtvm6.eastus.cloudapp.azure.com, exitcode=0
Command end time 2018-05-16 06:09:20

Registering with the server...
Registration with the server failed.
1 ACCEPTED SOLUTION

Mentor

@Matthias Tewordt

I am happy you have succeeded. Next time you can now help someone with the setup of HDF in Azure 🙂
Yes, the database could be set up on any node, but as you already have Postgres installed for Ambari, keeping the other databases on the same host makes management easier.

CAUTION:

In production, consider setting up database replication.

Once you have finished the setup, if you found this answer addressed your question, please take a moment to log in and click the "Accept" link on the answer.

Keep me posted


50 REPLIES

Mentor

@Matthias Tewordt

Can you create this directory and ensure the permissions are correct?

mkdir -p /var/lib/ambari-agent/data

Then re-run the cluster setup

Explorer

A new attempt produced these logs:

- What does "Ambari-agent received 15 signal, stopping..." mean?

- How can I best troubleshoot "NetUtil.py:101 - Failed to connect to https://myhdf.eastus.cloudapp.azure.com:8440/ca due to [Errno 4] Interrupted system call"?

==========================
Running setup agent script...
==========================

Command start time 2018-05-16 14:32:42
("INFO 2018-05-16 14:32:44,985 main.py:145 - loglevel=logging.INFO
INFO 2018-05-16 14:32:44,993 HeartbeatHandlers.py:84 - Ambari-agent received 15 signal, stopping...
WARNING 2018-05-16 14:32:44,993 NetUtil.py:101 - Failed to connect to https://myhdf.eastus.cloudapp.azure.com:8440/ca due to [Errno 4] Interrupted system call  
WARNING 2018-05-16 14:32:44,993 NetUtil.py:124 - Server at https://myhdf.eastus.cloudapp.azure.com:8440 is not reachable, sleeping for 10 seconds...
INFO 2018-05-16 14:32:44,994 HeartbeatHandlers.py:116 - Stop event received
INFO 2018-05-16 14:32:44,994 NetUtil.py:130 - Stop event received
INFO 2018-05-16 14:32:44,994 ExitHelper.py:56 - Performing cleanup before exiting...
INFO 2018-05-16 14:32:44,994 ExitHelper.py:70 - Cleanup finished, exiting with code:0 

INFO 2018-05-16 14:32:45,016 main.py:283 - Agent died gracefully, exiting.
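
On the two questions above: signal 15 is SIGTERM, the default signal sent by `kill`, so something asked the agent process to stop. `[Errno 4] Interrupted system call` is EINTR, which is what a blocking call (here, the connection attempt) reports when a signal arrives while it is in progress. The sketch below only shows how to map a signal number to its name from a shell; `signal_name` is a hypothetical helper, not part of Ambari.

```shell
# Hypothetical helper: print the name of a signal number using the shell's kill builtin.
signal_name() {
    kill -l "$1"
}

signal_name 15   # prints TERM
```

What sent the SIGTERM is not visible in these logs, so the connection warning may be a side effect of the shutdown rather than its cause.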

Explorer

Thanks, Geoffrey, for your suggestion. I ran mkdir -p /var/lib/ambari-agent/data with chmod 770 on all 3 nodes. It didn't really help. This is the head of the "Running setup agent script" log of mtvm6:

Again, mtvm6 failed to connect to myhdf.

==========================
Running setup agent script...
==========================

Command start time 2018-05-16 18:18:03
("INFO 2018-05-16 18:18:05,472 main.py:145 - loglevel=logging.INFO
INFO 2018-05-16 18:18:05,478 HeartbeatHandlers.py:84 - Ambari-agent received 15 signal, stopping...
WARNING 2018-05-16 18:18:05,478 NetUtil.py:101 - Failed to connect to https://myhdf.eastus.cloudapp.azure.com:8440/ca due to [Errno 4] Interrupted system call  
WARNING 2018-05-16 18:18:05,478 NetUtil.py:124 - Server at https://myhdf.eastus.cloudapp.azure.com:8440 is not reachable, sleeping for 10 seconds...
INFO 2018-05-16 18:18:05,478 HeartbeatHandlers.py:116 - Stop event received
INFO 2018-05-16 18:18:05,478 NetUtil.py:130 - Stop event received
INFO 2018-05-16 18:18:05,478 ExitHelper.py:56 - Performing cleanup before exiting...

WARNING 2018-05-16 18:18:06,261 NetUtil.py:124 - Server at https://myhdf.eastus.cloudapp.azure.com:8440 is not reachable, sleeping for 10 seconds...

Mentor

@Matthias Tewordt

Could you take a backup of cert-verification.cfg?

cp  /etc/python/cert-verification.cfg /etc/python/cert-verification.cfg.bak 

Then set verify=disable in /etc/python/cert-verification.cfg, or create the file if it does not exist:

sed -i 's/verify=platform_default/verify=disable/' /etc/python/cert-verification.cfg 
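
The `sed` command above does nothing when the file is missing or the value is not `platform_default`. A minimal sketch that covers those cases is shown below; `disable_cert_verification` is a hypothetical helper, and the `[https]` section header written for a fresh file is an assumption based on the layout this file usually has on RHEL 7.

```shell
# Hypothetical helper: set verify=disable in a cert-verification.cfg style file.
# The path is a parameter so the function can be tried on a scratch copy first.
disable_cert_verification() {
    local cfg="${1:-/etc/python/cert-verification.cfg}"
    # Keep a backup of whatever was there before.
    if [ -f "$cfg" ]; then
        cp -p "$cfg" "${cfg}.bak"
    fi
    if [ -f "$cfg" ] && grep -q '^verify=' "$cfg"; then
        # Replace the existing value, whatever it is.
        sed -i 's/^verify=.*/verify=disable/' "$cfg"
    else
        # No file or no verify line: write a fresh file (assumed layout).
        printf '[https]\nverify=disable\n' > "$cfg"
    fi
}
```

Run as root with no argument to act on the real file. Note that this turns off HTTPS certificate checks for Python on that host, so it is best treated as a diagnostic step.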

Retry and let me know

Explorer

Geoffrey, I have updated the cert-verification.cfg as proposed. Still fails with same error messages:

Command start time 2018-05-16 20:33:33
("INFO 2018-05-16 20:33:35,736 main.py:145 - loglevel=logging.INFO
INFO 2018-05-16 20:33:35,742 HeartbeatHandlers.py:84 - Ambari-agent received 15 signal, stopping...
WARNING 2018-05-16 20:33:35,742 NetUtil.py:101 - Failed to connect to https://myhdf.eastus.cloudapp.azure.com:8440/ca due to [Errno 4] Interrupted system call  
WARNING 2018-05-16 20:33:35,743 NetUtil.py:124 - Server at https://myhdf.eastus.cloudapp.azure.com:8440 is not reachable, sleeping for 10 seconds...
INFO 2018-05-16 20:33:35,743 HeartbeatHandlers.py:116 - Stop event received
INFO 2018-05-16 20:33:35,743 NetUtil.py:130 - Stop event received 

INFO 2018-05-16 20:33:35,743 ExitHelper.py:56 - Performing cleanup before exiting...

Mentor

@Matthias Tewordt

If you have only 3 nodes in your cluster, we could try manual registration. But before we go in that direction, did you follow the documented steps for preparing the environment?

Just to be sure: if you completed the above, adapt the steps below to your specific OS. I assume your repos are correctly set up and accessible; you can validate with

# yum repolist

You should be able to see the HDP, HDP-UTILS and Ambari repos.

Install ambari-agent on all nodes, including the Ambari node:

# yum install -y ambari-agent

Edit ambari-agent.ini, located in /etc/ambari-agent/conf, on all the hosts, for example:

[server] 
hostname={Ambari_FQDN} 
url_port=8440 
secured_url_port=8441 
connect_retry_delay=10 
max_reconnect_retry_delay=30

The hostname should be the Ambari server FQDN.
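
A minimal sketch for applying that edit on each host without opening an editor. `set_agent_server_hostname` is a hypothetical helper; it assumes the file already has a `[server]` section containing a `hostname=` line, as in the example above.

```shell
# Hypothetical helper: set hostname= inside the [server] section of ambari-agent.ini.
# Usage: set_agent_server_hostname AMBARI_FQDN [INI_PATH]
set_agent_server_hostname() {
    local fqdn="$1"
    local ini="${2:-/etc/ambari-agent/conf/ambari-agent.ini}"
    # Limit the substitution to the lines from [server] up to the next section header.
    sed -i "/^\[server\]/,/^\[/ s/^hostname=.*/hostname=${fqdn}/" "$ini"
}
```

For example, `set_agent_server_hostname myhdf.eastus.cloudapp.azure.com` would point the agent at the server named in this thread.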

Start the agent

# ambari-agent start

The above should be done on all 3 nodes; ensure the agents started successfully.

Log on to the Ambari UI and follow the previous steps. In the host registration step, choose manual registration; you won't need the generated SSH key.

The process should complete with GREEN, and from there you can proceed with your deployment.

Please report back.


Explorer

Geoffrey, thanks a lot for your help. I have followed every step in your list and then chose manual registration in the Ambari UI. It failed again with the following message:

Registering with the server... Registration with the server failed.

Explorer

Geoffrey, I noted that a manual install on the master actually was successful. Can we conclude that there are connectivity problems? ping -c3 10.0.0.5 etc. works, but I could not successfully telnet from a node to the master. What kind of checks are needed, from your perspective, beyond a successful ssh root@node?

Thanks again, Matthias

Mentor

Definitely a connectivity problem.

Check your /etc/hosts entries and the DNS resolution.
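
A minimal sketch of those checks, to be run from each node against the Ambari server. Ports 8440 and 8441 are the ones that appear in the agent configuration quoted in this thread; `resolve_host` and `check_port` are hypothetical helper names, and the `/dev/tcp` redirection is a bash feature, so bash is assumed to be installed.

```shell
# Hypothetical helper: print the address(es) a name resolves to through the
# system resolver, following the lookup order in /etc/nsswitch.conf.
resolve_host() {
    getent hosts "$1"
}

# Hypothetical helper: succeed only if a TCP connection to host:port opens within 5 seconds.
check_port() {
    timeout 5 bash -c "exec 3<>/dev/tcp/$1/$2" 2>/dev/null
}

# Example (replace the host name with your Ambari server FQDN):
#   resolve_host myhdf.eastus.cloudapp.azure.com
#   for p in 8440 8441; do
#       check_port myhdf.eastus.cloudapp.azure.com "$p" && echo "$p open" || echo "$p blocked"
#   done
```

If the name resolves and ping works but the ports are reported as blocked, that may point toward a firewall or network security rule rather than name resolution.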

Explorer

/etc/hosts looks alright on all nodes

host -v -t A mtvm6.eastus.cloudapp.azure.com yields mtvm6.eastus.cloudapp.azure.com. 10 IN A 40.117.251.23 etc.

so to me this looks ok ...

Mentor

Back to the basics: did you validate the points below?

Also check the DNS resolution. The manual registration was successful on the Ambari host but failed on the 2 other nodes because it just can't locate them.

Can you upload the ambari-server.log ?

Explorer

Geoffrey, all done. How can I best upload the ambari-server.log from the Azure service? It is a huge file.

Mentor

@Matthias Tewordt

Can you copy it to your local machine and trim it to the last 200 or so lines? Alternatively, upload it to an external website and provide the download link, because the file formats and sizes allowed here are restricted.
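
A minimal sketch of the trimming step. `trim_log` is a hypothetical helper, and the log path in the example is the usual default location, which is an assumption about this installation.

```shell
# Hypothetical helper: copy the last N lines (default 200) of a log file to a new file.
# Usage: trim_log SOURCE DEST [LINES]
trim_log() {
    tail -n "${3:-200}" "$1" > "$2"
}

# Example, assuming the default Ambari server log location:
#   trim_log /var/log/ambari-server/ambari-server.log /tmp/ambari-server-tail.log
```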

Explorer

Hi Geoffrey, thanks so much. Can you please share your email address? So I will share the link to Google Drive with Ambari log files. Thanks, Matthias

Explorer

Geoffrey, the Ambari server log is attached. Can you open it?

ambariserverlog.zip

Thanks, Matthias

Mentor

@Matthias Tewordt

That's quite bizarre. In the logs you attached I see that Ambari is trying to load 2 different repos (see below). Can you explain why?

Centos6

Could not load version definition for HDP-2.6 identified by http://public-repo-1.hortonworks.com/HDP/centos6/2.x/updates/2.6.4.0/HDP-2.6.4.0-91.xml. null 

Centos7

Could not load version definition for HDP-2.6 identified by http://public-repo-1.hortonworks.com/HDP/centos7/2.x/updates/2.6.5.0/HDP-2.6.5.0-292.xml. null

Check the current repos; you should see the ambari, hdp and hdp-utils repos:

$ ll /etc/yum.repos.d/

Validate the repo files and share their contents:

  • cat /etc/yum.repos.d/HDP.repo
  • cat /etc/yum.repos.d/ambari.repo
  • cat /etc/yum.repos.d/HDP-UTILS.repo

Check the OS version so you can grab the correct repo; in my case it's CentOS 6:

cat /etc/redhat-release 
CentOS release 6.9 (Final)
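
A minimal sketch for turning that check into the matching repo URL. Both helpers are hypothetical; the URL is only the pattern quoted in this thread with the OS major version and HDP version substituted, so whether a given combination exists on the server is not verified here.

```shell
# Hypothetical helper: extract the major release number from a redhat-release style file.
os_major_version() {
    sed -n 's/.*release \([0-9][0-9]*\).*/\1/p' "${1:-/etc/redhat-release}"
}

# Hypothetical helper: build the hdp.repo URL from the pattern quoted in this thread.
# Arguments: OS major version, HDP version (for example 7 and 2.6.4.0).
hdp_repo_url() {
    echo "http://public-repo-1.hortonworks.com/HDP/centos$1/2.x/updates/$2/hdp.repo"
}

# Example:
#   hdp_repo_url "$(os_major_version)" 2.6.4.0
```

On a RHEL 7 host this yields the centos7 path, which is the same repository family used for CentOS 7.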

Delete the HDP and HDP-UTILS repos in /etc/yum.repos.d:

rm -rf /etc/yum.repos.d/HDP*

Clean the repos

# yum clean all

Validate; you should no longer see HDP and HDP-UTILS, only ambari.repo and the CentOS repos:

# yum repolist 

For Ambari version 2.5 and above, see the choices in the attached ambari-hdp-matrix.jpg.

Download the correct HDP repos

See the OS version above (in my case CentOS 6). For HDP 2.6, hdp.repo delivers both HDP and HDP-UTILS:

$ wget -nv http://public-repo-1.hortonworks.com/HDP/centos6/2.x/updates/2.6.4.0/hdp.repo -O /tmp/hdp.repo

GPL repo

$ wget -nv http://public-repo-1.hortonworks.com/HDP-GPL/centos6/2.x/updates/2.6.4.0/hdp.gpl.repo -O /tmp/hdp.gpl.repo

After placing the downloaded files in /etc/yum.repos.d, revalidate; you should be able to see the new HDP* repos:

# yum repolist

Now restart the cluster deployment


ambari-hdp-matrix.jpg

Explorer

Geoffrey, I am running RHEL 7. So don't you think I should opt for the CentOS 7 repos?

Mentor

@Matthias Tewordt

Yes, definitely. That's why I wanted you to match my output to your OS version!

There was also this error. What is your processor?

"cannot resolve OS centos7-ppc to the supported ones: suse12,suse11,redhat7,debian7,redhat6,ubuntu14,ubuntu12. Family: null"
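
The `centos7-ppc` in that error suggests a PowerPC repository definition was picked up somewhere, which is an inference and not confirmed in the thread. A quick sketch for confirming the machine architecture on a node; `is_ppc` is a hypothetical helper.

```shell
# Hypothetical helper: succeed if the given machine string (default: output of
# uname -m) names a PowerPC variant such as ppc64 or ppc64le.
is_ppc() {
    case "${1:-$(uname -m)}" in
        ppc*) return 0 ;;
        *)    return 1 ;;
    esac
}

# Example:
#   is_ppc && echo "PowerPC host" || echo "not PowerPC: $(uname -m)"
```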

Please revert

Explorer

Here is a list of the files in yum.repos.d. Why remove all HDP repos? I suggest removing the 2 public-repo-1... repos and keeping hdp.repo (see the RHEL 7 content below).

-rw-r--r--. 1 root root 306 May 31 2017 ambari.repo
-rw-r--r--. 1 root root 574 Jan 8 06:49 hdp.repo
-rw-r--r--. 1 root root 296 May 17 06:36 public-repo-1.hortonworks.com_HDP_centos7_2.x_updates_2.6.4.0_HDP-2.6.4.0-91.xml.repo
-rw-r--r--. 1 root root 242 May 17 06:09 public-repo-1.hortonworks.com_HDP-UTILS-1.1.0.22_repos_centos7.repo
-rw-r--r--. 1 root root 358 Jan 5 00:08 redhat.repo
-rw-r--r--. 1 root root 13955 Dec 3 2016 rh-cloud.repo

Here is the content of hdp.repo. It looks alright to me?

#VERSION_NUMBER=2.6.4.0-91
[HDP-2.6.4.0]
name=HDP Version - HDP-2.6.4.0
baseurl=http://public-repo-1.hortonworks.com/HDP/centos7/2.x/updates/2.6.4.0
gpgcheck=1
gpgkey=http://public-repo-1.hortonworks.com/HDP/centos7/2.x/updates/2.6.4.0/RPM-GPG-KEY/RPM-GPG-KEY-Jenkins
enabled=1
priority=1

[HDP-UTILS-1.1.0.22]
name=HDP-UTILS Version - HDP-UTILS-1.1.0.22
baseurl=http://public-repo-1.hortonworks.com/HDP-UTILS-1.1.0.22/repos/centos7
gpgcheck=1
gpgkey=http://public-repo-1.hortonworks.com/HDP/centos7/2.x/updates/2.6.4.0/RPM-GPG-KEY/RPM-GPG-KEY-Jenkins
enabled=1
priority=1

Explorer

Regarding the processor: it is a DS1 v2 (1 vCPU 3.5GB): Dv2-Series instances are based on the latest generation 2.4 GHz Intel Xeon® E5-2673 v3 (Haswell) processor, and with Intel Turbo Boost Technology 2.0 can go to 3.2 GHz. Dv2-Series and D-Series are ideal for applications that demand faster CPUs, better local disk performance, or higher memories, and offer a powerful combination for many enterprise-grade applications.
