Created 04-17-2018 06:57 AM
Background: this is my first time installing Hadoop on a cluster single-handedly. The cluster is three AWS instances running Ubuntu 16 (one XL node and two L nodes), each with 40 GB of storage. I installed HDP 2.6.4, the latest version. The installation succeeded, Ambari started successfully, and I logged into the web admin portal. I followed all the steps in the documentation.
Problem: most services failed to start ("heartbeat lost"), and Ambari shows 39 alerts. I tried starting the services manually from the web portal and restarting the Ambari server; neither helped. Some of the alerts are below:
HDFS NameNode Web UI:
Connection failed to http://ec2-18-217-xxxx.us-east-2.compute.amazonaws.com:50070 (<urlopen error [Errno 111] Connection refused>)
Yarn App Timeline Web UI:
Connection failed to http://ec2-18-218-xxxx.us-east-2.compute.amazonaws.com:8188/ws/v1/timeline (<urlopen error [Errno 111] Connection refused>) CRIT
MapReduce2 History Server Process:
Connection failed: [Errno 111] Connection refused to ec2-18-218-xxxx.us-east-2.compute.amazonaws.com:19888
... and many more
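All three alerts fail with Errno 111 (ECONNREFUSED). That error means the target host answered but nothing was listening on that port; a packet silently dropped by a firewall or AWS security group usually shows up as a timeout instead. The Python sketch below (not part of Ambari; the function name `probe` is hypothetical) makes a plain TCP connection to a host and port and classifies the outcome, so each failing endpoint can be checked from every node.

```python
import socket


def probe(host: str, port: int, timeout: float = 3.0) -> str:
    """Try one TCP connection and classify the outcome.

    "open"       - something accepted the connection
    "refused"    - the host answered but nothing listens on the port (Errno 111)
    "timeout"    - no answer at all; often a firewall or security-group rule
    "unresolved" - the hostname could not be resolved
    "error:<n>"  - any other socket error, with its errno
    """
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return "open"
    except socket.gaierror:
        return "unresolved"
    except socket.timeout:
        return "timeout"
    except ConnectionRefusedError:
        return "refused"
    except OSError as exc:
        return f"error:{exc.errno}"
```

For example, call `probe("ec2-18-217-xxx-xx.us-east-2.compute.amazonaws.com", 50070)` with the masked part filled in. If the result is "refused" even when run on the NameNode host itself, the NameNode process is probably not running, and its log under /var/log/hadoop is the next place to look.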
Created 04-17-2018 12:03 PM
See this link; it should help you with HDP connectivity on AWS:
https://community.hortonworks.com/answers/105662/view.html
Hope that helps
Created 04-17-2018 04:54 PM
Thanks, taking a look.
Created 04-18-2018 03:26 AM
@Geoffrey Shelton Okot I reviewed the link. Thank you for taking a look at this.
The best answer in the link only points to the general AWS documentation on Elastic IPs and VPC DNS. I did not find either the cause of or a solution for the issue I described here.
I assigned Elastic IPs to my cluster before installing Ambari, and the installation completed with no issues. The services are not starting because of failed connections, some of which are listed in my original post. The same Elastic IPs have been assigned to the cluster since it was launched.
Is there a way to troubleshoot these connections? Why didn't the setup configure them automatically?
Is there a way (or a set of steps) to verify that everything on the cluster is configured properly for these connections to work?
Created 04-18-2018 09:12 AM
I would like to know the contents of your /etc/hosts file.
Did you configure passwordless SSH between your Ambari server and all the hosts in the cluster?
Created 04-18-2018 04:59 PM
@Geoffrey Shelton Okot thanks for responding and for the pointers.
I did configure and test passwordless SSH between the hosts in the cluster. I can ssh from any host in the cluster to any other without entering a password.
I did follow the procedure in the installation guide for setting up /etc/hosts, but I will post the contents of /etc/hosts later today.
Created 04-18-2018 07:52 PM
Just to rule out some doubts again: did you install the Ambari agents using Ambari, or manually? If manually, can you check that ambari-agent.ini has the correct entry (FQDN) for the Ambari server?
In your /etc/hosts, are you using the private or the public IPs? Can the IPs/hostnames be resolved by DNS?
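One way to answer the DNS question on each node is to look up every cluster hostname forward (name to IP) and then in reverse (IP back to name), and to compare the results across nodes. The sketch below uses only Python's standard socket module; `resolve` and `check_hosts` are hypothetical helper names, and the hostnames passed in are placeholders for your own. `socket.getfqdn()` is printed as a rough stand-in for what `hostname -f` reports on that node.

```python
import socket


def resolve(hostname: str):
    """Return (forward_ip, reverse_name); either is None if that lookup fails."""
    try:
        ip = socket.gethostbyname(hostname)  # IPv4 forward lookup (uses /etc/hosts, then DNS)
    except socket.gaierror:
        return None, None
    try:
        reverse_name = socket.gethostbyaddr(ip)[0]  # reverse lookup of that IP
    except (socket.herror, socket.gaierror):
        reverse_name = None
    return ip, reverse_name


def check_hosts(hostnames) -> None:
    """Print how each cluster hostname resolves from the current node."""
    print("this node reports its FQDN as", socket.getfqdn())
    for name in hostnames:
        ip, back = resolve(name)
        if ip is None:
            print(f"{name}: does NOT resolve")
        else:
            match = "matches" if back == name else f"differs ({back})"
            print(f"{name}: {ip}, reverse lookup {match}")
```

Running `check_hosts([...])` with the same three names on every node should give identical answers; a name that resolves differently on different nodes is a likely source of the connection failures.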
Created 04-19-2018 12:23 AM
I installed everything (including, I assume, the agents) using Ambari. To my knowledge I have not done any manual setup. Below are the contents of /etc/hosts and /etc/ambari-agent/conf/ambari-agent.ini.
Contents of /etc/hosts (I've masked the last two octets of each IP with x for security):
127.0.0.1 localhost
# The following lines are desirable for IPv6 capable hosts
::1 ip6-localhost ip6-loopback
fe00::0 ip6-localnet
ff00::0 ip6-mcastprefix
ff02::1 ip6-allnodes
ff02::2 ip6-allrouters
ff02::3 ip6-allhosts
18.217.xxx.xx ec2-18-217-xxx-xx.us-east-2.compute.amazonaws.com
18.218.xxx.xx ec2-18-218-xxx-xx.us-east-2.compute.amazonaws.com
52.15.xxx.xx ec2-52-15-xxx-xx.us-east-2.compute.amazonaws.com
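These entries map each public EC2 hostname to its public (Elastic) IP. On EC2, AWS translates an Elastic IP to the instance's private IP, and the Elastic IP is not configured on the instance's own network interface. A daemon asked to listen on a hostname that resolves to that public IP may therefore be unable to bind, which is one possible cause of the refused connections, not a confirmed diagnosis. The sketch below parses /etc/hosts and tests, by attempting a bind, whether each IPv4 address belongs to the local machine; the function names are hypothetical.

```python
import errno
import socket


def parse_hosts(text: str):
    """Yield (ip, [hostnames]) pairs from /etc/hosts-style text, ignoring comments."""
    for raw in text.splitlines():
        line = raw.split("#", 1)[0].strip()
        if not line:
            continue
        ip, *names = line.split()
        yield ip, names


def is_local_address(ip: str) -> bool:
    """True if an IPv4 address is configured on this machine (a bind to it succeeds)."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
        try:
            s.bind((ip, 0))  # port 0: let the OS pick; only the address matters here
        except OSError as exc:
            if exc.errno == errno.EADDRNOTAVAIL:  # address not on any local interface
                return False
            raise
    return True


def report(path: str = "/etc/hosts") -> None:
    """Print, for each IPv4 entry in the hosts file, whether the address is local."""
    with open(path) as fh:
        for ip, names in parse_hosts(fh.read()):
            if ":" in ip:
                continue  # this sketch skips IPv6 entries
            where = "local" if is_local_address(ip) else "not on this machine"
            print(f"{ip:<16} {where:<20} {' '.join(names)}")
```

Run `report()` on each node. With the file above, you would expect every Elastic IP, including the node's own, to show as "not on this machine". A commonly suggested alternative on AWS is to list the private IPs in /etc/hosts, so that each node's own hostname maps to an address on its own interface.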
========================================
The [server] hostname in ambari-agent.ini below is the AWS private DNS name of the node that the first Elastic IP in /etc/hosts above (18.217.xxx.xx) points to. I did not set it anywhere myself, so the Ambari setup process must have determined it automatically. What do you recommend? Should I manually change it to the Elastic IP from /etc/hosts on each node? Are there any other files I should check?
Contents of /etc/ambari-agent/conf/ambari-agent.ini:
[server]
hostname=ip-172-31-10-118.us-east-2.compute.internal
url_port=8440
secured_url_port=8441
connect_retry_delay=10
max_reconnect_retry_delay=30
[agent]
logdir=/var/log/ambari-agent
piddir=/var/run/ambari-agent
prefix=/var/lib/ambari-agent/data
;loglevel=(DEBUG/INFO)
loglevel=INFO
data_cleanup_interval=86400
data_cleanup_max_age=2592000
data_cleanup_max_size_MB =100
ping_port=8670
cache_dir=/var/lib/ambari-agent/cache
tolerate_download_failures=true
run_as_user=root
parallel_execution=0
alert_grace_period=5
status_command_timeout=5
alert_kinit_timeout=14400000
system_resource_overrides=/etc/resource_overrides
; memory_threshold_soft_mb=400
; memory_threshold_hard_mb=1000
; ignore_mount_points=/mnt/custom1,/mnt/custom2
[security]
keysdir=/var/lib/ambari-agent/keys
server_crt=ca.crt
passphrase_env_var_name=AMBARI_PASSPHRASE
ssl_verify_cert=0
credential_lib_dir=/var/lib/ambari-agent/cred/lib
credential_conf_dir=/var/lib/ambari-agent/cred/conf
credential_shell_cmd=org.apache.hadoop.security.alias.CredentialShell
[network]
; this option apply only for Agent communication
use_system_proxy_settings=true
[services]
pidLookupPath=/var/run/
[heartbeat]
state_interval_seconds=60
dirs=/etc/hadoop,/etc/hadoop/conf,/etc/hbase,/etc/hcatalog,/etc/hive,/etc/oozie,
  /etc/sqoop,
  /var/run/hadoop,/var/run/zookeeper,/var/run/hbase,/var/run/templeton,/var/run/oozie,
  /var/log/hadoop,/var/log/zookeeper,/var/log/hbase,/var/run/templeton,/var/log/hive
; 0 - unlimited
log_lines_count=300
idle_interval_min=1
idle_interval_max=10
[logging]
syslog_enabled=0
Created 04-20-2018 03:37 AM
I changed ambari-agent.ini on all the nodes, replacing the master node's private address with its Elastic IP, then stopped and started the Ambari server. This did not help; the issue persists.
Are there any other checks or troubleshooting steps I could try?
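Two things may be worth checking after that change. Ambari agents read ambari-agent.ini when they start, so an edit there probably takes effect only after `ambari-agent restart` on each node, not after restarting the server. Each agent must also be able to reach the server on url_port and secured_url_port (8440 and 8441 in the file above). The sketch below reads the [server] section with Python's configparser, then checks from the current node that the hostname resolves and that both ports accept a TCP connection. The helper names are hypothetical, and `allow_no_value=True` is an assumption added so that the multi-line `dirs` value still parses if its continuation lines lost their indentation.

```python
import configparser
import socket


def agent_server_target(ini_text: str):
    """Return (hostname, [url_port, secured_url_port]) from the [server] section."""
    # allow_no_value tolerates unindented continuation lines; interpolation=None
    # keeps any '%' characters literal.
    parser = configparser.ConfigParser(allow_no_value=True, interpolation=None, strict=False)
    parser.read_string(ini_text)
    server = parser["server"]
    ports = [server.getint("url_port", fallback=8440),
             server.getint("secured_url_port", fallback=8441)]
    return server.get("hostname"), ports


def check_agent_target(ini_path: str = "/etc/ambari-agent/conf/ambari-agent.ini") -> None:
    """Check that the configured Ambari server resolves and its ports accept connections."""
    with open(ini_path) as fh:
        host, ports = agent_server_target(fh.read())
    try:
        ip = socket.gethostbyname(host)
    except socket.gaierror:
        print(f"[server] hostname {host!r} does not resolve on this node")
        return
    for port in ports:
        try:
            with socket.create_connection((ip, port), timeout=3):
                print(f"{host} ({ip}) port {port}: reachable")
        except OSError as exc:
            print(f"{host} ({ip}) port {port}: NOT reachable ({exc})")
```

Running `check_agent_target()` on every node, including the Ambari server host itself, shows whether the value now in the file is one that all the agents can actually reach.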