Created on 06-08-2018 03:39 AM - edited 08-17-2019 09:05 PM
Unfortunately, I terminated a slave instance in my hcp cluster that was hosting HiveServer2, the Hive Metastore and the MySQL DB. In the Ambari UI I am now getting a "heartbeat lost" issue for the services that were on that instance. To fix this, I tried adding a new host to bring back my services using Host -> Add new Host. I followed the steps below.
1 - Created a new EC2 instance - CentOS 7, same as my other instances.
2 - Ran yum update and added the EPEL repo.
3 - Set up passwordless authentication from the Ambari server to the new host.
4 - Filled in the step 1 parameters - private key and host IP for the new instance.
After step 4 the Ambari UI is stuck (screen capture - suck.png) and does not go on to the next step. I checked both the ambari-agent and ambari-server logs but couldn't find any issues. What could be the reason for this? How can I resolve it or investigate further?
Ambari agent log:
INFO 2018-06-08 03:30:29,933 Controller.py:304 - Heartbeat (response id = 30) with server is running...
INFO 2018-06-08 03:30:29,933 Controller.py:311 - Building heartbeat message
INFO 2018-06-08 03:30:29,934 Heartbeat.py:90 - Adding host info/state to heartbeat message.
INFO 2018-06-08 03:30:29,989 logger.py:75 - Testing the JVM's JCE policy to see it if supports an unlimited key length.
INFO 2018-06-08 03:30:30,001 Hardware.py:176 - Some mount points were ignored: /, /dev, /dev/shm, /run, /sys/fs/cgroup, /run/user/1000, /run/user/0
INFO 2018-06-08 03:30:30,001 Controller.py:320 - Sending Heartbeat (id = 30)
INFO 2018-06-08 03:30:30,003 Controller.py:333 - Heartbeat response received (id = 31)
INFO 2018-06-08 03:30:30,003 Controller.py:342 - Heartbeat interval is 10 seconds
INFO 2018-06-08 03:30:30,003 Controller.py:380 - Updating configurations from heartbeat
INFO 2018-06-08 03:30:30,003 Controller.py:389 - Adding cancel/execution commands
INFO 2018-06-08 03:30:30,003 Controller.py:406 - Adding recovery commands
INFO 2018-06-08 03:30:30,003 Controller.py:475 - Waiting 9.9 for next heartbeat
INFO 2018-06-08 03:30:39,904 Controller.py:482 - Wait for next heartbeat over
Ambari server log:
and will be failed
08 Jun 2018 03:26:00,590 INFO [ambari-action-scheduler] ActionScheduler:809 - Removing command from queue, host=ip-172-31-18-247.ec2.internal, commandId=1326-0
08 Jun 2018 03:26:00,590 WARN [ambari-action-scheduler] ExecutionCommandWrapper:225 - Unable to lookup the cluster by ID; assuming that there is no cluster and therefore no configs for this execution command: Cluster not found, clusterName=clusterID=-1
08 Jun 2018 03:26:01,593 WARN [ambari-action-scheduler] ActionScheduler:782 - Host: ip-172-31-18-247.ec2.internal, role: check_host, actionId: 1326-0 expired and will be failed
08 Jun 2018 03:26:01,595 INFO [ambari-action-scheduler] ActionScheduler:809 - Removing command from queue, host=ip-172-31-18-247.ec2.internal, commandId=1326-0
08 Jun 2018 03:26:01,595 WARN [ambari-action-scheduler] ExecutionCommandWrapper:225 - Unable to lookup the cluster by ID; assuming that there is no cluster and therefore no configs for this execution command: Cluster not found, clusterName=clusterID=-1
08 Jun 2018 03:26:02,077 INFO [qtp-ambari-agent-44] HeartBeatHandler:292 - HeartBeatHandler.sendCommands: sending ExecutionCommand for host ip-172-31-27-147.ec2.internal, role check_host, roleCommand ACTIONEXECUTE, and command ID 1326-0, task ID 12200
08 Jun 2018 03:26:02,599 WARN [ambari-action-scheduler] ActionScheduler:782 - Host: ip-172-31-18-247.ec2.internal, role: check_host, actionId: 1326-0 expired and will be failed
08 Jun 2018 03:26:02,601 INFO [ambari-action-scheduler] ActionScheduler:809 - Removing command from queue, host=ip-172-31-18-247.ec2.internal, commandId=1326-0
08 Jun 2018 03:26:02,601 WARN [ambari-action-scheduler] ExecutionCommandWrapper:225 - Unable to lookup the cluster by ID; assuming that there is no cluster and therefore no configs for this execution command: Cluster not found, clusterName=clusterID=-1
Created 06-30-2018 08:26 AM
Looks like ambari-server is stuck executing host checks on the host. You can restart ambari-server and ambari-agent with the -debug flag in the command. This will help in nailing down the problem further.
Created 07-02-2018 08:33 AM
Once you have started ambari-server in debug mode, please check for the following in the new host's agent log:
2018-06-20 17:30:03,018 - IP address forward resolution check started.
2018-06-20 17:30:03,018 - All hosts resolved to an IP address.
2018-06-20 17:30:03,018 - IP address forward resolution check completed.
2018-06-20 17:30:03,019 - Host checks completed.
2018-06-20 17:30:03,019 - Structured output: {'host_resolution_check': {'failed_count': 0, 'exit_code': 0, 'success_count': 21, 'failures': [], 'message': 'All hosts resolved to an IP address.', 'hosts_with_failures': []}}
2018-06-20 17:30:03,019 - Action afix 'post_actionexecute' not present
If it is still stuck at the host check, the agent log should contain:
2018-06-20 17:31:25,502 DEBUG [ambari-client-thread-86] BaseProvider:331 - Skipping property for resource as not in requestedIds, resourceType=Task, propertyId=Tasks/role, value=check_host
2018-06-20 17:31:25,503 DEBUG [ambari-client-thread-86] BaseProvider:331 - Skipping property for resource as not in requestedIds, resourceType=Task, propertyId=Tasks/command, value=ACTIONEXECUTE
2018-06-20 17:31:25,503 DEBUG [ambari-client-thread-86] BaseProvider:308 - Setting property for resource, resourceType=Task, propertyId=Tasks/status, value=QUEUED
Please report back what you see in the agent logs.
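If the agent log is long, a small script can tell you whether the host-check messages quoted above ever appear. This is only a rough sketch: the marker strings are copied from the excerpts in this thread, and the function name `scan_agent_log` is made up for illustration, not part of Ambari.

```python
from typing import Dict, Iterable

# Marker strings copied from the agent log excerpt in this thread.
# The dictionary keys are hypothetical labels chosen for this sketch.
MARKERS = {
    "resolution_check_completed": "IP address forward resolution check completed.",
    "host_checks_completed": "Host checks completed.",
}


def scan_agent_log(lines: Iterable[str]) -> Dict[str, bool]:
    """Report, per marker, whether any log line contains it."""
    found = {name: False for name in MARKERS}
    for line in lines:
        for name, text in MARKERS.items():
            if text in line:
                found[name] = True
    return found
```

To use it, open the agent log and pass the file object in, for example `scan_agent_log(open(path, errors="replace"))`. The path is an assumption: on a default install it is usually /var/log/ambari-agent/ambari-agent.log, but check your own setup. If "host_checks_completed" comes back False, the check never finished on the agent side.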
Created 07-02-2018 08:35 AM
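Also check the usual stuff, like:
1. /etc/hosts entries on the new host and on the Ambari server,
2. stop iptables,
3. telnet to port 8440 on the Ambari server from the agents,
4. confirm that a connection is established on port 8441.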
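If telnet is not installed on the new host, the same checks can be roughly sketched in Python using only the standard library. The function names here (`resolve`, `port_open`, `check_ambari_server`) are hypothetical helpers for this sketch, and the server hostname you pass in is whatever your Ambari server is called.

```python
import socket
from typing import Dict, Optional, Tuple


def resolve(hostname: str) -> Optional[str]:
    """Return the IPv4 address the name resolves to, or None on failure."""
    try:
        return socket.gethostbyname(hostname)
    except socket.gaierror:
        return None


def port_open(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and unreachable hosts.
        return False


def check_ambari_server(host: str, ports: Tuple[int, ...] = (8440, 8441)) -> Dict[int, bool]:
    """Check the two agent-facing ports mentioned above on the Ambari server."""
    return {port: port_open(host, port) for port in ports}
```

Run `resolve("<ambari-server-hostname>")` and `check_ambari_server("<ambari-server-hostname>")` from the new host. A None from `resolve` points at /etc/hosts or DNS, and a False for either port points at a firewall or security group rule, or at ambari-server not listening.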