Heartbeat lost for all services

avatar
Contributor

Hi All,

We are using HDP 2.3. When I came into the office this morning, I saw that all services were in the UNKNOWN state. This is a QA cluster, so I have already tried restarting and rebooting, killing ambari-agent and ambari-server, and restarting PostgreSQL, but none of it helps.

Here are the screenshot and logs.

ambari.jpg

Logs are here

======================================================

WARN [ambari-hearbeat-monitor] HeartbeatMonitor:154 - Heartbeat lost from host localhost.localdomain
WARN [ambari-hearbeat-monitor] HeartbeatMonitor:169 - Setting component state to UNKNOWN for component METRICS_MONITOR on
WARN [ambari-hearbeat-monitor] HeartbeatMonitor:169 - Setting component state to UNKNOWN for component METRICS_COLLECTOR on
WARN [ambari-hearbeat-monitor] HeartbeatMonitor:169 - Setting component state to UNKNOWN for component HBASE_MASTER on
WARN [ambari-hearbeat-monitor] HeartbeatMonitor:169 - Setting component state to UNKNOWN for component HBASE_REGIONSERVER on
WARN [ambari-hearbeat-monitor] HeartbeatMonitor:169 - Setting component state to UNKNOWN for component PHOENIX_QUERY_SERVER
WARN [ambari-hearbeat-monitor] HeartbeatMonitor:169 - Setting component state to UNKNOWN for component SECONDARY_NAMENODE on
WARN [ambari-hearbeat-monitor] HeartbeatMonitor:169 - Setting component state to UNKNOWN for component DATANODE on
WARN [ambari-hearbeat-monitor] HeartbeatMonitor:169 - Setting component state to UNKNOWN for component NAMENODE on
WARN [ambari-hearbeat-monitor] HeartbeatMonitor:169 - Setting component state to UNKNOWN for component HIVE_SERVER on
WARN [ambari-hearbeat-monitor] HeartbeatMonitor:169 - Setting component state to UNKNOWN for component MYSQL_SERVER on
WARN [ambari-hearbeat-monitor] HeartbeatMonitor:169 - Setting component state to UNKNOWN for component HIVE_METASTORE on
WARN [ambari-hearbeat-monitor] HeartbeatMonitor:169 - Setting component state to UNKNOWN for component WEBHCAT_SERVER on
WARN [ambari-hearbeat-monitor] HeartbeatMonitor:169 - Setting component state to UNKNOWN for component KAFKA_BROKER on
WARN [ambari-hearbeat-monitor] HeartbeatMonitor:169 - Setting component state to UNKNOWN for component HISTORYSERVER on
WARN [ambari-hearbeat-monitor] HeartbeatMonitor:169 - Setting component state to UNKNOWN for component SPARK_JOBHISTORYSERVE
WARN [ambari-hearbeat-monitor] HeartbeatMonitor:169 - Setting component state to UNKNOWN for component NODEMANAGER on
WARN [ambari-hearbeat-monitor] HeartbeatMonitor:169 - Setting component state to UNKNOWN for component APP_TIMELINE_SERVER o
WARN [ambari-hearbeat-monitor] HeartbeatMonitor:169 - Setting component state to UNKNOWN for component RESOURCEMANAGER on
WARN [ambari-hearbeat-monitor] HeartbeatMonitor:169 - Setting component state to UNKNOWN for component ZEPPELIN_MASTER on
WARN [ambari-hearbeat-monitor] HeartbeatMonitor:169 - Setting component state to UNKNOWN for component ZOOKEEPER_SERVER on

=======================================================
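A log like this is easier to read once it is condensed to "which host lost its heartbeat" and "which components were flipped to which state". Below is a minimal Python sketch that does this for the two HeartbeatMonitor message shapes quoted above. The function name `summarize_heartbeat_log` and the sample text are made up for illustration; the regexes only assume the wording seen in this log.

```python
import re

# Patterns for the two message shapes quoted in the log above.
# \s+ tolerates messages that were wrapped across lines when pasted.
LOST_RE = re.compile(r"Heartbeat\s+lost\s+from\s+host\s+(\S+)")
STATE_RE = re.compile(
    r"Setting\s+component\s+state\s+to\s+(\w+)\s+for\s+component\s+(\w+)"
)


def summarize_heartbeat_log(text):
    """Return (sorted hosts that lost heartbeat, {state: [components]})."""
    hosts = sorted(set(LOST_RE.findall(text)))
    by_state = {}
    for state, component in STATE_RE.findall(text):
        by_state.setdefault(state, []).append(component)
    return hosts, by_state


# Hypothetical excerpt in the same format as the pasted log.
sample = (
    "WARN [ambari-hearbeat-monitor] HeartbeatMonitor:154 - "
    "Heartbeat lost from host localhost.localdomain\n"
    "WARN [ambari-hearbeat-monitor] HeartbeatMonitor:169 - "
    "Setting component state to UNKNOWN for component DATANODE on\n"
    "WARN [ambari-hearbeat-monitor] HeartbeatMonitor:169 - "
    "Setting component state to UNKNOWN for component NAMENODE on\n"
)

hosts, by_state = summarize_heartbeat_log(sample)
print(hosts)
print(by_state)
```

If the only host reported is localhost.localdomain, as in the log above, it may be worth checking which hostname the agent registered under, though that is only a guess from the log text.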

Kindly suggest.

I am not sure whether the state was changed through the Ambari API. If so, how can I track or check that?

Thanks in advance.

Harshal

17 REPLIES

avatar
Explorer

Any luck? I am facing the same issue.

avatar
Explorer

It seems IT changed the domain name. /etc/hosts and resolv.conf were updated to reflect the old FQDN, but restarting the cluster was failing with:

Failed on local exception: java.io.IOException: java.lang.IllegalArgumentException: Server has invalid Kerberos principal: nn/mxspdh10.amdocs.com@MXSPDH10.KERBEROS.COM; Host Details : local host is: "mxspdh10.mx.amdocs.com/135.208.66.57"; destination host is: "mxspdh10.amdocs.com":8020;

The local host is in .mx.amdocs.com, while the expected domain was amdocs.com.

After the resolv.conf change I tried rebooting the cluster, but since then I am facing ambari-agent heartbeat issues, with the same error as shared in this thread. Suggestions, please.
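The mismatch in that exception can be checked mechanically: take the host part of the Kerberos principal and compare it with the FQDN the machine reports for itself. The sketch below is only an illustration under that assumption; `principal_host` and `matches_local_fqdn` are hypothetical helper names, and `socket.getfqdn()` is used as a stand-in for whatever name resolution Hadoop performs on the node.

```python
import socket


def principal_host(principal):
    """Return the host part of a 'service/host@REALM' principal, or None."""
    name, _, _realm = principal.partition("@")
    _service, sep, host = name.partition("/")
    return host if sep and host else None


def matches_local_fqdn(principal, local_fqdn=None):
    """True if the principal's host equals the local FQDN (case-insensitive).

    local_fqdn defaults to socket.getfqdn(); pass it explicitly to test
    a name without running on that machine.
    """
    if local_fqdn is None:
        local_fqdn = socket.getfqdn()
    host = principal_host(principal)
    return host is not None and host.lower() == local_fqdn.lower()


# Values taken from the exception quoted above.
principal = "nn/mxspdh10.amdocs.com@MXSPDH10.KERBEROS.COM"
print(principal_host(principal))
print(matches_local_fqdn(principal, "mxspdh10.mx.amdocs.com"))
```

Calling `matches_local_fqdn(principal)` with no second argument on the affected node shows whether the name the OS resolves for itself still differs from the one in the keytab.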

avatar
Expert Contributor

Is your cluster Kerberized?

If you change the domain, you need to create new keytabs for all services and reset all the options in Ambari.

avatar
New Contributor

You can try logging in as the admin user and restarting the DataNodes from the Actions bar in the Dashboard.

That worked for me. May work for you too.

avatar
Expert Contributor

I had the same error some time ago.

First, verify /etc/hosts, then verify that the Ambari server node is able to connect to all nodes and that all nodes are able to connect to the Ambari server node (for example with ping or ssh). I then resolved it by resetting all agents (which I had stopped beforehand):

ambari-agent reset <Ambari-server-hostname> 

At the next restart the agents started transmitting information successfully. I hope this helps.
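For the connectivity part, ping and ssh only prove the host is reachable, not that the agent can reach the server's ports. A rough Python sketch of a TCP check from an agent node follows. The ports 8440 and 8441 are an assumption here (they are the values commonly found as url_port and secured_url_port in ambari-agent.ini; check your own file), and the helper names are made up for illustration.

```python
import socket


def can_connect(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and name resolution failures.
        return False


def check_agent_ports(server, ports=(8440, 8441), timeout=3.0):
    """Map each port to whether it is reachable on the given server.

    The default ports are an assumption; use the values from your
    own ambari-agent.ini.
    """
    return {port: can_connect(server, port, timeout) for port in ports}


# Example call, with a placeholder for your Ambari server hostname:
# print(check_agent_ports("<Ambari-server-hostname>"))
```

A False for a port that should be open points at a firewall or name resolution problem rather than at the agent itself.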

avatar
Explorer

It's Ambari 1.6; reset is only available from Ambari 2.1 onward.

I checked a few things:

I see the rows below in the Ambari PostgreSQL DB:

ambari=# select host_name from ambari.hosts;
       host_name
------------------------
 mxspdh16.amdocs.com
 mxspdh10.amdocs.com
 mxspdh18.amdocs.com
 mxspdh17.amdocs.com
 mxspdh10.mx.amdocs.com
(5 rows)

ambari=# select * from ambari.hoststate ;
    agent_version    | available_mem | current_state |                health_status                 |       host_name        | time_in_state | maintenance_state
---------------------+---------------+---------------+----------------------------------------------+------------------------+---------------+-------------------
 {"version":"1.7.0"} |      31623084 | INIT          | {"healthStatus":"HEALTHY","healthReport":""} | mxspdh10.mx.amdocs.com | 1463041598424 |
 {"version":"1.7.0"} |      31792512 | INIT          | {"healthStatus":"UNKNOWN","healthReport":""} | mxspdh10.amdocs.com    | 1462368178266 | {"4":"OFF"}
 {"version":"1.7.0"} |      28241364 | INIT          | {"healthStatus":"HEALTHY","healthReport":""} | mxspdh16.amdocs.com    | 1463040523426 |
 {"version":"1.7.0"} |      28890788 | INIT          | {"healthStatus":"HEALTHY","healthReport":""} | mxspdh18.amdocs.com    | 1463040527465 |
 {"version":"1.7.0"} |      29281736 | INIT          | {"healthStatus":"HEALTHY","healthReport":""} | mxspdh17.amdocs.com    | 1463040528044 |
(5 rows)

ambari=# delete from ambari.hoststate where host_name='mxspdh10.mx.amdocs.com';
DELETE 1

I deleted both rows, but when Ambari is restarted these two rows get populated again. Any idea why we are getting mxspdh10.mx.amdocs.com?
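With more hosts than this, a duplicate registration like the one above is easy to miss by eye. Here is a small Python sketch that groups the host_name values by their short name and reports any short name registered under more than one FQDN. The function name `duplicate_short_names` is hypothetical; the list is the output of the `select host_name from ambari.hosts` query above.

```python
from collections import defaultdict


def duplicate_short_names(host_names):
    """Return {short_name: [fqdns]} for short names that appear more than once."""
    groups = defaultdict(list)
    for name in host_names:
        short = name.split(".", 1)[0].lower()
        groups[short].append(name)
    return {short: names for short, names in groups.items() if len(names) > 1}


# Host names as returned by the query above.
hosts = [
    "mxspdh16.amdocs.com",
    "mxspdh10.amdocs.com",
    "mxspdh18.amdocs.com",
    "mxspdh17.amdocs.com",
    "mxspdh10.mx.amdocs.com",
]

print(duplicate_short_names(hosts))
```

A row that comes back after being deleted suggests an agent is still registering under that name, so the hostname the agent on that node reports is probably the thing to fix rather than the table.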

avatar
  • Make sure of the following points on the host that lost its heartbeat.
  • Check the firewall status. It should be stopped.

service iptables stop

  • Check the /etc/ambari-agent/conf/ambari-agent.ini file.

In the ambari-agent.ini file, the hostname entry should be

hostname = ambariservernodehost

ambariservernodehost should be present in the /etc/hosts file.

  • The OpenSSL version should be upgraded.
  • Stop ambari-server
  • Stop the ambari-agent service on all nodes
  • Start the ambari-agent service on all nodes
  • Start ambari-server

Check the ambari-agent logs. If the problem persists, please reply to me.
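The ambari-agent.ini and /etc/hosts checks from the list above can be sketched in Python as follows. This assumes the hostname entry lives in a [server] section of ambari-agent.ini, which is how the file is usually laid out; the function names and the sample file contents are made up for illustration.

```python
import configparser


def server_hostname(ini_text):
    """Return the hostname entry of the [server] section, or None if absent."""
    # interpolation=None keeps literal '%' characters from raising errors;
    # strict=False tolerates duplicate keys.
    parser = configparser.ConfigParser(interpolation=None, strict=False)
    parser.read_string(ini_text)
    return parser.get("server", "hostname", fallback=None)


def hosts_file_names(hosts_text):
    """Return every hostname and alias listed in /etc/hosts style text."""
    names = set()
    for line in hosts_text.splitlines():
        line = line.split("#", 1)[0].strip()  # drop comments
        if not line:
            continue
        fields = line.split()
        names.update(fields[1:])  # first field is the IP address
    return names


# Hypothetical file contents; on a real node read
# /etc/ambari-agent/conf/ambari-agent.ini and /etc/hosts instead.
sample_ini = "[server]\nhostname=ambariservernodehost\nurl_port=8440\n"
sample_hosts = (
    "127.0.0.1 localhost localhost.localdomain\n"
    "10.0.0.5 ambariservernodehost  # Ambari server\n"
)

name = server_hostname(sample_ini)
print(name, name in hosts_file_names(sample_hosts))
```

This only confirms that the name in the agent config is listed in the hosts file; it does not prove the listed IP address is the right one.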

avatar

Sometimes after an upgrade you need to check whether your ambari-agent and ambari-server versions are the same, or at least compatible.