
Heartbeat lost for all services

Explorer

Hi All,

We are using HDP 2.3. This morning when I came into the office, I saw that all services were in the UNKNOWN state. This is a QA cluster, so I tried a restart and a reboot, killing ambari-agent and ambari-server, and restarting PostgreSQL, but none of it helped.

Here is the screenshot and logs.

ambari.jpg

Logs are here

======================================================

WARN [ambari-hearbeat-monitor] HeartbeatMonitor:154 - Heartbeat lost from host localhost.localdomain
WARN [ambari-hearbeat-monitor] HeartbeatMonitor:169 - Setting component state to UNKNOWN for component METRICS_MONITOR on
WARN [ambari-hearbeat-monitor] HeartbeatMonitor:169 - Setting component state to UNKNOWN for component METRICS_COLLECTOR on
WARN [ambari-hearbeat-monitor] HeartbeatMonitor:169 - Setting component state to UNKNOWN for component HBASE_MASTER on
WARN [ambari-hearbeat-monitor] HeartbeatMonitor:169 - Setting component state to UNKNOWN for component HBASE_REGIONSERVER on
WARN [ambari-hearbeat-monitor] HeartbeatMonitor:169 - Setting component state to UNKNOWN for component PHOENIX_QUERY_SERVER
WARN [ambari-hearbeat-monitor] HeartbeatMonitor:169 - Setting component state to UNKNOWN for component SECONDARY_NAMENODE on
WARN [ambari-hearbeat-monitor] HeartbeatMonitor:169 - Setting component state to UNKNOWN for component DATANODE on
WARN [ambari-hearbeat-monitor] HeartbeatMonitor:169 - Setting component state to UNKNOWN for component NAMENODE on
WARN [ambari-hearbeat-monitor] HeartbeatMonitor:169 - Setting component state to UNKNOWN for component HIVE_SERVER on
WARN [ambari-hearbeat-monitor] HeartbeatMonitor:169 - Setting component state to UNKNOWN for component MYSQL_SERVER on
WARN [ambari-hearbeat-monitor] HeartbeatMonitor:169 - Setting component state to UNKNOWN for component HIVE_METASTORE on
WARN [ambari-hearbeat-monitor] HeartbeatMonitor:169 - Setting component state to UNKNOWN for component WEBHCAT_SERVER on
WARN [ambari-hearbeat-monitor] HeartbeatMonitor:169 - Setting component state to UNKNOWN for component KAFKA_BROKER on
WARN [ambari-hearbeat-monitor] HeartbeatMonitor:169 - Setting component state to UNKNOWN for component HISTORYSERVER on
WARN [ambari-hearbeat-monitor] HeartbeatMonitor:169 - Setting component state to UNKNOWN for component SPARK_JOBHISTORYSERVE
WARN [ambari-hearbeat-monitor] HeartbeatMonitor:169 - Setting component state to UNKNOWN for component NODEMANAGER on
WARN [ambari-hearbeat-monitor] HeartbeatMonitor:169 - Setting component state to UNKNOWN for component APP_TIMELINE_SERVER o
WARN [ambari-hearbeat-monitor] HeartbeatMonitor:169 - Setting component state to UNKNOWN for component RESOURCEMANAGER on
WARN [ambari-hearbeat-monitor] HeartbeatMonitor:169 - Setting component state to UNKNOWN for component ZEPPELIN_MASTER on
WARN [ambari-hearbeat-monitor] HeartbeatMonitor:169 - Setting component state to UNKNOWN for component ZOOKEEPER_SERVER on

=======================================================

Kindly suggest.

I am not sure whether the state was changed through the Ambari API. If so, how can I track or check that?

Thanks in advance.

Harshal

17 REPLIES

Mentor

Can you confirm ambari agent is up?

Explorer

Hi @Artem Ervits, the Ambari agent is up and running.

I am also getting the error below on restart for every Ambari component:

On host localhost.localdomain role HIVE_METASTORE in invalid state.
Invalid transition. Invalid event: HOST_SVCCOMP_OP_IN_PROGRESS at UNKNOWN

Mentor

@Harshal Joshi

An Ambari-managed cluster should be stopped gracefully, just like an Oracle database. A reboot is the equivalent of a shutdown abort in Oracle. When you reboot your cluster, it's advisable to start the components manually in this order: Ambari server, HDFS, YARN.

Otherwise have a look at this link

Explorer

Hi @Geoffrey Shelton Okot, I rebooted it only after the issue appeared. The cluster was already down.

On host localhost.localdomain role HIVE_METASTORE in invalid state.
Invalid transition. Invalid event: HOST_SVCCOMP_OP_IN_PROGRESS at UNKNOWN

I am also getting the error above for all components.

@Harshal Joshi

How many nodes are in the cluster? Is it a sandbox? Please check if the ambari-agent is indeed coming up. Compare /var/run/ambari-agent/ambari-agent.pid with the process running. Take a look at this article.
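The PID comparison suggested above can be scripted. This is a minimal sketch, assuming the PID file path quoted in the reply and a Linux host; the function name `agent_pid_is_alive` is made up for illustration. It only tells you that some process with that PID exists, not that it is actually the agent, so it is a quick first check rather than proof.

```python
import os

def agent_pid_is_alive(pid_file="/var/run/ambari-agent/ambari-agent.pid"):
    """Return True if the PID stored in pid_file belongs to a running process."""
    try:
        with open(pid_file) as fh:
            pid = int(fh.read().strip())
    except (OSError, ValueError):
        return False  # missing, unreadable, or malformed PID file
    if pid <= 0:
        return False  # not a valid single-process PID
    try:
        os.kill(pid, 0)  # signal 0 only tests for existence, nothing is killed
    except ProcessLookupError:
        return False  # stale PID file: no such process
    except PermissionError:
        return True  # process exists but is owned by another user
    return True
```

If this returns False while the agent is supposedly running, the PID file is stale and the agent should be stopped and started again.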

Explorer

Hi @vpoornalingam, the agent is up and running, and the PID matches. The cluster is a single-node cluster.

I am also getting this for every component:

On host localhost.localdomain role HIVE_METASTORE in invalid state.
Invalid transition. Invalid event: HOST_SVCCOMP_OP_IN_PROGRESS at UNKNOWN

Mentor

@Harshal Joshi

For the "host config is in invalid state" error, please have a look at this post. It has great APIs for changing the state of a service component: Link
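Before changing any state, it can help to read what Ambari currently reports for each component. The following is a rough sketch against the Ambari REST API's `host_components` resource. The server name, port 8080, cluster name, and credentials are placeholders you would have to replace, and the helper names are invented for this example.

```python
import base64
import json
import urllib.request

def build_request(server, cluster, user, password):
    """Build a GET request for the state of every host component in a cluster."""
    # Port 8080 and plain HTTP are assumptions; adjust for your installation.
    url = (f"http://{server}:8080/api/v1/clusters/{cluster}"
           "/host_components?fields=HostRoles/state")
    token = base64.b64encode(f"{user}:{password}".encode()).decode()
    return urllib.request.Request(url, headers={"Authorization": f"Basic {token}"})

def states_from_response(body):
    """Map (host_name, component_name) -> state from the JSON response body."""
    items = json.loads(body).get("items", [])
    return {
        (item["HostRoles"]["host_name"], item["HostRoles"]["component_name"]):
            item["HostRoles"]["state"]
        for item in items
    }

def fetch_states(server, cluster, user, password):
    """Perform the request; needs network access to the Ambari server."""
    request = build_request(server, cluster, user, password)
    with urllib.request.urlopen(request, timeout=10) as response:
        return states_from_response(response.read().decode())
```

Calling `fetch_states(...)` with your own values should return a dictionary in which the affected components show up as `UNKNOWN`.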

Expert Contributor

It seems that ambari-server has lost connection with ambari-agent somehow.

Try these steps :

1. Stop ambari-server

2. Stop ambari-agent service on all nodes

3. Start ambari-agent service on all nodes

4. Start ambari-server

Then view the ambari-server and ambari-agent logs and see whether they show any error other than the components being in the UNKNOWN state.
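Since the server log is flooded with the HeartbeatMonitor warnings, a small filter makes the remaining problems easier to spot. This is only a sketch: it assumes the log lines look like the ones quoted in the question, and the function name `other_problems` is hypothetical.

```python
import re

# The two HeartbeatMonitor messages quoted in the question; treated as noise here.
NOISE = re.compile(r"Heartbeat lost from host|Setting component state to\s+UNKNOWN")
# Only warnings and errors are of interest.
LEVEL = re.compile(r"\b(WARN|ERROR)\b")

def other_problems(log_lines):
    """Return WARN/ERROR lines that are not the heartbeat/UNKNOWN messages."""
    return [
        line.rstrip("\n")
        for line in log_lines
        if LEVEL.search(line) and not NOISE.search(line)
    ]
```

You would pass it an open log file, for example `other_problems(open("/var/log/ambari-server/ambari-server.log"))`; that path is the usual default location and may differ on your system.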

Explorer

Hi Harshal, were you able to fix this? I am facing the same issue.

Explorer

Any luck? I am facing the same issue.

Explorer

It seems IT changed the domain name. /etc/hosts and resolv.conf were updated to reflect the old FQDN, but restarting the cluster was failing with:

Failed on local exception: java.io.IOException: java.lang.IllegalArgumentException: Server has invalid Kerberos principal: nn/mxspdh10.amdocs.com@MXSPDH10.KERBEROS.COM; Host Details : local host is: "mxspdh10.mx.amdocs.com/135.208.66.57"; destination host is: "mxspdh10.amdocs.com":8020;

The local host is in .mx.amdocs.com, but amdocs.com was expected.

After the resolv.conf change I tried rebooting the cluster, but since then I am facing ambari-agent heartbeat issues, with the same error as shared in this thread. Any suggestions, please?

Rising Star

Is your cluster Kerberized?

If you change the domain, you need to create new keytabs for all services and reset all the options in Ambari.

New Contributor

You can try logging in as the admin user and restarting the DataNodes from the Actions bar on the Dashboard.

That worked for me. May work for you too.

Rising Star

I had the same error some time ago.

First, verify /etc/hosts. Then verify that the Ambari node is able to connect to all nodes and that all nodes are able to connect to the Ambari node (for example with ping or ssh). I then resolved it by resetting all agents (which I had stopped beforehand):

ambari-agent reset <Ambari-server-hostname> 

On the next restart, the agents started transmitting information successfully. I hope this helps.
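The name resolution and connectivity checks can also be done from a script on each agent node. This is a sketch only: the helper names are invented, and ports 8440 and 8441 are the ports agents normally use to reach the Ambari server by default, so treat them as an assumption and check your own configuration.

```python
import socket

def resolve(hostname):
    """Return the IPv4 address the local resolver gives for hostname, or None."""
    try:
        return socket.gethostbyname(hostname)
    except socket.gaierror:
        return None

def can_connect(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False  # refused, timed out, unreachable, or name not resolvable

def check_server(hostname, ports=(8440, 8441)):
    """Summarize resolution and port reachability for the Ambari server host."""
    return {
        "address": resolve(hostname),
        "ports": {port: can_connect(hostname, port) for port in ports},
    }
```

Running `check_server("<Ambari-server-hostname>")` on an agent node shows whether the name resolves and whether the agent ports are reachable. A `None` address points at /etc/hosts or DNS, while `False` for a port points at a firewall or a stopped server.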

Explorer

It's Ambari 1.6; reset is only available after Ambari 2.1.

I checked a few things.

I see the rows below in the Ambari PostgreSQL DB:

ambari=# select host_name from ambari.hosts;
       host_name
------------------------
 mxspdh16.amdocs.com
 mxspdh10.amdocs.com
 mxspdh18.amdocs.com
 mxspdh17.amdocs.com
 mxspdh10.mx.amdocs.com
(5 rows)

ambari=# select * from ambari.hoststate;
    agent_version    | available_mem | current_state |                health_status                 |       host_name        | time_in_state | maintenance_state
---------------------+---------------+---------------+----------------------------------------------+------------------------+---------------+-------------------
 {"version":"1.7.0"} |      31623084 | INIT          | {"healthStatus":"HEALTHY","healthReport":""} | mxspdh10.mx.amdocs.com | 1463041598424 |
 {"version":"1.7.0"} |      31792512 | INIT          | {"healthStatus":"UNKNOWN","healthReport":""} | mxspdh10.amdocs.com    | 1462368178266 | {"4":"OFF"}
 {"version":"1.7.0"} |      28241364 | INIT          | {"healthStatus":"HEALTHY","healthReport":""} | mxspdh16.amdocs.com    | 1463040523426 |
 {"version":"1.7.0"} |      28890788 | INIT          | {"healthStatus":"HEALTHY","healthReport":""} | mxspdh18.amdocs.com    | 1463040527465 |
 {"version":"1.7.0"} |      29281736 | INIT          | {"healthStatus":"HEALTHY","healthReport":""} | mxspdh17.amdocs.com    | 1463040528044 |
(5 rows)

ambari=# delete from ambari.hoststate where host_name='mxspdh10.mx.amdocs.com';
DELETE 1

I deleted both rows, but after restarting Ambari the two rows are populated again. Can anyone explain why mxspdh10.mx.amdocs.com keeps appearing?
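One way to narrow this down is to compare the names stored in the database with the name the host reports for itself. The sketch below rests on an assumption that is not confirmed in this thread: that the agent registers under the FQDN the operating system returns, so a host that still resolves itself into the .mx domain would re-register under that name after every restart. The function names are made up for illustration.

```python
import socket
from collections import defaultdict

def duplicate_short_names(host_names):
    """Group FQDNs by their first label; return groups with more than one FQDN."""
    groups = defaultdict(list)
    for name in host_names:
        groups[name.split(".", 1)[0].lower()].append(name)
    return {short: names for short, names in groups.items() if len(names) > 1}

def reported_fqdn():
    """Return the fully qualified name this machine reports for itself."""
    return socket.getfqdn()

# Host names taken from the ambari.hosts query above.
hosts = [
    "mxspdh16.amdocs.com",
    "mxspdh10.amdocs.com",
    "mxspdh18.amdocs.com",
    "mxspdh17.amdocs.com",
    "mxspdh10.mx.amdocs.com",
]
print(duplicate_short_names(hosts))
```

Running `reported_fqdn()` on the mxspdh10 machine itself shows which of the two names it currently reports. If it is the .mx one, the fix probably belongs in the host's name resolution rather than in the database.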

Check the following points on the host that lost its heartbeat:

  • Check the firewall status. It should be stopped.

service iptables stop

  • Check the /etc/ambari-agent/conf/ambari-agent.ini file.

In ambari-agent.ini, the hostname entry should be

hostname = ambariservernodehost

where ambariservernodehost is the host name of the Ambari server node. It should also be present in the /etc/hosts file.

  • The OpenSSL version should be upgraded.
  • Stop ambari-server
  • Stop the ambari-agent service on all nodes
  • Start the ambari-agent service on all nodes
  • Start ambari-server

Check the ambari-agent logs. If the problem is still there, please reply to me.
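The ambari-agent.ini and /etc/hosts checks from the list above can be sketched as follows. It assumes the hostname entry lives in a `[server]` section, which is how the stock file is laid out as far as I know. The function names are hypothetical.

```python
import configparser

def server_hostname(ini_text):
    """Return the hostname value from the [server] section, or None if absent."""
    parser = configparser.ConfigParser(interpolation=None, strict=False)
    parser.read_string(ini_text)
    return parser.get("server", "hostname", fallback=None)

def listed_in_hosts(name, hosts_text):
    """Return True if name appears as a host name or alias in /etc/hosts text."""
    for line in hosts_text.splitlines():
        fields = line.split("#", 1)[0].split()  # drop comments, split on whitespace
        if name in fields[1:]:  # fields[0] is the IP address
            return True
    return False
```

A typical use is to read both files, for example `name = server_hostname(open("/etc/ambari-agent/conf/ambari-agent.ini").read())`, and then check `listed_in_hosts(name, open("/etc/hosts").read())`. A `False` result only matters if the name is not resolvable through DNS either.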

New Contributor

Sometimes after an upgrade you need to check whether your ambari-agent and ambari-server versions are the same or at least compatible.
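As a rough illustration, the agent version can be read from the `agent_version` column of the `ambari.hoststate` table shown earlier in this thread and compared with the server version. Treating "same major.minor" as compatible is purely an assumption made for this sketch, and the function names are invented.

```python
import json

def agent_version(agent_version_column):
    """Extract the version string from a hoststate agent_version JSON value."""
    return json.loads(agent_version_column)["version"]

def same_release(server_version, agent_ver, parts=2):
    """Return True if the first `parts` dotted components of both versions match."""
    return server_version.split(".")[:parts] == agent_ver.split(".")[:parts]

agent = agent_version('{"version":"1.7.0"}')  # value copied from the table above
print(agent, same_release("1.6.0", agent))    # "1.6.0" stands in for a server version
```

In the output quoted above the agents report 1.7.0 while the poster describes the installation as Ambari 1.6, so a comparison like this would flag a mismatch worth looking into.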
