Ambari shows some of the service components "heartbeat lost"

@Geoffrey Shelton Okot

1 System environment:

Two namenodes with HDFS HA configuration, 8 datanodes.

All nodes are CentOS 6.5 64-bit, with Python 2.6.6 and jdk1.8.0_144

The cluster is installed with HDP 2.6.5, in which the ambari-server version is 2.6.2.2.

Ambari-server is installed on one of the namenodes.

All 10 nodes, including the one running ambari-server, have ambari-agent installed.

All nodes are configured with FQDNs, without SSL setup.

Passwordless ssh login is also configured on ambari-server node.

The ambari-server and all ambari-agents run as user root.
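For completeness, this is roughly how I verified the FQDN setup and the agent configuration on each node (stock ambari-agent paths; the guard just keeps the snippet safe to paste on a box without the agent installed):

```shell
# Each node should report the same FQDN that was registered with Ambari.
hostname -f

# The agent must point at the ambari-server host; the [server] hostname is the
# key setting (path below is the default ambari-agent config location).
grep -A 2 '^\[server\]' /etc/ambari-agent/conf/ambari-agent.ini 2>/dev/null \
  || echo "no ambari-agent.ini on this machine"
```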

2 The problem:

2.1 On the ambari-server web UI, many service components show "Heartbeat Lost". (Not all of them; some services are OK.)

2.2 On the ambari-server web UI, I cannot perform any service component management. If I trigger a management action on the page, nothing happens except an operation page with all-yellow progress bars, and nothing is logged to file. Restarting a service, installing a new service, moving a service: none of them can be done. See fig2.2a and fig2.2b.

2.3 The ambari-server and all the ambari-agents are verified to be running, and all nodes can reach each other. A hadoop cluster is actually running (see 2.6 below); it seems that just ambari-server does not know it is.

2.4 I have tried restarting all the ambari-agents and the ambari-server, including a stop-start of the postgres service, with no effect.
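The bounce order I used was roughly the following (stock CentOS 6 service names, everything run as root; the one live command at the end is guarded so it is safe to paste on a machine without the agent):

```shell
# On every node: stop the agent first so it does not re-register mid-restart.
#   ambari-agent stop
# On the ambari-server node:
#   ambari-server stop
#   service postgresql restart   # the backing database
#   ambari-server start
# Then on every node:
#   ambari-agent start
# Quick check that the agent process is really up afterwards:
command -v ambari-agent >/dev/null && ambari-agent status \
  || echo "ambari-agent not installed on this machine"
```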

2.5 I commented out the jdk-1.7 lines, leaving only jdk-1.8 in the ambari.properties of the ambari-server, then restarted ambari-server and all the agents; no effect.

2.6 I have manually started hdfs and yarn on the cluster (as well as the JournalNodes), successfully. I can get node info on the HDFS NameNode web UI and the YARN ResourceManager UI, and I am able to upload files onto HDFS via copyFromLocal. I think I could launch a mapreduce job if needed. But the ambari-server page says only three NodeManagers are live for yarn, while on the ResourceManager web UI all 8 of them are up and running. See fig2.6.
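For reference, these are the checks I used against the native services, i.e. what HDFS and YARN themselves report rather than what Ambari shows (they need the hadoop client installed; the guards just make the snippet safe to paste elsewhere):

```shell
# Count of live datanodes as HDFS itself sees them:
command -v hdfs >/dev/null && hdfs dfsadmin -report | grep -c 'Name:' \
  || echo "hdfs client not on this machine"

# NodeManagers as the ResourceManager sees them (all 8 show up here):
command -v yarn >/dev/null && yarn node -list -all \
  || echo "yarn client not on this machine"
```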

2.7 The ambari-server log says some service states are UNKNOWN. I tried the REST API to set a service state directly to INSTALLED (I was planning to walk the states UNKNOWN - INSTALLED - STARTED if it worked), but got back an empty 200 OK response carrying an X-Frame-Options: DENY header, and nothing changed. In the example below, ZOOKEEPER_SERVER is actually up and running on that datanode, but ambari-server does not know it and shows "Heartbeat Lost" on the page. I have also checked the zookeeper log and did not get a clue.

# curl --user admin:admin -H "X-Requested-By: ambari" -i -X PUT -d '{"RequestInfo":{"context":"Install ZOOKEEPER_SERVER"},"Body":{"HostRoles":{"state":"INSTALLED"}}}' http://master0.hadoop.csm.cn:8080/api/v1/clusters/csmhadoop/hosts/datanode02.hadoop.csm.cn/host_comp...


HTTP/1.1 200 OK
X-Frame-Options: DENY
X-XSS-Protection: 1; mode=block
X-Content-Type-Options: nosniff
Cache-Control: no-store
Pragma: no-cache
Set-Cookie: AMBARISESSIONID=fasd2typkj4q17iiwnjmdpvtc;Path=/;HttpOnly
Expires: Thu, 01 Jan 1970 00:00:00 GMT
User: admin
Content-Type: text/plain
Content-Length: 0

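Before the PUT I also queried what ambari-server thinks the component state is. Sketched here as a dry-run: the helper function only prints the curl command (so the quoting is visible), using the same cluster/host names as above.

```shell
# Hypothetical helper: compose the GET that reads a host component's state
# from the Ambari REST API (prints the command instead of running it).
ambari_get_state() {
  local cluster=$1 host=$2 comp=$3
  printf 'curl -u admin:admin -H "X-Requested-By: ambari" "http://master0.hadoop.csm.cn:8080/api/v1/clusters/%s/hosts/%s/host_components/%s?fields=HostRoles/state"\n' \
    "$cluster" "$host" "$comp"
}

ambari_get_state csmhadoop datanode02.hadoop.csm.cn ZOOKEEPER_SERVER
```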

3 How it happened:

The cluster had been up and running for about a month.

I once flooded the cluster with several heavy mapreduce jobs, which caused the Ambari Metrics Collector service to fail according to the ambari-server web page. I tried to move the Ambari Metrics Collector to the standby namenode (which looked less burdened), and that is when the problems started. At first I lost one of the datanodes, the one from which the mapreduce jobs were launched. Then I tried to stop-start the ambari-server and the ambari-agents (on all nodes).

The outcome is confusing: the ambari-server web page says the HDFS service is not healthy, and the service components on the lost datanode cannot be found. So I restarted those components manually according to the hdp reference. Ambari-server still did not know they were up and running afterwards, but I can see them running on the native HDFS NameNode web UI and ResourceManager web UI, as mentioned previously.

After doing stop-start on the ambari-server and ambari-agents back and forth a couple of times, ambari-server finally got it right somehow (I don't understand why). But I still cannot perform any management action.

So what should I do to get things right? There are already 30TB of data on HDFS; if re-installing ambari-server and all the ambari-agents is the only solution, can that be done without any negative effects on the data and the HDFS cluster?
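In case a full reinstall turns out to be the answer: as far as I understand, Ambari's cluster state lives in its own database (not on HDFS), so at minimum I would dump that database first. Shown as a dry-run only; the database name below is the embedded-Postgres default ("ambari"), which may differ on a customized install:

```shell
# Print (not run) the backup command I would execute on the ambari-server node
# before touching anything. The $(date ...) is left unexpanded on purpose.
echo 'sudo -u postgres pg_dump ambari > /tmp/ambari-db-$(date +%F).sql'
```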

Attached are the server log at DEBUG log level and one of the agent logs. The cluster is HA configured with 8 datanodes.

ambari-server.zip

ambari-agent.zip