
Ambari heartbeat lost after a restart of VM

Contributor

Dear experts,

I have installed a 3-node HDP cluster on Azure. Due to some problem, all the VMs restarted abruptly. Afterwards I manually restarted all the Ambari agents and the Ambari server. I can see that the agents and the server are running fine, but all of the services are in the "Lost heartbeat" state. Could you please assist?

Below is the log file of an Ambari agent on the master node:

INFO 2018-05-29 05:57:21,051 Controller.py:311 - Building heartbeat message
INFO 2018-05-29 05:57:21,053 Heartbeat.py:90 - Adding host info/state to heartbeat message.
INFO 2018-05-29 05:57:21,149 logger.py:75 - Testing the JVM's JCE policy to see it if supports an unlimited key length.
INFO 2018-05-29 05:57:21,149 logger.py:75 - Testing the JVM's JCE policy to see it if supports an unlimited key length.
INFO 2018-05-29 05:57:21,995 Hardware.py:176 - Some mount points were ignored: /, /dev, /dev/shm, /, /mnt/resource
INFO 2018-05-29 05:57:21,996 Controller.py:320 - Sending Heartbeat (id = 204)
INFO 2018-05-29 05:57:22,039 Controller.py:333 - Heartbeat response received (id = 205)
INFO 2018-05-29 05:57:22,040 Controller.py:342 - Heartbeat interval is 10 seconds
INFO 2018-05-29 05:57:22,040 Controller.py:380 - Updating configurations from heartbeat
INFO 2018-05-29 05:57:22,040 Controller.py:389 - Adding cancel/execution commands
INFO 2018-05-29 05:57:22,040 Controller.py:475 - Waiting 9.9 for next heartbeat
INFO 2018-05-29 05:57:31,941 Controller.py:482 - Wait for next heartbeat over
WARNING 2018-05-29 05:57:46,099 base_alert.py:138 - [Alert][namenode_cpu] Unable to execute alert. [Alert][namenode_cpu] Unable to extract JSON from JMX response
WARNING 2018-05-29 05:57:46,108 base_alert.py:138 - [Alert][datanode_health_summary] Unable to execute alert. [Alert][datanode_health_summary] Unable to extract JSON from JMX response
WARNING 2018-05-29 05:57:46,119 base_alert.py:138 - [Alert][namenode_service_rpc_processing_latency_hourly] Unable to execute alert. Couldn't define hadoop_conf_dir: argument of type 'NoneType' is not iterable
WARNING 2018-05-29 05:57:46,123 base_alert.py:138 - [Alert][namenode_client_rpc_queue_latency_hourly] Unable to execute alert. Couldn't define hadoop_conf_dir: argument of type 'NoneType' is not iterable
WARNING 2018-05-29 05:57:46,136 base_alert.py:138 - [Alert][namenode_client_rpc_processing_latency_hourly] Unable to execute alert. Couldn't define hadoop_conf_dir: argument of type 'NoneType' is not iterable
WARNING 2018-05-29 05:57:46,145 base_alert.py:138 - [Alert][namenode_directory_status] Unable to execute alert. [Alert][namenode_directory_status] Unable to extract JSON from JMX response
WARNING 2018-05-29 05:57:46,267 base_alert.py:138 - [Alert][namenode_service_rpc_queue_latency_hourly] Unable to execute alert. Couldn't define hadoop_conf_dir: argument of type 'NoneType' is not iterable
WARNING 2018-05-29 05:57:46,280 base_alert.py:138 - [Alert][yarn_resourcemanager_cpu] Unable to execute alert. [Alert][yarn_resourcemanager_cpu] Unable to extract JSON from JMX response
WARNING 2018-05-29 05:57:46,282 base_alert.py:138 - [Alert][yarn_resourcemanager_rpc_latency] Unable to execute alert. [Alert][yarn_resourcemanager_rpc_latency] Unable to extract JSON from JMX response
WARNING 2018-05-29 05:57:46,296 base_alert.py:138 - [Alert][smartsense_gateway_status] Unable to execute alert. [Alert][smartsense_gateway_status] Unable to extract JSON from JMX response
WARNING 2018-05-29 05:57:46,298 logger.py:71 - Cannot find the stack name in the command. Stack tools cannot be loaded
WARNING 2018-05-29 05:57:46,300 base_alert.py:138 - [Alert][smartsense_long_running_bundle] Unable to execute alert. [Alert][smartsense_long_running_bundle] Unable to extract JSON from JMX response
WARNING 2018-05-29 05:57:46,298 logger.py:71 - Cannot find the stack name in the command. Stack tools cannot be loaded
INFO 2018-05-29 05:57:46,303 logger.py:75 - call[('ambari-python-wrap', None, 'versions')] {}
INFO 2018-05-29 05:57:46,303 logger.py:75 - call[('ambari-python-wrap', None, 'versions')] {}
INFO 2018-05-29 05:57:46,712 logger.py:75 - Pid file /var/run/ambari-metrics-monitor/ambari-metrics-monitor.pid is empty or does not exist
INFO 2018-05-29 05:57:46,712 logger.py:75 - Pid file /var/run/ambari-metrics-monitor/ambari-metrics-monitor.pid is empty or does not exist
ERROR 2018-05-29 05:57:46,713 script_alert.py:123 - [Alert][ams_metrics_monitor_process] Failed with result CRITICAL: ['Ambari Monitor is NOT running on hdpmaster']
ERROR 2018-05-29 05:57:46,713 script_alert.py:123 - [Alert][ams_metrics_monitor_process] Failed with result CRITICAL: ['Ambari Monitor is NOT running on hdpmaster']

6 REPLIES

Re: Ambari heartbeat lost after a restart of VM

Mentor

@Chiranjeevi Nimmala

I think you shouldn't have started the agents manually. The log reports "Ambari Monitor is NOT running on hdpmaster" because there is no /var/run/ambari-metrics-monitor/ambari-metrics-monitor.pid file.
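The pid-file condition mentioned above can be checked by hand on the node. The following is a minimal sketch, not Ambari's own alert logic: the function name pid_file_alive is hypothetical, and it only assumes that the pid file holds a single numeric process ID.

```shell
#!/bin/bash
# Sketch only: pid_file_alive is a hypothetical helper, not part of Ambari.
# Returns 0 if the given pid file exists, is non-empty, contains a numeric
# PID, and that process can be signalled; returns non-zero otherwise.
pid_file_alive() {
  local pid_file="$1" pid
  # -s is true only when the file exists and is not empty.
  [ -s "$pid_file" ] || return 1
  # Strip whitespace and newlines around the PID.
  pid="$(tr -d '[:space:]' < "$pid_file")"
  # Reject anything that is not purely digits.
  case "$pid" in
    ''|*[!0-9]*) return 1 ;;
  esac
  # Signal 0 only tests whether the process exists; it also fails if the
  # caller lacks permission to signal it, so run this as root.
  kill -0 "$pid" 2>/dev/null
}
```

For example, `pid_file_alive /var/run/ambari-metrics-monitor/ambari-metrics-monitor.pid && echo running || echo "not running"` reports whether the file points at a live process.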

There is a start order for the Hadoop services, so use the Ambari UI Start All action instead of starting the services individually. Ambari usually starts automatically under Linux; if not, you can configure it to do so (see the example for RHEL/CentOS 7).


Re: Ambari heartbeat lost after a restart of VM

Contributor

@Geoffrey Shelton Okot

All my VMs are running SLES 11 SP4, and in Ambari the Start All option is disabled. It is showing "heartbeat lost".

Re: Ambari heartbeat lost after a restart of VM

Mentor

@Chiranjeevi Nimmala

Do the following steps. Start Ambari manually and ensure it starts correctly:

# ambari-server start

Check that Ambari is running:

# ambari-server status

Then start the HDP services using the API. Create a shell script start-all-services.sh with the contents below:

#!/bin/bash
# Start all HDP services
# Replace {ambari-server} and {your_cluster} (and the admin:admin credentials) with your own values.
curl -u admin:admin -H "X-Requested-By:ambari" -X PUT -d '{"RequestInfo":{"context": "Start all Services"},"Body":{"ServiceInfo":{"state":"STARTED"}}}' "http://{ambari-server}:8080/api/v1/clusters/{your_cluster}/services?ServiceInfo"

Then make it executable and start it from the CLI:

chmod +x start-all-services.sh
./start-all-services.sh

And let me know
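Before issuing the start request, it may help to see how the Ambari server itself views each host. The sketch below uses two hypothetical helper functions. It assumes the hosts resource exposes a Hosts/host_state field whose value is HEARTBEAT_LOST when the agent is not reporting, so verify this against your Ambari version. It uses grep rather than a JSON parser, so it is only a rough check.

```shell
#!/bin/bash
# Sketch only: hypothetical helpers for checking host heartbeat state
# through the Ambari REST API. Field and value names are assumptions.

# Print the URL that requests only the host_state field for every host.
# $1 = Ambari base URL (for example http://localhost:8080), $2 = cluster name.
hosts_state_url() {
  printf '%s/api/v1/clusters/%s/hosts?fields=Hosts/host_state\n' "$1" "$2"
}

# Read an Ambari JSON response on stdin and print how many hosts report
# HEARTBEAT_LOST. Matches the text "host_state" : "HEARTBEAT_LOST" with
# optional spaces around the colon.
count_lost_hosts() {
  grep -o '"host_state" *: *"HEARTBEAT_LOST"' | wc -l | tr -d ' '
}
```

A possible invocation, with the same placeholders as the script above replaced by real values: `curl -s -u admin:admin "$(hosts_state_url http://localhost:8080 mycluster)" | count_lost_hosts`. A non-zero result means the server still considers some agents unreachable, in which case the start request is unlikely to reach those hosts.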

Re: Ambari heartbeat lost after a restart of VM

Mentor

@Chiranjeevi Nimmala

Any updates?

Re: Ambari heartbeat lost after a restart of VM

Contributor

@Geoffrey Shelton Okot

Thanks for sharing the steps. However, I tried everything mentioned and none of it worked for me, so I reset Ambari and did a fresh install of everything. Since my cluster did not have any data, this procedure worked. It would have been very hard if the cluster had held data.

I even tried starting the metrics monitor manually with "/usr/sbin/ambari-metrics-monitor start". This created the pid file, but the Ambari server did not appear to recognize it.

I was hoping there might be some workaround that I missed. Thanks a lot for your time :)

Re: Ambari heartbeat lost after a restart of VM

Mentor

@Chiranjeevi Nimmala

Cheers, lessons learnt!

Unfortunately you didn't have enough time to resolve the issue. Imagine if you had hit this issue in a cluster with data.