Created 11-11-2017 04:58 PM
Unable to start the Ambari agent. I'm getting "heartbeat lost" for all the services on this server. Since it is the primary NameNode, I can't determine the status of the services on it. When I run ambari-agent start/restart, the agent starts and then stops suddenly. When I grep for ambari in the running processes, the entries below are listed, but the agent is not actually running. How can I start the Ambari agent?
root 2970771 1 0 Nov08 ? 00:00:00 /usr/bin/python2.6 /usr/lib/python2.6/site-packages/ambari_agent/AmbariAgent.py start
root 2970779 2970771 0 Nov08 ? 00:21:24 /usr/bin/python2.6 /usr/lib/python2.6/site-packages/ambari_agent/main.py start
Symptoms:
Using Python version 2.6.
The logs don't say anything other than the following; the agent has actually stopped logging.
ValueError: Unknown format code 'd' for object of type 'float'
INFO 2017-11-10 15:45:48,904 DataCleaner.py:120 - Data cleanup started
INFO 2017-11-10 15:45:48,908 DataCleaner.py:122 - Data cleanup finished
WARNING 2017-11-10 15:46:42,230 base_alert.py:140 - [Alert][ams_metrics_monitor_process] Unable to execute alert. Unable to find 'AMBARI_METRICS/package/alerts/alert_ambari_metrics_monitor.py' as an absolute path or part of /var/lib/ambari-agent/cache/stacks or /var/lib/ambari-agent/cache/host_scripts
WARNING 2017-11-10 15:47:42,220 base_alert.py:140 - [Alert][ams_metrics_monitor_process] Unable to execute alert. Unable to find 'AMBARI_METRICS/package/alerts/alert_ambari_metrics_monitor.py' as an absolute path or part of /var/lib/ambari-agent/cache/stacks or /var/lib/ambari-agent/cache/host_scripts
ERROR 2017-11-10 15:47:42,428 scheduler.py:520 - Job "452de60e-d34c-41d8-9748-bcff4784ebe2 (trigger: interval[0:02:00], next run at: 2017-11-10 15:49:42.210824)" raised an exception
Traceback (most recent call last):
  File "/usr/lib/python2.6/site-packages/ambari_agent/apscheduler/scheduler.py", line 512, in _run_job
    retval = job.func(*job.args, **job.kwargs)
  File "/usr/lib/python2.6/site-packages/ambari_agent/AlertSchedulerHandler.py", line 114, in <lambda>
    return lambda: alert_def.collect()
  File "/usr/lib/python2.6/site-packages/ambari_agent/alerts/base_alert.py", line 153, in collect
    data['text'] = res_base_text.format(*res[1])
ValueError: Unknown format code 'd' for object of type 'float'
  File "/usr/lib/python2.6/site-packages/ambari_agent/apscheduler/scheduler.py", line 512, in _run_job
    retval = job.func(*job.args, **job.kwargs)
  File "/usr/lib/python2.6/site-packages/ambari_agent/AlertSchedulerHandler.py", line 114, in <lambda>
    return lambda: alert_def.collect()
  File "/usr/lib/python2.6/site-packages/ambari_agent/alerts/base_alert.py", line 153, in collect
    data['text'] = res_base_text.format(*res[1])
ValueError: Unknown format code 'd' for object of type 'float'
WARNING 2017-11-11 11:52:42,221 base_alert.py:140 - [Alert][ams_metrics_monitor_process] Unable to execute alert. Unable to find 'AMBARI_METRICS/package/alerts/alert_ambari_metrics_monitor.py' as an absolute path or part of /var/lib/ambari-agent/cache/stacks or /var/lib/ambari-agent/cache/host_scripts
WARNING 2017-11-11 11:53:42,220 base_alert.py:140 - [Alert][ams_metrics_monitor_process] Unable to execute alert. Unable to find 'AMBARI_METRICS/package/alerts/alert_ambari_metrics_monitor.py' as an absolute path or part of /var/lib/ambari-agent/cache/stacks or /var/lib/ambari-agent/cache/host_scripts
ERROR 2017-11-11 11:53:42,416 scheduler.py:520 - Job "452de60e-d34c-41d8-9748-bcff4784ebe2 (trigger: interval[0:02:00], next run at: 2017-11-11 11:55:42.210824)" raised an exception
Traceback (most recent call last):
@Jay Kumar SenSharma, do you have any idea about this?
Created 11-11-2017 05:04 PM
The following error indicates that some alert definition seems to have been changed recently, and specifically that a float value is being passed where the alert text expects an integer.
File "/usr/lib/python2.6/site-packages/ambari_agent/alerts/base_alert.py", line 153, in collect
    data['text'] = res_base_text.format(*res[1])
ValueError: Unknown format code 'd' for object of type 'float'
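To illustrate what this ValueError means, here is a minimal sketch that reproduces it outside of Ambari. The template string and the `render` helper are hypothetical and only stand in for the alert text that `base_alert.py` formats; the real template comes from the alert definition.

```python
# Hypothetical alert text template; the real one comes from the alert definition.
TEMPLATE = "Connection took {0:d} ms"


def render(template, values):
    """Format the template and report a formatting failure instead of raising."""
    try:
        return template.format(*values)
    except ValueError as exc:
        return "format failed: %s" % exc


print(render(TEMPLATE, [42]))    # an int is accepted by the 'd' format code
print(render(TEMPLATE, [42.0]))  # a float is rejected, as in the agent log
print(render("Connection took {0:.0f} ms", [42.0]))  # a float-friendly format code
```

The point is only that a `d` placeholder combined with a float value fails, so either the placeholder or the value in the changed alert definition would need to be adjusted.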
So please let us know which alert definition you changed recently. If you made any changes, can you please revert them? (Is it the "Ambari Metrics Monitor Process" alert definition that you changed recently?)
Which version of Ambari are you using?
Can you please share the ambari-server.log as well?
Also, can you please verify the path of the "alert_ambari_metrics_monitor.py" file inside the "/var/lib/ambari-agent" directory on a working agent host, and check whether the same path exists on the non-working ambari-agent host?
# cd "/var/lib/ambari-agent"
# find . -name "alert_ambari_metrics_monitor.py"
Created 11-11-2017 10:06 PM
Yes. The changes to Ambari Metrics were made because we were unable to start Ambari Metrics a few days back. Hence we changed the port numbers as below.
timeline.metrics.service.webapp.address --> 0.0.0.0:7188 and hbase.zookeeper.property.clientPort --> 2181 (from 61181). It is a distributed environment.
Ambari Version 2.1.0
Hostname replaced with <affected host> in the log below.
11 Nov 2017 04:18:30,378 WARN [ambari-hearbeat-monitor] HeartbeatMonitor:154 - Heartbeat lost from host <affected host>
11 Nov 2017 04:18:30,379 WARN [ambari-hearbeat-monitor] HeartbeatMonitor:169 - Setting component state to UNKNOWN for component METRICS_MONITOR on <affected host>
11 Nov 2017 04:18:30,379 WARN [ambari-hearbeat-monitor] HeartbeatMonitor:169 - Setting component state to UNKNOWN for component FLUME_HANDLER on <affected host>
11 Nov 2017 04:18:30,379 WARN [ambari-hearbeat-monitor] HeartbeatMonitor:169 - Setting component state to UNKNOWN for component HBASE_REST_SERVER on <affected host>
11 Nov 2017 04:18:30,379 WARN [ambari-hearbeat-monitor] HeartbeatMonitor:169 - Setting component state to UNKNOWN for component HBASE_MASTER on <affected host>
11 Nov 2017 04:18:30,379 WARN [ambari-hearbeat-monitor] HeartbeatMonitor:169 - Setting component state to UNKNOWN for component ZKFC on <affected host>
11 Nov 2017 04:18:30,380 WARN [ambari-hearbeat-monitor] HeartbeatMonitor:169 - Setting component state to UNKNOWN for component NAMENODE on <affected host>
11 Nov 2017 04:18:30,380 WARN [ambari-hearbeat-monitor] HeartbeatMonitor:169 - Setting component state to UNKNOWN for component ZOOKEEPER_SERVER on <affected host>
11 Nov 2017 04:18:53,910 INFO [AlertNoticeDispatchService] AlertNoticeDispatchService:279 - There are 5 pending alert notices about to be dispatched...
11 Nov 2017 04:18:54,107 INFO [alert-dispatch-32] EmailDispatcher:88 - Sending email: XXXXXXXXXXXXXXXXXXXXXXX
11 Nov 2017 12:40:53,970 ERROR [qtp-client-6767] MetricsPropertyProvider:185 - Error getting timeline metrics. Can not connect to collector, socket error.
11 Nov 2017 12:41:03,981 ERROR [qtp-client-6767] MetricsPropertyProvider:185 - Error getting timeline metrics. Can not connect to collector, socket error.
ERROR [qtp-client-3412] MetricsReportPropertyProvider:223 - Error getting timeline metrics. Can not connect to collector, socket error.
The path of alert_ambari_metrics_monitor.py is the same on both the working ambari-agent host and the non-working ambari-agent host.
Created 11-13-2017 04:51 PM
@Jay Kumar SenSharma, Sir, do you need any more info to proceed on this? Can you please help me?
Created 11-15-2017 12:09 AM
From what I can understand, fuser is taking a long time to respond to the Ambari agent. What is the fix other than restarting the server? Any idea?
Created 11-15-2017 02:35 AM
The Ambari agent uses "fuser" by default and by design to check whether the ambari-agent port is occupied by another process. Sometimes "fuser", which is an OS command, hangs indefinitely.
If you observe that the ambari-agent is not starting because the port check with 'fuser tcp 8670' hangs on the host, then the solution is to clear these processes and recover by rebooting the affected nodes.
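To confirm whether the port check is what hangs, one option is to run the same kind of check by hand with a time limit. The sketch below is only an illustration under assumptions: `run_with_timeout` is a hypothetical helper (not part of Ambari), it needs Python 3 rather than the agent's Python 2.6, and `8670/tcp` is the standard fuser spelling for a TCP port, which may differ from the exact command the agent issues.

```python
import subprocess


def run_with_timeout(cmd, timeout_seconds):
    """Run cmd; return (returncode, stdout), or None if it did not finish in time."""
    try:
        proc = subprocess.run(
            cmd,
            stdout=subprocess.PIPE,
            stderr=subprocess.DEVNULL,
            timeout=timeout_seconds,
        )
    except subprocess.TimeoutExpired:
        # The child is killed on timeout; treat this as "the check hung".
        return None
    return proc.returncode, proc.stdout.decode()


# Example call on the affected host (assumes fuser is installed):
#   result = run_with_timeout(["fuser", "8670/tcp"], 10)
#   result is None        -> fuser did not answer within 10 seconds
#   result[1] is non-empty -> PIDs currently holding the port
```

A `None` result would point at the fuser hang described above. Note that a process stuck in uninterruptible I/O may not react to the kill either, in which case even this check can block and the reboot remains the way out.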
Created 03-30-2018 08:06 PM
@Jay Kumar SenSharma
Do we need to clear these processes and reboot the host without stopping the services on that host?
Or do we need to manually stop the services on that host before rebooting it?
Please advise.