Ambari Agent 2.5.1 deadlock

I'm evaluating the recently released Ambari 2.5.1 to decide whether to upgrade.

However, I found that over a period ranging from a few minutes to a few hours, the Ambari agents lose their heartbeat one by one until eventually all of them are in a zombie state. The processes are running fine and there is no sign of errors in the logs; each agent just stops in the middle of its routine checks.

Taking a thread dump, I consistently see the same thing: a fork call doesn't return for some reason, and the rest of the threads are waiting for the lock that the forking thread holds (looking at the code, that synchronization is new in 2.5). The only solution is to restart ambari-agent; after a while the script has to kill -9 the process, since it doesn't respond to the stop signal. I'm running on VMs, which might be more prone to race conditions.
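To make the pattern concrete, here is a minimal sketch of what I understand the new synchronization to look like. The names sp_locked_init and sp_original_init come from the main.py frame in the dump; fork_lock, LockedPopen and run_locked are hypothetical names of mine, and the body is my assumption rather than the actual Ambari code. If the thread inside the constructor never returns from fork, every other caller blocks on the lock forever.

```python
import subprocess
import threading

# Hypothetical lock name; the real agent code may use a different one.
fork_lock = threading.Lock()
sp_original_init = subprocess.Popen.__init__


def sp_locked_init(self, *a, **kw):
    # Only one thread at a time may create a child process.
    # If the original constructor hangs, the lock is never released.
    with fork_lock:
        sp_original_init(self, *a, **kw)


class LockedPopen(subprocess.Popen):
    """Popen whose constructor runs under fork_lock.

    A subclass is used here so the sketch does not patch subprocess globally.
    """
    __init__ = sp_locked_init


def run_locked(argv):
    """Start argv through LockedPopen and return (exit code, stdout bytes)."""
    proc = LockedPopen(argv, stdout=subprocess.PIPE)
    out, _ = proc.communicate()
    return proc.returncode, out
```

In the healthy case the lock is held only while the child is being created, so callers queue briefly and then proceed.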

I'm using a couple of homemade components, but they have been working fine for a year with Ambari 2.4, and I don't see how they could cause the fork call not to return.
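For anyone who wants to capture the same kind of dump from their own Python process, this is a rough sketch of how it can be produced. It is modelled on the RemoteDebugUtils.print_threads_stack_traces frame visible in the dump, but format_thread_stacks is a hypothetical name and the exact output layout is my approximation, not the agent's code.

```python
import sys
import traceback


def format_thread_stacks():
    """Return one text block listing the current stack of every live thread."""
    lines = ["*** STACKTRACE - START ***"]
    # sys._current_frames() maps each thread id to that thread's topmost frame.
    for thread_id, stack in sys._current_frames().items():
        lines.append("")
        lines.append("# ThreadID: %s" % thread_id)
        for filename, lineno, name, line in traceback.extract_stack(stack):
            lines.append('File: "%s", line %d, in %s' % (filename, lineno, name))
            if line:
                # Source text of the line being executed, when available.
                lines.append("  %s" % line.strip())
    return "\n".join(lines)
```

Calling print(format_thread_stacks()) from a signal handler or a debug thread gives one section per thread, which makes it easy to spot which threads sit in acquire() and which one is holding things up.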

A sample thread dump of the deadlock follows.

*** STACKTRACE - START ***




# ThreadID: 139794667529984
File: "/usr/lib64/python2.7/threading.py", line 784, in __bootstrap
  self.__bootstrap_inner()
File: "/usr/lib64/python2.7/threading.py", line 811, in __bootstrap_inner
  self.run()
File: "/usr/lib64/python2.7/threading.py", line 764, in run
  self.__target(*self.__args, **self.__kwargs)
File: "/usr/lib/python2.6/site-packages/ambari_agent/apscheduler/threadpool.py", line 95, in _run_jobs
  func(*args, **kwargs)
File: "/usr/lib/python2.6/site-packages/ambari_agent/apscheduler/scheduler.py", line 512, in _run_job
  retval = job.func(*job.args, **job.kwargs)
File: "/usr/lib/python2.6/site-packages/ambari_agent/AlertSchedulerHandler.py", line 155, in <lambda>
  return lambda: alert_def.collect()
File: "/usr/lib/python2.6/site-packages/ambari_agent/alerts/base_alert.py", line 112, in collect
  res = self._collect()
File: "/usr/lib/python2.6/site-packages/ambari_agent/alerts/metric_alert.py", line 97, in _collect
  jmx_property_values, http_code = self._load_jmx(alert_uri.is_ssl_enabled, host, port, self.metric_info)
File: "/usr/lib/python2.6/site-packages/ambari_agent/alerts/metric_alert.py", line 212, in _load_jmx
  connection_timeout=self.curl_connection_timeout, kinit_timer_ms = self.kinit_timeout)
File: "/usr/lib/python2.6/site-packages/resource_management/libraries/functions/curl_krb_request.py", line 94, in curl_krb_request
  import uuid


# ThreadID: 139794650744576
File: "/usr/lib64/python2.7/threading.py", line 784, in __bootstrap
  self.__bootstrap_inner()
File: "/usr/lib64/python2.7/threading.py", line 811, in __bootstrap_inner
  self.run()
File: "/usr/lib64/python2.7/threading.py", line 764, in run
  self.__target(*self.__args, **self.__kwargs)
File: "/usr/lib/python2.6/site-packages/ambari_agent/apscheduler/threadpool.py", line 95, in _run_jobs
  func(*args, **kwargs)
File: "/usr/lib/python2.6/site-packages/ambari_agent/apscheduler/scheduler.py", line 512, in _run_job
  retval = job.func(*job.args, **job.kwargs)
File: "/usr/lib/python2.6/site-packages/ambari_agent/AlertSchedulerHandler.py", line 155, in <lambda>
  return lambda: alert_def.collect()
File: "/usr/lib/python2.6/site-packages/ambari_agent/alerts/base_alert.py", line 112, in collect
  res = self._collect()
File: "/usr/lib/python2.6/site-packages/ambari_agent/alerts/metric_alert.py", line 97, in _collect
  jmx_property_values, http_code = self._load_jmx(alert_uri.is_ssl_enabled, host, port, self.metric_info)
File: "/usr/lib/python2.6/site-packages/ambari_agent/alerts/metric_alert.py", line 212, in _load_jmx
  connection_timeout=self.curl_connection_timeout, kinit_timer_ms = self.kinit_timeout)
File: "/usr/lib/python2.6/site-packages/resource_management/libraries/functions/curl_krb_request.py", line 94, in curl_krb_request
  import uuid


# ThreadID: 139795212793600
File: "/usr/lib64/python2.7/threading.py", line 784, in __bootstrap
  self.__bootstrap_inner()
File: "/usr/lib64/python2.7/threading.py", line 811, in __bootstrap_inner
  self.run()
File: "/usr/lib/python2.6/site-packages/ambari_agent/Controller.py", line 497, in run
  self.registerAndHeartbeat()
File: "/usr/lib/python2.6/site-packages/ambari_agent/Controller.py", line 525, in registerAndHeartbeat
  self.heartbeatWithServer()
File: "/usr/lib/python2.6/site-packages/ambari_agent/Controller.py", line 313, in heartbeatWithServer
  data = json.dumps(self.heartbeat.build(self.responseId, send_state, self.hasMappedComponents))
File: "/usr/lib/python2.6/site-packages/ambari_agent/Heartbeat.py", line 46, in build
  queueResult = self.actionQueue.result()
File: "/usr/lib/python2.6/site-packages/ambari_agent/ActionQueue.py", line 571, in result
  return self.commandStatuses.generate_report()
File: "/usr/lib/python2.6/site-packages/ambari_agent/CommandStatusDict.py", line 88, in generate_report
  from ActionQueue import ActionQueue


# ThreadID: 139795162437376
File: "/usr/lib64/python2.7/threading.py", line 784, in __bootstrap
  self.__bootstrap_inner()
File: "/usr/lib64/python2.7/threading.py", line 811, in __bootstrap_inner
  self.run()
File: "/usr/lib64/python2.7/threading.py", line 764, in run
  self.__target(*self.__args, **self.__kwargs)
File: "/usr/lib/python2.6/site-packages/ambari_agent/apscheduler/threadpool.py", line 95, in _run_jobs
  func(*args, **kwargs)
File: "/usr/lib/python2.6/site-packages/ambari_agent/apscheduler/scheduler.py", line 512, in _run_job
  retval = job.func(*job.args, **job.kwargs)
File: "/usr/lib/python2.6/site-packages/ambari_agent/AlertSchedulerHandler.py", line 155, in <lambda>
  return lambda: alert_def.collect()
File: "/usr/lib/python2.6/site-packages/ambari_agent/alerts/base_alert.py", line 112, in collect
  res = self._collect()
File: "/usr/lib/python2.6/site-packages/ambari_agent/alerts/script_alert.py", line 90, in _collect
  cmd_module = self._load_source()
File: "/usr/lib/python2.6/site-packages/ambari_agent/alerts/script_alert.py", line 172, in _load_source
  return imp.load_source(self._get_alert_meta_value_safely('name'), self.path_to_script)
File: "/var/lib/ambari-agent/cache/common-services/AMBARI_METRICS/0.1.0/package/alerts/alert_ambari_metrics_monitor.py", line 21, in <module>
  import os


# ThreadID: 139795435849472
File: "/usr/lib64/python2.7/threading.py", line 784, in __bootstrap
  self.__bootstrap_inner()
File: "/usr/lib64/python2.7/threading.py", line 811, in __bootstrap_inner
  self.run()
File: "/usr/lib/python2.6/site-packages/ambari_agent/PingPortListener.py", line 67, in run
  conn, addr = self.socket.accept()
File: "/usr/lib64/python2.7/socket.py", line 202, in accept
  sock, addr = self._sock.accept()


# ThreadID: 139795179222784
File: "/usr/lib64/python2.7/threading.py", line 784, in __bootstrap
  self.__bootstrap_inner()
File: "/usr/lib64/python2.7/threading.py", line 811, in __bootstrap_inner
  self.run()
File: "/usr/lib64/python2.7/threading.py", line 764, in run
  self.__target(*self.__args, **self.__kwargs)
File: "/usr/lib/python2.6/site-packages/ambari_agent/apscheduler/threadpool.py", line 95, in _run_jobs
  func(*args, **kwargs)
File: "/usr/lib/python2.6/site-packages/ambari_agent/apscheduler/scheduler.py", line 512, in _run_job
  retval = job.func(*job.args, **job.kwargs)
File: "/usr/lib/python2.6/site-packages/ambari_agent/AlertSchedulerHandler.py", line 155, in <lambda>
  return lambda: alert_def.collect()
File: "/usr/lib/python2.6/site-packages/ambari_agent/alerts/base_alert.py", line 112, in collect
  res = self._collect()
File: "/usr/lib/python2.6/site-packages/ambari_agent/alerts/script_alert.py", line 115, in _collect
  result = cmd_module.execute(configurations, self.parameters, self.host_name)
File: "/var/lib/ambari-agent/cache/common-services/HDFS/2.1.0.2.0/package/alerts/alert_checkpoint_time.py", line 189, in execute
  kinit_timer_ms = kinit_timer_ms)
File: "/usr/lib/python2.6/site-packages/resource_management/libraries/functions/curl_krb_request.py", line 122, in curl_krb_request
  kinit_lock.acquire()
File: "/usr/lib64/python2.7/threading.py", line 173, in acquire
  rc = self.__block.acquire(blocking)


# ThreadID: 139794625566464
File: "/usr/lib64/python2.7/threading.py", line 784, in __bootstrap
  self.__bootstrap_inner()
File: "/usr/lib64/python2.7/threading.py", line 811, in __bootstrap_inner
  self.run()
File: "/usr/lib64/python2.7/threading.py", line 764, in run
  self.__target(*self.__args, **self.__kwargs)
File: "/usr/lib/python2.6/site-packages/ambari_agent/apscheduler/threadpool.py", line 95, in _run_jobs
  func(*args, **kwargs)
File: "/usr/lib/python2.6/site-packages/ambari_agent/apscheduler/scheduler.py", line 512, in _run_job
  retval = job.func(*job.args, **job.kwargs)
File: "/usr/lib/python2.6/site-packages/ambari_agent/AlertSchedulerHandler.py", line 155, in <lambda>
  return lambda: alert_def.collect()
File: "/usr/lib/python2.6/site-packages/ambari_agent/alerts/base_alert.py", line 112, in collect
  res = self._collect()
File: "/usr/lib/python2.6/site-packages/ambari_agent/alerts/web_alert.py", line 102, in _collect
  web_response = self._make_web_request(url)
File: "/usr/lib/python2.6/site-packages/ambari_agent/alerts/web_alert.py", line 201, in _make_web_request
  connection_timeout=self.curl_connection_timeout, kinit_timer_ms = self.kinit_timeout)
File: "/usr/lib/python2.6/site-packages/resource_management/libraries/functions/curl_krb_request.py", line 94, in curl_krb_request
  import uuid


# ThreadID: 139795423160064
File: "/usr/lib64/python2.7/threading.py", line 784, in __bootstrap
  self.__bootstrap_inner()
File: "/usr/lib64/python2.7/threading.py", line 811, in __bootstrap_inner
  self.run()
File: "/usr/lib64/python2.7/threading.py", line 764, in run
  self.__target(*self.__args, **self.__kwargs)
File: "/usr/lib/python2.6/site-packages/ambari_agent/apscheduler/scheduler.py", line 590, in _main_loop
  self._wakeup.wait(wait_seconds)
File: "/usr/lib64/python2.7/threading.py", line 621, in wait
  self.__cond.wait(timeout, balancing)
File: "/usr/lib64/python2.7/threading.py", line 361, in wait
  _sleep(delay)


# ThreadID: 139795170830080
File: "/usr/lib64/python2.7/threading.py", line 784, in __bootstrap
  self.__bootstrap_inner()
File: "/usr/lib64/python2.7/threading.py", line 811, in __bootstrap_inner
  self.run()
File: "/usr/lib64/python2.7/threading.py", line 764, in run
  self.__target(*self.__args, **self.__kwargs)
File: "/usr/lib/python2.6/site-packages/ambari_agent/apscheduler/threadpool.py", line 95, in _run_jobs
  func(*args, **kwargs)
File: "/usr/lib/python2.6/site-packages/ambari_agent/apscheduler/scheduler.py", line 512, in _run_job
  retval = job.func(*job.args, **job.kwargs)
File: "/usr/lib/python2.6/site-packages/ambari_agent/AlertSchedulerHandler.py", line 155, in <lambda>
  return lambda: alert_def.collect()
File: "/usr/lib/python2.6/site-packages/ambari_agent/alerts/base_alert.py", line 112, in collect
  res = self._collect()
File: "/usr/lib/python2.6/site-packages/ambari_agent/alerts/metric_alert.py", line 97, in _collect
  jmx_property_values, http_code = self._load_jmx(alert_uri.is_ssl_enabled, host, port, self.metric_info)
File: "/usr/lib/python2.6/site-packages/ambari_agent/alerts/metric_alert.py", line 212, in _load_jmx
  connection_timeout=self.curl_connection_timeout, kinit_timer_ms = self.kinit_timeout)
File: "/usr/lib/python2.6/site-packages/resource_management/libraries/functions/curl_krb_request.py", line 122, in curl_krb_request
  kinit_lock.acquire()
File: "/usr/lib64/python2.7/threading.py", line 173, in acquire
  rc = self.__block.acquire(blocking)


# ThreadID: 139794105480960
File: "/usr/lib64/python2.7/threading.py", line 784, in __bootstrap
  self.__bootstrap_inner()
File: "/usr/lib64/python2.7/threading.py", line 811, in __bootstrap_inner
  self.run()
File: "/usr/lib64/python2.7/threading.py", line 764, in run
  self.__target(*self.__args, **self.__kwargs)
File: "/usr/lib/python2.6/site-packages/ambari_agent/apscheduler/threadpool.py", line 95, in _run_jobs
  func(*args, **kwargs)
File: "/usr/lib/python2.6/site-packages/ambari_agent/apscheduler/scheduler.py", line 512, in _run_job
  retval = job.func(*job.args, **job.kwargs)
File: "/usr/lib/python2.6/site-packages/ambari_agent/AlertSchedulerHandler.py", line 155, in <lambda>
  return lambda: alert_def.collect()
File: "/usr/lib/python2.6/site-packages/ambari_agent/alerts/base_alert.py", line 112, in collect
  res = self._collect()
File: "/usr/lib/python2.6/site-packages/ambari_agent/alerts/web_alert.py", line 102, in _collect
  web_response = self._make_web_request(url)
File: "/usr/lib/python2.6/site-packages/ambari_agent/alerts/web_alert.py", line 201, in _make_web_request
  connection_timeout=self.curl_connection_timeout, kinit_timer_ms = self.kinit_timeout)
File: "/usr/lib/python2.6/site-packages/resource_management/libraries/functions/curl_krb_request.py", line 94, in curl_krb_request
  import uuid


# ThreadID: 139795187615488
File: "/usr/lib64/python2.7/threading.py", line 784, in __bootstrap
  self.__bootstrap_inner()
File: "/usr/lib64/python2.7/threading.py", line 811, in __bootstrap_inner
  self.run()
File: "/usr/lib64/python2.7/threading.py", line 764, in run
  self.__target(*self.__args, **self.__kwargs)
File: "/usr/lib/python2.6/site-packages/ambari_agent/apscheduler/threadpool.py", line 95, in _run_jobs
  func(*args, **kwargs)
File: "/usr/lib/python2.6/site-packages/ambari_agent/apscheduler/scheduler.py", line 512, in _run_job
  retval = job.func(*job.args, **job.kwargs)
File: "/usr/lib/python2.6/site-packages/ambari_agent/AlertSchedulerHandler.py", line 155, in <lambda>
  return lambda: alert_def.collect()
File: "/usr/lib/python2.6/site-packages/ambari_agent/alerts/base_alert.py", line 112, in collect
  res = self._collect()
File: "/usr/lib/python2.6/site-packages/ambari_agent/alerts/web_alert.py", line 102, in _collect
  web_response = self._make_web_request(url)
File: "/usr/lib/python2.6/site-packages/ambari_agent/alerts/web_alert.py", line 201, in _make_web_request
  connection_timeout=self.curl_connection_timeout, kinit_timer_ms = self.kinit_timeout)
File: "/usr/lib/python2.6/site-packages/resource_management/libraries/functions/curl_krb_request.py", line 94, in curl_krb_request
  import uuid


# ThreadID: 139793585395456
File: "/usr/lib64/python2.7/threading.py", line 784, in __bootstrap
  self.__bootstrap_inner()
File: "/usr/lib64/python2.7/threading.py", line 811, in __bootstrap_inner
  self.run()
File: "/usr/lib64/python2.7/threading.py", line 764, in run
  self.__target(*self.__args, **self.__kwargs)
File: "/usr/lib/python2.6/site-packages/ambari_agent/apscheduler/threadpool.py", line 95, in _run_jobs
  func(*args, **kwargs)
File: "/usr/lib/python2.6/site-packages/ambari_agent/apscheduler/scheduler.py", line 512, in _run_job
  retval = job.func(*job.args, **job.kwargs)
File: "/usr/lib/python2.6/site-packages/ambari_agent/AlertSchedulerHandler.py", line 155, in <lambda>
  return lambda: alert_def.collect()
File: "/usr/lib/python2.6/site-packages/ambari_agent/alerts/base_alert.py", line 112, in collect
  res = self._collect()
File: "/usr/lib/python2.6/site-packages/ambari_agent/alerts/web_alert.py", line 102, in _collect
  web_response = self._make_web_request(url)
File: "/usr/lib/python2.6/site-packages/ambari_agent/alerts/web_alert.py", line 201, in _make_web_request
  connection_timeout=self.curl_connection_timeout, kinit_timer_ms = self.kinit_timeout)
File: "/usr/lib/python2.6/site-packages/resource_management/libraries/functions/curl_krb_request.py", line 122, in curl_krb_request
  kinit_lock.acquire()
File: "/usr/lib64/python2.7/threading.py", line 173, in acquire
  rc = self.__block.acquire(blocking)


# ThreadID: 139793593788160
File: "/usr/lib64/python2.7/threading.py", line 784, in __bootstrap
  self.__bootstrap_inner()
File: "/usr/lib64/python2.7/threading.py", line 811, in __bootstrap_inner
  self.run()
File: "/usr/lib64/python2.7/threading.py", line 764, in run
  self.__target(*self.__args, **self.__kwargs)
File: "/usr/lib/python2.6/site-packages/ambari_agent/apscheduler/threadpool.py", line 95, in _run_jobs
  func(*args, **kwargs)
File: "/usr/lib/python2.6/site-packages/ambari_agent/apscheduler/scheduler.py", line 512, in _run_job
  retval = job.func(*job.args, **job.kwargs)
File: "/usr/lib/python2.6/site-packages/ambari_agent/AlertSchedulerHandler.py", line 155, in <lambda>
  return lambda: alert_def.collect()
File: "/usr/lib/python2.6/site-packages/ambari_agent/alerts/base_alert.py", line 112, in collect
  res = self._collect()
File: "/usr/lib/python2.6/site-packages/ambari_agent/alerts/metric_alert.py", line 97, in _collect
  jmx_property_values, http_code = self._load_jmx(alert_uri.is_ssl_enabled, host, port, self.metric_info)
File: "/usr/lib/python2.6/site-packages/ambari_agent/alerts/metric_alert.py", line 216, in _load_jmx
  url_opener = urllib2.build_opener(RefreshHeaderProcessor())
File: "/usr/lib64/python2.7/urllib2.py", line 490, in build_opener
  import types


# ThreadID: 139794642351872
File: "/usr/lib64/python2.7/threading.py", line 784, in __bootstrap
  self.__bootstrap_inner()
File: "/usr/lib64/python2.7/threading.py", line 811, in __bootstrap_inner
  self.run()
File: "/usr/lib64/python2.7/threading.py", line 764, in run
  self.__target(*self.__args, **self.__kwargs)
File: "/usr/lib/python2.6/site-packages/ambari_agent/apscheduler/threadpool.py", line 95, in _run_jobs
  func(*args, **kwargs)
File: "/usr/lib/python2.6/site-packages/ambari_agent/apscheduler/scheduler.py", line 512, in _run_job
  retval = job.func(*job.args, **job.kwargs)
File: "/usr/lib/python2.6/site-packages/ambari_agent/AlertSchedulerHandler.py", line 155, in <lambda>
  return lambda: alert_def.collect()
File: "/usr/lib/python2.6/site-packages/ambari_agent/alerts/base_alert.py", line 112, in collect
  res = self._collect()
File: "/usr/lib/python2.6/site-packages/ambari_agent/alerts/script_alert.py", line 115, in _collect
  result = cmd_module.execute(configurations, self.parameters, self.host_name)
File: "/var/lib/ambari-agent/cache/common-services/YARN/2.1.0.2.0/package/alerts/alert_nodemanagers_summary.py", line 138, in execute
  kinit_timer_ms = kinit_timer_ms)
File: "/usr/lib/python2.6/site-packages/resource_management/libraries/functions/curl_krb_request.py", line 142, in curl_krb_request
  is_kinit_required = (shell.call(klist_command, user=user)[0] != 0)
File: "/usr/lib/python2.6/site-packages/resource_management/core/shell.py", line 72, in inner
  result = function(command, **kwargs)
File: "/usr/lib/python2.6/site-packages/resource_management/core/shell.py", line 115, in call
  tries=tries, try_sleep=try_sleep, timeout_kill_strategy=timeout_kill_strategy)
File: "/usr/lib/python2.6/site-packages/resource_management/core/shell.py", line 150, in _call_wrapper
  result = _call(command, **kwargs_copy)
File: "/usr/lib/python2.6/site-packages/resource_management/core/shell.py", line 223, in _call
  preexec_fn=preexec_fn)
File: "/usr/lib/python2.6/site-packages/ambari_agent/main.py", line 66, in sp_locked_init
  sp_original_init(self, *a, **kw)
File: "/usr/lib64/python2.7/subprocess.py", line 711, in __init__
  errread, errwrite)
File: "/usr/lib64/python2.7/subprocess.py", line 1224, in _execute_child
  self.pid = os.fork()


# ThreadID: 139794633959168
File: "/usr/lib64/python2.7/threading.py", line 784, in __bootstrap
  self.__bootstrap_inner()
File: "/usr/lib64/python2.7/threading.py", line 811, in __bootstrap_inner
  self.run()
File: "/usr/lib64/python2.7/threading.py", line 764, in run
  self.__target(*self.__args, **self.__kwargs)
File: "/usr/lib/python2.6/site-packages/ambari_agent/apscheduler/threadpool.py", line 95, in _run_jobs
  func(*args, **kwargs)
File: "/usr/lib/python2.6/site-packages/ambari_agent/apscheduler/scheduler.py", line 512, in _run_job
  retval = job.func(*job.args, **job.kwargs)
File: "/usr/lib/python2.6/site-packages/ambari_agent/AlertSchedulerHandler.py", line 155, in <lambda>
  return lambda: alert_def.collect()
File: "/usr/lib/python2.6/site-packages/ambari_agent/alerts/base_alert.py", line 112, in collect
  res = self._collect()
File: "/usr/lib/python2.6/site-packages/ambari_agent/alerts/script_alert.py", line 115, in _collect
  result = cmd_module.execute(configurations, self.parameters, self.host_name)
File: "/var/lib/ambari-agent/cache/common-services/YARN/2.1.0.2.0/package/alerts/alert_nodemanager_health.py", line 166, in execute
  connection_timeout=curl_connection_timeout, kinit_timer_ms = kinit_timer_ms)
File: "/usr/lib/python2.6/site-packages/resource_management/libraries/functions/curl_krb_request.py", line 209, in curl_krb_request
  _, curl_stdout, curl_stderr = get_user_call_output(curl_command, user=user, env=kerberos_env)
File: "/usr/lib/python2.6/site-packages/resource_management/libraries/functions/get_user_call_output.py", line 50, in get_user_call_output
  code, _ = shell.call(shell.as_user(command_string, user), quiet=quiet, **call_kwargs)
File: "/usr/lib/python2.6/site-packages/resource_management/core/shell.py", line 61, in inner
  Logger.info(log_msg)
File: "/usr/lib/python2.6/site-packages/resource_management/core/logger.py", line 75, in info
  Logger.logger.info(Logger.filter_text(text))
File: "/usr/lib/python2.6/site-packages/resource_management/core/logger.py", line 102, in filter_text
  from resource_management.core.shell import PLACEHOLDERS_TO_STR


# ThreadID: 139795196008192
File: "/usr/lib64/python2.7/threading.py", line 784, in __bootstrap
  self.__bootstrap_inner()
File: "/usr/lib64/python2.7/threading.py", line 811, in __bootstrap_inner
  self.run()
File: "/usr/lib64/python2.7/threading.py", line 764, in run
  self.__target(*self.__args, **self.__kwargs)
File: "/usr/lib/python2.6/site-packages/ambari_agent/apscheduler/threadpool.py", line 95, in _run_jobs
  func(*args, **kwargs)
File: "/usr/lib/python2.6/site-packages/ambari_agent/apscheduler/scheduler.py", line 512, in _run_job
  retval = job.func(*job.args, **job.kwargs)
File: "/usr/lib/python2.6/site-packages/ambari_agent/AlertSchedulerHandler.py", line 155, in <lambda>
  return lambda: alert_def.collect()
File: "/usr/lib/python2.6/site-packages/ambari_agent/alerts/base_alert.py", line 112, in collect
  res = self._collect()
File: "/usr/lib/python2.6/site-packages/ambari_agent/alerts/script_alert.py", line 115, in _collect
  result = cmd_module.execute(configurations, self.parameters, self.host_name)
File: "/var/lib/ambari-agent/cache/common-services/HIVE/0.12.0.2.0/package/alerts/alert_webhcat_server.py", line 160, in execute
  kinit_timer_ms = kinit_timer_ms)
File: "/usr/lib/python2.6/site-packages/resource_management/libraries/functions/curl_krb_request.py", line 122, in curl_krb_request
  kinit_lock.acquire()
File: "/usr/lib64/python2.7/threading.py", line 173, in acquire
  rc = self.__block.acquire(blocking)


# ThreadID: 139794675922688
File: "/usr/lib64/python2.7/threading.py", line 784, in __bootstrap
  self.__bootstrap_inner()
File: "/usr/lib64/python2.7/threading.py", line 811, in __bootstrap_inner
  self.run()
File: "/usr/lib64/python2.7/threading.py", line 764, in run
  self.__target(*self.__args, **self.__kwargs)
File: "/usr/lib/python2.6/site-packages/ambari_agent/apscheduler/threadpool.py", line 95, in _run_jobs
  func(*args, **kwargs)
File: "/usr/lib/python2.6/site-packages/ambari_agent/apscheduler/scheduler.py", line 512, in _run_job
  retval = job.func(*job.args, **job.kwargs)
File: "/usr/lib/python2.6/site-packages/ambari_agent/AlertSchedulerHandler.py", line 155, in <lambda>
  return lambda: alert_def.collect()
File: "/usr/lib/python2.6/site-packages/ambari_agent/alerts/base_alert.py", line 112, in collect
  res = self._collect()
File: "/usr/lib/python2.6/site-packages/ambari_agent/alerts/metric_alert.py", line 97, in _collect
  jmx_property_values, http_code = self._load_jmx(alert_uri.is_ssl_enabled, host, port, self.metric_info)
File: "/usr/lib/python2.6/site-packages/ambari_agent/alerts/metric_alert.py", line 212, in _load_jmx
  connection_timeout=self.curl_connection_timeout, kinit_timer_ms = self.kinit_timeout)
File: "/usr/lib/python2.6/site-packages/resource_management/libraries/functions/curl_krb_request.py", line 122, in curl_krb_request
  kinit_lock.acquire()
File: "/usr/lib64/python2.7/threading.py", line 173, in acquire
  rc = self.__block.acquire(blocking)


# ThreadID: 139795680462656
File: "/usr/lib/python2.6/site-packages/ambari_agent/main.py", line 472, in <module>
  main(heartbeat_stop_callback)
File: "/usr/lib/python2.6/site-packages/ambari_agent/main.py", line 451, in main
  run_threads(server_hostname, heartbeat_stop_callback)
File: "/usr/lib/python2.6/site-packages/ambari_agent/main.py", line 335, in run_threads
  time.sleep(0.1)
File: "/usr/lib/python2.6/site-packages/ambari_agent/RemoteDebugUtils.py", line 35, in print_threads_stack_traces
  for filename, lineno, name, line in traceback.extract_stack(stack):


# ThreadID: 139794097088256
File: "/usr/lib64/python2.7/threading.py", line 784, in __bootstrap
  self.__bootstrap_inner()
File: "/usr/lib64/python2.7/threading.py", line 811, in __bootstrap_inner
  self.run()
File: "/usr/lib64/python2.7/threading.py", line 764, in run
  self.__target(*self.__args, **self.__kwargs)
File: "/usr/lib/python2.6/site-packages/ambari_agent/apscheduler/threadpool.py", line 95, in _run_jobs
  func(*args, **kwargs)
File: "/usr/lib/python2.6/site-packages/ambari_agent/apscheduler/scheduler.py", line 512, in _run_job
  retval = job.func(*job.args, **job.kwargs)
File: "/usr/lib/python2.6/site-packages/ambari_agent/AlertSchedulerHandler.py", line 155, in <lambda>
  return lambda: alert_def.collect()
File: "/usr/lib/python2.6/site-packages/ambari_agent/alerts/base_alert.py", line 112, in collect
  res = self._collect()
File: "/usr/lib/python2.6/site-packages/ambari_agent/alerts/web_alert.py", line 102, in _collect
  web_response = self._make_web_request(url)
File: "/usr/lib/python2.6/site-packages/ambari_agent/alerts/web_alert.py", line 201, in _make_web_request
  connection_timeout=self.curl_connection_timeout, kinit_timer_ms = self.kinit_timeout)
File: "/usr/lib/python2.6/site-packages/resource_management/libraries/functions/curl_krb_request.py", line 122, in curl_krb_request
  kinit_lock.acquire()
File: "/usr/lib64/python2.7/threading.py", line 173, in acquire
  rc = self.__block.acquire(blocking)


# ThreadID: 139794113873664
File: "/usr/lib64/python2.7/threading.py", line 784, in __bootstrap
  self.__bootstrap_inner()
File: "/usr/lib64/python2.7/threading.py", line 811, in __bootstrap_inner
  self.run()
File: "/usr/lib64/python2.7/threading.py", line 764, in run
  self.__target(*self.__args, **self.__kwargs)
File: "/usr/lib/python2.6/site-packages/ambari_agent/apscheduler/threadpool.py", line 95, in _run_jobs
  func(*args, **kwargs)
File: "/usr/lib/python2.6/site-packages/ambari_agent/apscheduler/scheduler.py", line 512, in _run_job
  retval = job.func(*job.args, **job.kwargs)
File: "/usr/lib/python2.6/site-packages/ambari_agent/AlertSchedulerHandler.py", line 155, in <lambda>
  return lambda: alert_def.collect()
File: "/usr/lib/python2.6/site-packages/ambari_agent/alerts/base_alert.py", line 112, in collect
  res = self._collect()
File: "/usr/lib/python2.6/site-packages/ambari_agent/alerts/script_alert.py", line 115, in _collect
  result = cmd_module.execute(configurations, self.parameters, self.host_name)
File: "/var/lib/ambari-agent/cache/common-services/HDFS/2.1.0.2.0/package/alerts/alert_upgrade_finalized.py", line 132, in execute
  "HDFS Upgrade Finalized State", smokeuser, kinit_timer_ms = kinit_timer_ms
File: "/usr/lib/python2.6/site-packages/resource_management/libraries/functions/curl_krb_request.py", line 122, in curl_krb_request
  kinit_lock.acquire()
File: "/usr/lib64/python2.7/threading.py", line 173, in acquire
  rc = self.__block.acquire(blocking)


# ThreadID: 139795444242176
File: "/usr/lib64/python2.7/threading.py", line 784, in __bootstrap
  self.__bootstrap_inner()
File: "/usr/lib64/python2.7/threading.py", line 811, in __bootstrap_inner
  self.run()
File: "/usr/lib/python2.6/site-packages/ambari_agent/DataCleaner.py", line 123, in run
  time.sleep(self.cleanup_interval)


# ThreadID: 139794139051776
File: "/usr/lib64/python2.7/threading.py", line 784, in __bootstrap
  self.__bootstrap_inner()
File: "/usr/lib64/python2.7/threading.py", line 811, in __bootstrap_inner
  self.run()
File: "/usr/lib64/python2.7/threading.py", line 764, in run
  self.__target(*self.__args, **self.__kwargs)
File: "/usr/lib/python2.6/site-packages/ambari_agent/apscheduler/threadpool.py", line 95, in _run_jobs
  func(*args, **kwargs)
File: "/usr/lib/python2.6/site-packages/ambari_agent/apscheduler/scheduler.py", line 512, in _run_job
  retval = job.func(*job.args, **job.kwargs)
File: "/usr/lib/python2.6/site-packages/ambari_agent/AlertSchedulerHandler.py", line 155, in <lambda>
  return lambda: alert_def.collect()
File: "/usr/lib/python2.6/site-packages/ambari_agent/alerts/base_alert.py", line 112, in collect
  res = self._collect()
File: "/usr/lib/python2.6/site-packages/ambari_agent/alerts/web_alert.py", line 102, in _collect
  web_response = self._make_web_request(url)
File: "/usr/lib/python2.6/site-packages/ambari_agent/alerts/web_alert.py", line 201, in _make_web_request
  connection_timeout=self.curl_connection_timeout, kinit_timer_ms = self.kinit_timeout)
File: "/usr/lib/python2.6/site-packages/resource_management/libraries/functions/curl_krb_request.py", line 122, in curl_krb_request
  kinit_lock.acquire()
File: "/usr/lib64/python2.7/threading.py", line 173, in acquire
  rc = self.__block.acquire(blocking)


# ThreadID: 139794659137280
File: "/usr/lib64/python2.7/threading.py", line 784, in __bootstrap
  self.__bootstrap_inner()
File: "/usr/lib64/python2.7/threading.py", line 811, in __bootstrap_inner
  self.run()
File: "/usr/lib64/python2.7/threading.py", line 764, in run
  self.__target(*self.__args, **self.__kwargs)
File: "/usr/lib/python2.6/site-packages/ambari_agent/apscheduler/threadpool.py", line 95, in _run_jobs
  func(*args, **kwargs)
File: "/usr/lib/python2.6/site-packages/ambari_agent/apscheduler/scheduler.py", line 512, in _run_job
  retval = job.func(*job.args, **job.kwargs)
File: "/usr/lib/python2.6/site-packages/ambari_agent/AlertSchedulerHandler.py", line 155, in <lambda>
  return lambda: alert_def.collect()
File: "/usr/lib/python2.6/site-packages/ambari_agent/alerts/base_alert.py", line 112, in collect
  res = self._collect()
File: "/usr/lib/python2.6/site-packages/ambari_agent/alerts/web_alert.py", line 102, in _collect
  web_response = self._make_web_request(url)
File: "/usr/lib/python2.6/site-packages/ambari_agent/alerts/web_alert.py", line 201, in _make_web_request
  connection_timeout=self.curl_connection_timeout, kinit_timer_ms = self.kinit_timeout)
File: "/usr/lib/python2.6/site-packages/resource_management/libraries/functions/curl_krb_request.py", line 122, in curl_krb_request
  kinit_lock.acquire()
File: "/usr/lib64/python2.7/threading.py", line 173, in acquire
  rc = self.__block.acquire(blocking)


# ThreadID: 139795204400896
File: "/usr/lib64/python2.7/threading.py", line 784, in __bootstrap
  self.__bootstrap_inner()
File: "/usr/lib64/python2.7/threading.py", line 811, in __bootstrap_inner
  self.run()
File: "/usr/lib/python2.6/site-packages/ambari_agent/ActionQueue.py", line 149, in run
  self.controller.get_status_commands_executor().process_results() # process status commands
File: "/usr/lib/python2.6/site-packages/ambari_agent/StatusCommandsExecutor.py", line 76, in process_results
  self.actionQueue.process_status_command_result(self.actionQueue.execute_status_command_and_security_status(command))
File: "/usr/lib/python2.6/site-packages/ambari_agent/ActionQueue.py", line 500, in execute_status_command_and_security_status
  component_status_result = self.customServiceOrchestrator.requestComponentStatus(command)
File: "/usr/lib/python2.6/site-packages/ambari_agent/CustomServiceOrchestrator.py", line 471, in requestComponentStatus
  override_output_files=override_output_files)
File: "/usr/lib/python2.6/site-packages/ambari_agent/CustomServiceOrchestrator.py", line 412, in runCommand
  handle = handle, log_info_on_failure=log_info_on_failure)
File: "/usr/lib/python2.6/site-packages/ambari_agent/PythonReflectiveExecutor.py", line 59, in run_file
  imp.load_source('__main__', script)
File: "/var/lib/ambari-agent/cache/common-services/REST_API/1.1.0-SNAPSHOT/package/scripts/play.py", line 192, in <module>
  PlayServer().execute()
File: "/usr/lib/python2.6/site-packages/resource_management/libraries/script/script.py", line 329, in execute
  method(env)
File: "/var/lib/ambari-agent/cache/common-services/REST_API/1.1.0-SNAPSHOT/package/scripts/play.py", line 41, in status
  from env_params import pid_file
File: "/var/lib/ambari-agent/cache/common-services/REST_API/1.1.0-SNAPSHOT/package/scripts/env_params.py", line 4, in <module>
  from install_params import deploy_dir
File: "/var/lib/ambari-agent/cache/common-services/REST_API/1.1.0-SNAPSHOT/package/scripts/install_params.py", line 6, in <module>
  code, hdp_version = call("hdp-select versions | tail -1")
File: "/usr/lib/python2.6/site-packages/resource_management/core/shell.py", line 72, in inner
  result = function(command, **kwargs)
File: "/usr/lib/python2.6/site-packages/resource_management/core/shell.py", line 115, in call
  tries=tries, try_sleep=try_sleep, timeout_kill_strategy=timeout_kill_strategy)
File: "/usr/lib/python2.6/site-packages/resource_management/core/shell.py", line 150, in _call_wrapper
  result = _call(command, **kwargs_copy)
File: "/usr/lib/python2.6/site-packages/resource_management/core/shell.py", line 223, in _call
  preexec_fn=preexec_fn)
File: "/usr/lib/python2.6/site-packages/ambari_agent/main.py", line 65, in sp_locked_init
  with lock:
File: "/usr/lib64/python2.7/threading.py", line 173, in acquire
  rc = self.__block.acquire(blocking)


# ThreadID: 139794130659072
File: "/usr/lib64/python2.7/threading.py", line 784, in __bootstrap
  self.__bootstrap_inner()
File: "/usr/lib64/python2.7/threading.py", line 811, in __bootstrap_inner
  self.run()
File: "/usr/lib64/python2.7/threading.py", line 764, in run
  self.__target(*self.__args, **self.__kwargs)
File: "/usr/lib/python2.6/site-packages/ambari_agent/apscheduler/threadpool.py", line 95, in _run_jobs
  func(*args, **kwargs)
File: "/usr/lib/python2.6/site-packages/ambari_agent/apscheduler/scheduler.py", line 512, in _run_job
  retval = job.func(*job.args, **job.kwargs)
File: "/usr/lib/python2.6/site-packages/ambari_agent/AlertSchedulerHandler.py", line 155, in <lambda>
  return lambda: alert_def.collect()
File: "/usr/lib/python2.6/site-packages/ambari_agent/alerts/base_alert.py", line 112, in collect
  res = self._collect()
File: "/usr/lib/python2.6/site-packages/ambari_agent/alerts/metric_alert.py", line 97, in _collect
  jmx_property_values, http_code = self._load_jmx(alert_uri.is_ssl_enabled, host, port, self.metric_info)
File: "/usr/lib/python2.6/site-packages/ambari_agent/alerts/metric_alert.py", line 212, in _load_jmx
  connection_timeout=self.curl_connection_timeout, kinit_timer_ms = self.kinit_timeout)
File: "/usr/lib/python2.6/site-packages/resource_management/libraries/functions/curl_krb_request.py", line 94, in curl_krb_request
  import uuid


# ThreadID: 139794122266368
File: "/usr/lib64/python2.7/threading.py", line 784, in __bootstrap
  self.__bootstrap_inner()
File: "/usr/lib64/python2.7/threading.py", line 811, in __bootstrap_inner
  self.run()
File: "/usr/lib64/python2.7/threading.py", line 764, in run
  self.__target(*self.__args, **self.__kwargs)
File: "/usr/lib/python2.6/site-packages/ambari_agent/apscheduler/threadpool.py", line 95, in _run_jobs
  func(*args, **kwargs)
File: "/usr/lib/python2.6/site-packages/ambari_agent/apscheduler/scheduler.py", line 512, in _run_job
  retval = job.func(*job.args, **job.kwargs)
File: "/usr/lib/python2.6/site-packages/ambari_agent/AlertSchedulerHandler.py", line 155, in <lambda>
  return lambda: alert_def.collect()
File: "/usr/lib/python2.6/site-packages/ambari_agent/alerts/base_alert.py", line 112, in collect
  res = self._collect()
File: "/usr/lib/python2.6/site-packages/ambari_agent/alerts/web_alert.py", line 102, in _collect
  web_response = self._make_web_request(url)
File: "/usr/lib/python2.6/site-packages/ambari_agent/alerts/web_alert.py", line 201, in _make_web_request
  connection_timeout=self.curl_connection_timeout, kinit_timer_ms = self.kinit_timeout)
File: "/usr/lib/python2.6/site-packages/resource_management/libraries/functions/curl_krb_request.py", line 122, in curl_krb_request
  kinit_lock.acquire()
File: "/usr/lib64/python2.7/threading.py", line 173, in acquire
  rc = self.__block.acquire(blocking)


*** STACKTRACE - END ***

4 REPLIES

Re: Ambari Agent 2.5.1 deadlock

New Contributor

Hi @Gonzalo Herreros,

You are right, there is a deadlock between threads 139795204400896 and 139794642351872 in your dump.

One of the locks is the Ambari-specific lock, added as an attempt to work around threading issues in the subprocess module that we have run into. The other one is the lock required for imports. It turns out that os.fork tries to acquire the import lock.

Thread 139795204400896 calls subprocess.Popen from a module being imported. Lock order is: import, subprocess.

Thread 139794642351872 simply calls subprocess.Popen, which calls os.fork. Lock order is: subprocess, import.
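
The lock-order inversion described above can be reproduced in isolation. This is a minimal sketch, not Ambari code: `import_lock` and `subprocess_lock` are hypothetical stand-ins for the interpreter import lock and the Ambari-specific lock. It assumes Python 3, for `Lock.acquire(timeout=...)` and `threading.Barrier`, so that the demonstration gives up after a short wait instead of hanging the way the agent does.

```python
import threading

# Hypothetical stand-ins: these are ordinary locks, not the real interpreter
# import lock or the lock installed by the agent.
import_lock = threading.Lock()
subprocess_lock = threading.Lock()


def demonstrate_inversion(timeout=0.3):
    """Return {thread name: True if that thread obtained its second lock}."""
    results = {}
    sync = threading.Barrier(2)

    def worker(name, first, second):
        with first:
            sync.wait()  # both threads now hold their first lock
            got_second = second.acquire(timeout=timeout)
            if got_second:
                second.release()
            results[name] = got_second
            sync.wait()  # keep holding `first` until the other attempt has ended

    threads = [
        # Like the status command: inside an import, then spawning a process.
        threading.Thread(
            target=worker,
            args=("import-then-subprocess", import_lock, subprocess_lock),
        ),
        # Like a plain Popen call: custom lock first, then fork wants the import lock.
        threading.Thread(
            target=worker,
            args=("subprocess-then-import", subprocess_lock, import_lock),
        ),
    ]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
    return results


if __name__ == "__main__":
    print(demonstrate_inversion())
```

Neither thread obtains its second lock. With blocking acquires and no timeout, both would wait forever, which matches the hung agent. Removing one of the two locks, or always taking them in the same order, makes the cycle impossible.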

We are now dropping the custom lock in favor of the subprocess32 module, which is a backport of the Python 3.2 subprocess implementation and is recommended even in the Python 2 docs.
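
For readers who want to try subprocess32 in their own scripts, the usual pattern is to import it under the name `subprocess` and fall back to the standard library when it is not installed. This is only a sketch of that import pattern. Whether the Ambari change uses this exact form is not stated in this thread, and `run_child` is a hypothetical helper added for illustration.

```python
import sys

try:
    # Third-party backport; only available if it has been installed.
    import subprocess32 as subprocess
except ImportError:
    # Fall back to the standard library module.
    import subprocess


def run_child(code):
    """Run a short Python snippet in a child process and return its stdout."""
    output = subprocess.check_output([sys.executable, "-c", code])
    return output.decode("utf-8").strip()


if __name__ == "__main__":
    print(run_child("print('hello from child')"))
```

Because the module is bound to the name `subprocess` either way, the rest of the script does not need to know which implementation it got.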

Thanks for reporting this, and sorry for taking so long to answer.

Re: Ambari Agent 2.5.1 deadlock

New Contributor

Hi @Doroszlai, Attila,

I am running into this issue as well. I tracked down the GitHub PR that has the proposed change for subprocess32 (https://github.com/apache/ambari/pull/313). Any idea when this fix will be released?

Thanks,

Emil

Re: Ambari Agent 2.5.1 deadlock

New Contributor

Emil, there is a very simple workaround.
On each node, edit /usr/lib/python2.6/site-packages/ambari_agent/main.py and comment out the line:
fix_subprocess_popen()

I have been running without that call with no issues.
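
If there are many nodes, the edit can be scripted. The following is a rough sketch under stated assumptions: `comment_out_call` is a hypothetical helper, not part of Ambari, and it assumes the call sits on a line of its own, as quoted above. It keeps a `.bak` copy of the original file, and it leaves the `def fix_subprocess_popen():` line alone because that line does not match exactly.

```python
import shutil


def comment_out_call(path, call="fix_subprocess_popen()"):
    """Comment out lines consisting solely of `call`; return how many changed."""
    with open(path) as handle:
        lines = handle.readlines()

    changed = 0
    patched = []
    for line in lines:
        if line.strip() == call:
            # Preserve the original indentation in front of the comment marker.
            indent = line[:len(line) - len(line.lstrip())]
            patched.append(indent + "# " + line.lstrip())
            changed += 1
        else:
            patched.append(line)

    if changed:
        shutil.copy2(path, path + ".bak")  # keep the original next to the edited file
        with open(path, "w") as handle:
            handle.writelines(patched)
    return changed
```

Calling `comment_out_call("/usr/lib/python2.6/site-packages/ambari_agent/main.py")` on a node would apply the edit described above. The agent has to be restarted afterwards to load the changed file, and running the helper a second time changes nothing.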

Re: Ambari Agent 2.5.1 deadlock

New Contributor

Thank you @Gonzalo Herreros. I will give this a try.