Created 07-10-2017 08:43 AM
I'm evaluating the recently released Ambari 2.5.1 to consider an upgrade.
However, I found that over a period ranging from a few minutes to a few hours, the Ambari agents lose their heartbeat one by one until eventually all of them are in a zombie state. The processes are running fine and there is no sign of errors in the logs; each agent just stops in the middle of its routine checks.
Taking a thread dump, I consistently found that a fork call doesn't return for some reason, and the rest of the threads are waiting for the lock that the forking thread holds (looking at the code, that synchronization is new in 2.5). The only solution is to restart ambari-agent; after a while the script has to kill -9 the process, since it doesn't respond to the stop signal. I'm running on VMs, which might be more prone to race conditions.
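To make the failure mode concrete, here is a minimal sketch of the pattern as I understand it. It is not the agent's code: `spawn_lock`, `stuck_spawn` and `try_spawn` are hypothetical names, the blocked fork is simulated with an event, and it is written for Python 3 rather than the 2.7 interpreter the agent runs on. It only shows that once one thread stalls while holding a shared lock, every other thread that needs the lock stalls as well.

```python
import threading

# Hypothetical stand-in for a process-wide lock taken around process creation.
spawn_lock = threading.Lock()
# Signals used only to drive the simulation deterministically.
holder_has_lock = threading.Event()
release_stuck_call = threading.Event()


def stuck_spawn():
    """Simulates a process-creation call that hangs while holding the lock."""
    with spawn_lock:
        holder_has_lock.set()
        # Stands in for the fork call that does not return.
        release_stuck_call.wait()


def try_spawn(timeout):
    """Another thread that needs the same lock; returns True if it got it."""
    acquired = spawn_lock.acquire(timeout=timeout)
    if acquired:
        spawn_lock.release()
    return acquired


holder = threading.Thread(target=stuck_spawn)
holder.start()
holder_has_lock.wait()

# While the simulated fork is stuck, nobody else can take the lock.
blocked = not try_spawn(timeout=0.2)

# Let the simulated fork "return"; the lock becomes available again.
release_stuck_call.set()
holder.join()
recovered = try_spawn(timeout=0.2)

print(blocked, recovered)
```

In the real agent the waiting threads call a plain blocking `acquire()` with no timeout, which would explain why they never recover on their own.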
I'm using a couple of homemade components, but they have been working fine for a year with Ambari 2.4, and I don't see how they could cause the fork call not to return.
A sample thread dump of the deadlock follows.
*** STACKTRACE - START *** # ThreadID: 139794667529984 File: "/usr/lib64/python2.7/threading.py", line 784, in __bootstrap self.__bootstrap_inner() File: "/usr/lib64/python2.7/threading.py", line 811, in __bootstrap_inner self.run() File: "/usr/lib64/python2.7/threading.py", line 764, in run self.__target(*self.__args, **self.__kwargs) File: "/usr/lib/python2.6/site-packages/ambari_agent/apscheduler/threadpool.py", line 95, in _run_jobs func(*args, **kwargs) File: "/usr/lib/python2.6/site-packages/ambari_agent/apscheduler/scheduler.py", line 512, in _run_job retval = job.func(*job.args, **job.kwargs) File: "/usr/lib/python2.6/site-packages/ambari_agent/AlertSchedulerHandler.py", line 155, in <lambda> return lambda: alert_def.collect() File: "/usr/lib/python2.6/site-packages/ambari_agent/alerts/base_alert.py", line 112, in collect res = self._collect() File: "/usr/lib/python2.6/site-packages/ambari_agent/alerts/metric_alert.py", line 97, in _collect jmx_property_values, http_code = self._load_jmx(alert_uri.is_ssl_enabled, host, port, self.metric_info) File: "/usr/lib/python2.6/site-packages/ambari_agent/alerts/metric_alert.py", line 212, in _load_jmx connection_timeout=self.curl_connection_timeout, kinit_timer_ms = self.kinit_timeout) File: "/usr/lib/python2.6/site-packages/resource_management/libraries/functions/curl_krb_request.py", line 94, in curl_krb_request import uuid # ThreadID: 139794650744576 File: "/usr/lib64/python2.7/threading.py", line 784, in __bootstrap self.__bootstrap_inner() File: "/usr/lib64/python2.7/threading.py", line 811, in __bootstrap_inner self.run() File: "/usr/lib64/python2.7/threading.py", line 764, in run self.__target(*self.__args, **self.__kwargs) File: "/usr/lib/python2.6/site-packages/ambari_agent/apscheduler/threadpool.py", line 95, in _run_jobs func(*args, **kwargs) File: "/usr/lib/python2.6/site-packages/ambari_agent/apscheduler/scheduler.py", line 512, in _run_job retval = job.func(*job.args, **job.kwargs) File: 
"/usr/lib/python2.6/site-packages/ambari_agent/AlertSchedulerHandler.py", line 155, in <lambda> return lambda: alert_def.collect() File: "/usr/lib/python2.6/site-packages/ambari_agent/alerts/base_alert.py", line 112, in collect res = self._collect() File: "/usr/lib/python2.6/site-packages/ambari_agent/alerts/metric_alert.py", line 97, in _collect jmx_property_values, http_code = self._load_jmx(alert_uri.is_ssl_enabled, host, port, self.metric_info) File: "/usr/lib/python2.6/site-packages/ambari_agent/alerts/metric_alert.py", line 212, in _load_jmx connection_timeout=self.curl_connection_timeout, kinit_timer_ms = self.kinit_timeout) File: "/usr/lib/python2.6/site-packages/resource_management/libraries/functions/curl_krb_request.py", line 94, in curl_krb_request import uuid # ThreadID: 139795212793600 File: "/usr/lib64/python2.7/threading.py", line 784, in __bootstrap self.__bootstrap_inner() File: "/usr/lib64/python2.7/threading.py", line 811, in __bootstrap_inner self.run() File: "/usr/lib/python2.6/site-packages/ambari_agent/Controller.py", line 497, in run self.registerAndHeartbeat() File: "/usr/lib/python2.6/site-packages/ambari_agent/Controller.py", line 525, in registerAndHeartbeat self.heartbeatWithServer() File: "/usr/lib/python2.6/site-packages/ambari_agent/Controller.py", line 313, in heartbeatWithServer data = json.dumps(self.heartbeat.build(self.responseId, send_state, self.hasMappedComponents)) File: "/usr/lib/python2.6/site-packages/ambari_agent/Heartbeat.py", line 46, in build queueResult = self.actionQueue.result() File: "/usr/lib/python2.6/site-packages/ambari_agent/ActionQueue.py", line 571, in result return self.commandStatuses.generate_report() File: "/usr/lib/python2.6/site-packages/ambari_agent/CommandStatusDict.py", line 88, in generate_report from ActionQueue import ActionQueue # ThreadID: 139795162437376 File: "/usr/lib64/python2.7/threading.py", line 784, in __bootstrap self.__bootstrap_inner() File: "/usr/lib64/python2.7/threading.py", 
line 811, in __bootstrap_inner self.run() File: "/usr/lib64/python2.7/threading.py", line 764, in run self.__target(*self.__args, **self.__kwargs) File: "/usr/lib/python2.6/site-packages/ambari_agent/apscheduler/threadpool.py", line 95, in _run_jobs func(*args, **kwargs) File: "/usr/lib/python2.6/site-packages/ambari_agent/apscheduler/scheduler.py", line 512, in _run_job retval = job.func(*job.args, **job.kwargs) File: "/usr/lib/python2.6/site-packages/ambari_agent/AlertSchedulerHandler.py", line 155, in <lambda> return lambda: alert_def.collect() File: "/usr/lib/python2.6/site-packages/ambari_agent/alerts/base_alert.py", line 112, in collect res = self._collect() File: "/usr/lib/python2.6/site-packages/ambari_agent/alerts/script_alert.py", line 90, in _collect cmd_module = self._load_source() File: "/usr/lib/python2.6/site-packages/ambari_agent/alerts/script_alert.py", line 172, in _load_source return imp.load_source(self._get_alert_meta_value_safely('name'), self.path_to_script) File: "/var/lib/ambari-agent/cache/common-services/AMBARI_METRICS/0.1.0/package/alerts/alert_ambari_metrics_monitor.py", line 21, in <module> import os # ThreadID: 139795435849472 File: "/usr/lib64/python2.7/threading.py", line 784, in __bootstrap self.__bootstrap_inner() File: "/usr/lib64/python2.7/threading.py", line 811, in __bootstrap_inner self.run() File: "/usr/lib/python2.6/site-packages/ambari_agent/PingPortListener.py", line 67, in run conn, addr = self.socket.accept() File: "/usr/lib64/python2.7/socket.py", line 202, in accept sock, addr = self._sock.accept() # ThreadID: 139795179222784 File: "/usr/lib64/python2.7/threading.py", line 784, in __bootstrap self.__bootstrap_inner() File: "/usr/lib64/python2.7/threading.py", line 811, in __bootstrap_inner self.run() File: "/usr/lib64/python2.7/threading.py", line 764, in run self.__target(*self.__args, **self.__kwargs) File: "/usr/lib/python2.6/site-packages/ambari_agent/apscheduler/threadpool.py", line 95, in _run_jobs func(*args, 
**kwargs) File: "/usr/lib/python2.6/site-packages/ambari_agent/apscheduler/scheduler.py", line 512, in _run_job retval = job.func(*job.args, **job.kwargs) File: "/usr/lib/python2.6/site-packages/ambari_agent/AlertSchedulerHandler.py", line 155, in <lambda> return lambda: alert_def.collect() File: "/usr/lib/python2.6/site-packages/ambari_agent/alerts/base_alert.py", line 112, in collect res = self._collect() File: "/usr/lib/python2.6/site-packages/ambari_agent/alerts/script_alert.py", line 115, in _collect result = cmd_module.execute(configurations, self.parameters, self.host_name) File: "/var/lib/ambari-agent/cache/common-services/HDFS/2.1.0.2.0/package/alerts/alert_checkpoint_time.py", line 189, in execute kinit_timer_ms = kinit_timer_ms) File: "/usr/lib/python2.6/site-packages/resource_management/libraries/functions/curl_krb_request.py", line 122, in curl_krb_request kinit_lock.acquire() File: "/usr/lib64/python2.7/threading.py", line 173, in acquire rc = self.__block.acquire(blocking) # ThreadID: 139794625566464 File: "/usr/lib64/python2.7/threading.py", line 784, in __bootstrap self.__bootstrap_inner() File: "/usr/lib64/python2.7/threading.py", line 811, in __bootstrap_inner self.run() File: "/usr/lib64/python2.7/threading.py", line 764, in run self.__target(*self.__args, **self.__kwargs) File: "/usr/lib/python2.6/site-packages/ambari_agent/apscheduler/threadpool.py", line 95, in _run_jobs func(*args, **kwargs) File: "/usr/lib/python2.6/site-packages/ambari_agent/apscheduler/scheduler.py", line 512, in _run_job retval = job.func(*job.args, **job.kwargs) File: "/usr/lib/python2.6/site-packages/ambari_agent/AlertSchedulerHandler.py", line 155, in <lambda> return lambda: alert_def.collect() File: "/usr/lib/python2.6/site-packages/ambari_agent/alerts/base_alert.py", line 112, in collect res = self._collect() File: "/usr/lib/python2.6/site-packages/ambari_agent/alerts/web_alert.py", line 102, in _collect web_response = self._make_web_request(url) File: 
"/usr/lib/python2.6/site-packages/ambari_agent/alerts/web_alert.py", line 201, in _make_web_request connection_timeout=self.curl_connection_timeout, kinit_timer_ms = self.kinit_timeout) File: "/usr/lib/python2.6/site-packages/resource_management/libraries/functions/curl_krb_request.py", line 94, in curl_krb_request import uuid # ThreadID: 139795423160064 File: "/usr/lib64/python2.7/threading.py", line 784, in __bootstrap self.__bootstrap_inner() File: "/usr/lib64/python2.7/threading.py", line 811, in __bootstrap_inner self.run() File: "/usr/lib64/python2.7/threading.py", line 764, in run self.__target(*self.__args, **self.__kwargs) File: "/usr/lib/python2.6/site-packages/ambari_agent/apscheduler/scheduler.py", line 590, in _main_loop self._wakeup.wait(wait_seconds) File: "/usr/lib64/python2.7/threading.py", line 621, in wait self.__cond.wait(timeout, balancing) File: "/usr/lib64/python2.7/threading.py", line 361, in wait _sleep(delay) # ThreadID: 139795170830080 File: "/usr/lib64/python2.7/threading.py", line 784, in __bootstrap self.__bootstrap_inner() File: "/usr/lib64/python2.7/threading.py", line 811, in __bootstrap_inner self.run() File: "/usr/lib64/python2.7/threading.py", line 764, in run self.__target(*self.__args, **self.__kwargs) File: "/usr/lib/python2.6/site-packages/ambari_agent/apscheduler/threadpool.py", line 95, in _run_jobs func(*args, **kwargs) File: "/usr/lib/python2.6/site-packages/ambari_agent/apscheduler/scheduler.py", line 512, in _run_job retval = job.func(*job.args, **job.kwargs) File: "/usr/lib/python2.6/site-packages/ambari_agent/AlertSchedulerHandler.py", line 155, in <lambda> return lambda: alert_def.collect() File: "/usr/lib/python2.6/site-packages/ambari_agent/alerts/base_alert.py", line 112, in collect res = self._collect() File: "/usr/lib/python2.6/site-packages/ambari_agent/alerts/metric_alert.py", line 97, in _collect jmx_property_values, http_code = self._load_jmx(alert_uri.is_ssl_enabled, host, port, self.metric_info) File: 
"/usr/lib/python2.6/site-packages/ambari_agent/alerts/metric_alert.py", line 212, in _load_jmx connection_timeout=self.curl_connection_timeout, kinit_timer_ms = self.kinit_timeout) File: "/usr/lib/python2.6/site-packages/resource_management/libraries/functions/curl_krb_request.py", line 122, in curl_krb_request kinit_lock.acquire() File: "/usr/lib64/python2.7/threading.py", line 173, in acquire rc = self.__block.acquire(blocking) # ThreadID: 139794105480960 File: "/usr/lib64/python2.7/threading.py", line 784, in __bootstrap self.__bootstrap_inner() File: "/usr/lib64/python2.7/threading.py", line 811, in __bootstrap_inner self.run() File: "/usr/lib64/python2.7/threading.py", line 764, in run self.__target(*self.__args, **self.__kwargs) File: "/usr/lib/python2.6/site-packages/ambari_agent/apscheduler/threadpool.py", line 95, in _run_jobs func(*args, **kwargs) File: "/usr/lib/python2.6/site-packages/ambari_agent/apscheduler/scheduler.py", line 512, in _run_job retval = job.func(*job.args, **job.kwargs) File: "/usr/lib/python2.6/site-packages/ambari_agent/AlertSchedulerHandler.py", line 155, in <lambda> return lambda: alert_def.collect() File: "/usr/lib/python2.6/site-packages/ambari_agent/alerts/base_alert.py", line 112, in collect res = self._collect() File: "/usr/lib/python2.6/site-packages/ambari_agent/alerts/web_alert.py", line 102, in _collect web_response = self._make_web_request(url) File: "/usr/lib/python2.6/site-packages/ambari_agent/alerts/web_alert.py", line 201, in _make_web_request connection_timeout=self.curl_connection_timeout, kinit_timer_ms = self.kinit_timeout) File: "/usr/lib/python2.6/site-packages/resource_management/libraries/functions/curl_krb_request.py", line 94, in curl_krb_request import uuid # ThreadID: 139795187615488 File: "/usr/lib64/python2.7/threading.py", line 784, in __bootstrap self.__bootstrap_inner() File: "/usr/lib64/python2.7/threading.py", line 811, in __bootstrap_inner self.run() File: "/usr/lib64/python2.7/threading.py", line 
764, in run self.__target(*self.__args, **self.__kwargs) File: "/usr/lib/python2.6/site-packages/ambari_agent/apscheduler/threadpool.py", line 95, in _run_jobs func(*args, **kwargs) File: "/usr/lib/python2.6/site-packages/ambari_agent/apscheduler/scheduler.py", line 512, in _run_job retval = job.func(*job.args, **job.kwargs) File: "/usr/lib/python2.6/site-packages/ambari_agent/AlertSchedulerHandler.py", line 155, in <lambda> return lambda: alert_def.collect() File: "/usr/lib/python2.6/site-packages/ambari_agent/alerts/base_alert.py", line 112, in collect res = self._collect() File: "/usr/lib/python2.6/site-packages/ambari_agent/alerts/web_alert.py", line 102, in _collect web_response = self._make_web_request(url) File: "/usr/lib/python2.6/site-packages/ambari_agent/alerts/web_alert.py", line 201, in _make_web_request connection_timeout=self.curl_connection_timeout, kinit_timer_ms = self.kinit_timeout) File: "/usr/lib/python2.6/site-packages/resource_management/libraries/functions/curl_krb_request.py", line 94, in curl_krb_request import uuid # ThreadID: 139793585395456 File: "/usr/lib64/python2.7/threading.py", line 784, in __bootstrap self.__bootstrap_inner() File: "/usr/lib64/python2.7/threading.py", line 811, in __bootstrap_inner self.run() File: "/usr/lib64/python2.7/threading.py", line 764, in run self.__target(*self.__args, **self.__kwargs) File: "/usr/lib/python2.6/site-packages/ambari_agent/apscheduler/threadpool.py", line 95, in _run_jobs func(*args, **kwargs) File: "/usr/lib/python2.6/site-packages/ambari_agent/apscheduler/scheduler.py", line 512, in _run_job retval = job.func(*job.args, **job.kwargs) File: "/usr/lib/python2.6/site-packages/ambari_agent/AlertSchedulerHandler.py", line 155, in <lambda> return lambda: alert_def.collect() File: "/usr/lib/python2.6/site-packages/ambari_agent/alerts/base_alert.py", line 112, in collect res = self._collect() File: "/usr/lib/python2.6/site-packages/ambari_agent/alerts/web_alert.py", line 102, in _collect 
web_response = self._make_web_request(url) File: "/usr/lib/python2.6/site-packages/ambari_agent/alerts/web_alert.py", line 201, in _make_web_request connection_timeout=self.curl_connection_timeout, kinit_timer_ms = self.kinit_timeout) File: "/usr/lib/python2.6/site-packages/resource_management/libraries/functions/curl_krb_request.py", line 122, in curl_krb_request kinit_lock.acquire() File: "/usr/lib64/python2.7/threading.py", line 173, in acquire rc = self.__block.acquire(blocking) # ThreadID: 139793593788160 File: "/usr/lib64/python2.7/threading.py", line 784, in __bootstrap self.__bootstrap_inner() File: "/usr/lib64/python2.7/threading.py", line 811, in __bootstrap_inner self.run() File: "/usr/lib64/python2.7/threading.py", line 764, in run self.__target(*self.__args, **self.__kwargs) File: "/usr/lib/python2.6/site-packages/ambari_agent/apscheduler/threadpool.py", line 95, in _run_jobs func(*args, **kwargs) File: "/usr/lib/python2.6/site-packages/ambari_agent/apscheduler/scheduler.py", line 512, in _run_job retval = job.func(*job.args, **job.kwargs) File: "/usr/lib/python2.6/site-packages/ambari_agent/AlertSchedulerHandler.py", line 155, in <lambda> return lambda: alert_def.collect() File: "/usr/lib/python2.6/site-packages/ambari_agent/alerts/base_alert.py", line 112, in collect res = self._collect() File: "/usr/lib/python2.6/site-packages/ambari_agent/alerts/metric_alert.py", line 97, in _collect jmx_property_values, http_code = self._load_jmx(alert_uri.is_ssl_enabled, host, port, self.metric_info) File: "/usr/lib/python2.6/site-packages/ambari_agent/alerts/metric_alert.py", line 216, in _load_jmx url_opener = urllib2.build_opener(RefreshHeaderProcessor()) File: "/usr/lib64/python2.7/urllib2.py", line 490, in build_opener import types # ThreadID: 139794642351872 File: "/usr/lib64/python2.7/threading.py", line 784, in __bootstrap self.__bootstrap_inner() File: "/usr/lib64/python2.7/threading.py", line 811, in __bootstrap_inner self.run() File: 
"/usr/lib64/python2.7/threading.py", line 764, in run self.__target(*self.__args, **self.__kwargs) File: "/usr/lib/python2.6/site-packages/ambari_agent/apscheduler/threadpool.py", line 95, in _run_jobs func(*args, **kwargs) File: "/usr/lib/python2.6/site-packages/ambari_agent/apscheduler/scheduler.py", line 512, in _run_job retval = job.func(*job.args, **job.kwargs) File: "/usr/lib/python2.6/site-packages/ambari_agent/AlertSchedulerHandler.py", line 155, in <lambda> return lambda: alert_def.collect() File: "/usr/lib/python2.6/site-packages/ambari_agent/alerts/base_alert.py", line 112, in collect res = self._collect() File: "/usr/lib/python2.6/site-packages/ambari_agent/alerts/script_alert.py", line 115, in _collect result = cmd_module.execute(configurations, self.parameters, self.host_name) File: "/var/lib/ambari-agent/cache/common-services/YARN/2.1.0.2.0/package/alerts/alert_nodemanagers_summary.py", line 138, in execute kinit_timer_ms = kinit_timer_ms) File: "/usr/lib/python2.6/site-packages/resource_management/libraries/functions/curl_krb_request.py", line 142, in curl_krb_request is_kinit_required = (shell.call(klist_command, user=user)[0] != 0) File: "/usr/lib/python2.6/site-packages/resource_management/core/shell.py", line 72, in inner result = function(command, **kwargs) File: "/usr/lib/python2.6/site-packages/resource_management/core/shell.py", line 115, in call tries=tries, try_sleep=try_sleep, timeout_kill_strategy=timeout_kill_strategy) File: "/usr/lib/python2.6/site-packages/resource_management/core/shell.py", line 150, in _call_wrapper result = _call(command, **kwargs_copy) File: "/usr/lib/python2.6/site-packages/resource_management/core/shell.py", line 223, in _call preexec_fn=preexec_fn) File: "/usr/lib/python2.6/site-packages/ambari_agent/main.py", line 66, in sp_locked_init sp_original_init(self, *a, **kw) File: "/usr/lib64/python2.7/subprocess.py", line 711, in __init__ errread, errwrite) File: "/usr/lib64/python2.7/subprocess.py", line 1224, in 
_execute_child self.pid = os.fork() # ThreadID: 139794633959168 File: "/usr/lib64/python2.7/threading.py", line 784, in __bootstrap self.__bootstrap_inner() File: "/usr/lib64/python2.7/threading.py", line 811, in __bootstrap_inner self.run() File: "/usr/lib64/python2.7/threading.py", line 764, in run self.__target(*self.__args, **self.__kwargs) File: "/usr/lib/python2.6/site-packages/ambari_agent/apscheduler/threadpool.py", line 95, in _run_jobs func(*args, **kwargs) File: "/usr/lib/python2.6/site-packages/ambari_agent/apscheduler/scheduler.py", line 512, in _run_job retval = job.func(*job.args, **job.kwargs) File: "/usr/lib/python2.6/site-packages/ambari_agent/AlertSchedulerHandler.py", line 155, in <lambda> return lambda: alert_def.collect() File: "/usr/lib/python2.6/site-packages/ambari_agent/alerts/base_alert.py", line 112, in collect res = self._collect() File: "/usr/lib/python2.6/site-packages/ambari_agent/alerts/script_alert.py", line 115, in _collect result = cmd_module.execute(configurations, self.parameters, self.host_name) File: "/var/lib/ambari-agent/cache/common-services/YARN/2.1.0.2.0/package/alerts/alert_nodemanager_health.py", line 166, in execute connection_timeout=curl_connection_timeout, kinit_timer_ms = kinit_timer_ms) File: "/usr/lib/python2.6/site-packages/resource_management/libraries/functions/curl_krb_request.py", line 209, in curl_krb_request _, curl_stdout, curl_stderr = get_user_call_output(curl_command, user=user, env=kerberos_env) File: "/usr/lib/python2.6/site-packages/resource_management/libraries/functions/get_user_call_output.py", line 50, in get_user_call_output code, _ = shell.call(shell.as_user(command_string, user), quiet=quiet, **call_kwargs) File: "/usr/lib/python2.6/site-packages/resource_management/core/shell.py", line 61, in inner Logger.info(log_msg) File: "/usr/lib/python2.6/site-packages/resource_management/core/logger.py", line 75, in info Logger.logger.info(Logger.filter_text(text)) File: 
"/usr/lib/python2.6/site-packages/resource_management/core/logger.py", line 102, in filter_text from resource_management.core.shell import PLACEHOLDERS_TO_STR # ThreadID: 139795196008192 File: "/usr/lib64/python2.7/threading.py", line 784, in __bootstrap self.__bootstrap_inner() File: "/usr/lib64/python2.7/threading.py", line 811, in __bootstrap_inner self.run() File: "/usr/lib64/python2.7/threading.py", line 764, in run self.__target(*self.__args, **self.__kwargs) File: "/usr/lib/python2.6/site-packages/ambari_agent/apscheduler/threadpool.py", line 95, in _run_jobs func(*args, **kwargs) File: "/usr/lib/python2.6/site-packages/ambari_agent/apscheduler/scheduler.py", line 512, in _run_job retval = job.func(*job.args, **job.kwargs) File: "/usr/lib/python2.6/site-packages/ambari_agent/AlertSchedulerHandler.py", line 155, in <lambda> return lambda: alert_def.collect() File: "/usr/lib/python2.6/site-packages/ambari_agent/alerts/base_alert.py", line 112, in collect res = self._collect() File: "/usr/lib/python2.6/site-packages/ambari_agent/alerts/script_alert.py", line 115, in _collect result = cmd_module.execute(configurations, self.parameters, self.host_name) File: "/var/lib/ambari-agent/cache/common-services/HIVE/0.12.0.2.0/package/alerts/alert_webhcat_server.py", line 160, in execute kinit_timer_ms = kinit_timer_ms) File: "/usr/lib/python2.6/site-packages/resource_management/libraries/functions/curl_krb_request.py", line 122, in curl_krb_request kinit_lock.acquire() File: "/usr/lib64/python2.7/threading.py", line 173, in acquire rc = self.__block.acquire(blocking) # ThreadID: 139794675922688 File: "/usr/lib64/python2.7/threading.py", line 784, in __bootstrap self.__bootstrap_inner() File: "/usr/lib64/python2.7/threading.py", line 811, in __bootstrap_inner self.run() File: "/usr/lib64/python2.7/threading.py", line 764, in run self.__target(*self.__args, **self.__kwargs) File: "/usr/lib/python2.6/site-packages/ambari_agent/apscheduler/threadpool.py", line 95, in 
_run_jobs func(*args, **kwargs) File: "/usr/lib/python2.6/site-packages/ambari_agent/apscheduler/scheduler.py", line 512, in _run_job retval = job.func(*job.args, **job.kwargs) File: "/usr/lib/python2.6/site-packages/ambari_agent/AlertSchedulerHandler.py", line 155, in <lambda> return lambda: alert_def.collect() File: "/usr/lib/python2.6/site-packages/ambari_agent/alerts/base_alert.py", line 112, in collect res = self._collect() File: "/usr/lib/python2.6/site-packages/ambari_agent/alerts/metric_alert.py", line 97, in _collect jmx_property_values, http_code = self._load_jmx(alert_uri.is_ssl_enabled, host, port, self.metric_info) File: "/usr/lib/python2.6/site-packages/ambari_agent/alerts/metric_alert.py", line 212, in _load_jmx connection_timeout=self.curl_connection_timeout, kinit_timer_ms = self.kinit_timeout) File: "/usr/lib/python2.6/site-packages/resource_management/libraries/functions/curl_krb_request.py", line 122, in curl_krb_request kinit_lock.acquire() File: "/usr/lib64/python2.7/threading.py", line 173, in acquire rc = self.__block.acquire(blocking) # ThreadID: 139795680462656 File: "/usr/lib/python2.6/site-packages/ambari_agent/main.py", line 472, in <module> main(heartbeat_stop_callback) File: "/usr/lib/python2.6/site-packages/ambari_agent/main.py", line 451, in main run_threads(server_hostname, heartbeat_stop_callback) File: "/usr/lib/python2.6/site-packages/ambari_agent/main.py", line 335, in run_threads time.sleep(0.1) File: "/usr/lib/python2.6/site-packages/ambari_agent/RemoteDebugUtils.py", line 35, in print_threads_stack_traces for filename, lineno, name, line in traceback.extract_stack(stack): # ThreadID: 139794097088256 File: "/usr/lib64/python2.7/threading.py", line 784, in __bootstrap self.__bootstrap_inner() File: "/usr/lib64/python2.7/threading.py", line 811, in __bootstrap_inner self.run() File: "/usr/lib64/python2.7/threading.py", line 764, in run self.__target(*self.__args, **self.__kwargs) File: 
"/usr/lib/python2.6/site-packages/ambari_agent/apscheduler/threadpool.py", line 95, in _run_jobs func(*args, **kwargs) File: "/usr/lib/python2.6/site-packages/ambari_agent/apscheduler/scheduler.py", line 512, in _run_job retval = job.func(*job.args, **job.kwargs) File: "/usr/lib/python2.6/site-packages/ambari_agent/AlertSchedulerHandler.py", line 155, in <lambda> return lambda: alert_def.collect() File: "/usr/lib/python2.6/site-packages/ambari_agent/alerts/base_alert.py", line 112, in collect res = self._collect() File: "/usr/lib/python2.6/site-packages/ambari_agent/alerts/web_alert.py", line 102, in _collect web_response = self._make_web_request(url) File: "/usr/lib/python2.6/site-packages/ambari_agent/alerts/web_alert.py", line 201, in _make_web_request connection_timeout=self.curl_connection_timeout, kinit_timer_ms = self.kinit_timeout) File: "/usr/lib/python2.6/site-packages/resource_management/libraries/functions/curl_krb_request.py", line 122, in curl_krb_request kinit_lock.acquire() File: "/usr/lib64/python2.7/threading.py", line 173, in acquire rc = self.__block.acquire(blocking) # ThreadID: 139794113873664 File: "/usr/lib64/python2.7/threading.py", line 784, in __bootstrap self.__bootstrap_inner() File: "/usr/lib64/python2.7/threading.py", line 811, in __bootstrap_inner self.run() File: "/usr/lib64/python2.7/threading.py", line 764, in run self.__target(*self.__args, **self.__kwargs) File: "/usr/lib/python2.6/site-packages/ambari_agent/apscheduler/threadpool.py", line 95, in _run_jobs func(*args, **kwargs) File: "/usr/lib/python2.6/site-packages/ambari_agent/apscheduler/scheduler.py", line 512, in _run_job retval = job.func(*job.args, **job.kwargs) File: "/usr/lib/python2.6/site-packages/ambari_agent/AlertSchedulerHandler.py", line 155, in <lambda> return lambda: alert_def.collect() File: "/usr/lib/python2.6/site-packages/ambari_agent/alerts/base_alert.py", line 112, in collect res = self._collect() File: 
"/usr/lib/python2.6/site-packages/ambari_agent/alerts/script_alert.py", line 115, in _collect result = cmd_module.execute(configurations, self.parameters, self.host_name) File: "/var/lib/ambari-agent/cache/common-services/HDFS/2.1.0.2.0/package/alerts/alert_upgrade_finalized.py", line 132, in execute "HDFS Upgrade Finalized State", smokeuser, kinit_timer_ms = kinit_timer_ms File: "/usr/lib/python2.6/site-packages/resource_management/libraries/functions/curl_krb_request.py", line 122, in curl_krb_request kinit_lock.acquire() File: "/usr/lib64/python2.7/threading.py", line 173, in acquire rc = self.__block.acquire(blocking) # ThreadID: 139795444242176 File: "/usr/lib64/python2.7/threading.py", line 784, in __bootstrap self.__bootstrap_inner() File: "/usr/lib64/python2.7/threading.py", line 811, in __bootstrap_inner self.run() File: "/usr/lib/python2.6/site-packages/ambari_agent/DataCleaner.py", line 123, in run time.sleep(self.cleanup_interval) # ThreadID: 139794139051776 File: "/usr/lib64/python2.7/threading.py", line 784, in __bootstrap self.__bootstrap_inner() File: "/usr/lib64/python2.7/threading.py", line 811, in __bootstrap_inner self.run() File: "/usr/lib64/python2.7/threading.py", line 764, in run self.__target(*self.__args, **self.__kwargs) File: "/usr/lib/python2.6/site-packages/ambari_agent/apscheduler/threadpool.py", line 95, in _run_jobs func(*args, **kwargs) File: "/usr/lib/python2.6/site-packages/ambari_agent/apscheduler/scheduler.py", line 512, in _run_job retval = job.func(*job.args, **job.kwargs) File: "/usr/lib/python2.6/site-packages/ambari_agent/AlertSchedulerHandler.py", line 155, in <lambda> return lambda: alert_def.collect() File: "/usr/lib/python2.6/site-packages/ambari_agent/alerts/base_alert.py", line 112, in collect res = self._collect() File: "/usr/lib/python2.6/site-packages/ambari_agent/alerts/web_alert.py", line 102, in _collect web_response = self._make_web_request(url) File: 
"/usr/lib/python2.6/site-packages/ambari_agent/alerts/web_alert.py", line 201, in _make_web_request connection_timeout=self.curl_connection_timeout, kinit_timer_ms = self.kinit_timeout) File: "/usr/lib/python2.6/site-packages/resource_management/libraries/functions/curl_krb_request.py", line 122, in curl_krb_request kinit_lock.acquire() File: "/usr/lib64/python2.7/threading.py", line 173, in acquire rc = self.__block.acquire(blocking) # ThreadID: 139794659137280 File: "/usr/lib64/python2.7/threading.py", line 784, in __bootstrap self.__bootstrap_inner() File: "/usr/lib64/python2.7/threading.py", line 811, in __bootstrap_inner self.run() File: "/usr/lib64/python2.7/threading.py", line 764, in run self.__target(*self.__args, **self.__kwargs) File: "/usr/lib/python2.6/site-packages/ambari_agent/apscheduler/threadpool.py", line 95, in _run_jobs func(*args, **kwargs) File: "/usr/lib/python2.6/site-packages/ambari_agent/apscheduler/scheduler.py", line 512, in _run_job retval = job.func(*job.args, **job.kwargs) File: "/usr/lib/python2.6/site-packages/ambari_agent/AlertSchedulerHandler.py", line 155, in <lambda> return lambda: alert_def.collect() File: "/usr/lib/python2.6/site-packages/ambari_agent/alerts/base_alert.py", line 112, in collect res = self._collect() File: "/usr/lib/python2.6/site-packages/ambari_agent/alerts/web_alert.py", line 102, in _collect web_response = self._make_web_request(url) File: "/usr/lib/python2.6/site-packages/ambari_agent/alerts/web_alert.py", line 201, in _make_web_request connection_timeout=self.curl_connection_timeout, kinit_timer_ms = self.kinit_timeout) File: "/usr/lib/python2.6/site-packages/resource_management/libraries/functions/curl_krb_request.py", line 122, in curl_krb_request kinit_lock.acquire() File: "/usr/lib64/python2.7/threading.py", line 173, in acquire rc = self.__block.acquire(blocking) # ThreadID: 139795204400896 File: "/usr/lib64/python2.7/threading.py", line 784, in __bootstrap self.__bootstrap_inner() File: 
"/usr/lib64/python2.7/threading.py", line 811, in __bootstrap_inner self.run() File: "/usr/lib/python2.6/site-packages/ambari_agent/ActionQueue.py", line 149, in run self.controller.get_status_commands_executor().process_results() # process status commands File: "/usr/lib/python2.6/site-packages/ambari_agent/StatusCommandsExecutor.py", line 76, in process_results self.actionQueue.process_status_command_result(self.actionQueue.execute_status_command_and_security_status(command)) File: "/usr/lib/python2.6/site-packages/ambari_agent/ActionQueue.py", line 500, in execute_status_command_and_security_status component_status_result = self.customServiceOrchestrator.requestComponentStatus(command) File: "/usr/lib/python2.6/site-packages/ambari_agent/CustomServiceOrchestrator.py", line 471, in requestComponentStatus override_output_files=override_output_files) File: "/usr/lib/python2.6/site-packages/ambari_agent/CustomServiceOrchestrator.py", line 412, in runCommand handle = handle, log_info_on_failure=log_info_on_failure) File: "/usr/lib/python2.6/site-packages/ambari_agent/PythonReflectiveExecutor.py", line 59, in run_file imp.load_source('__main__', script) File: "/var/lib/ambari-agent/cache/common-services/REST_API/1.1.0-SNAPSHOT/package/scripts/play.py", line 192, in <module> PlayServer().execute() File: "/usr/lib/python2.6/site-packages/resource_management/libraries/script/script.py", line 329, in execute method(env) File: "/var/lib/ambari-agent/cache/common-services/REST_API/1.1.0-SNAPSHOT/package/scripts/play.py", line 41, in status from env_params import pid_file File: "/var/lib/ambari-agent/cache/common-services/REST_API/1.1.0-SNAPSHOT/package/scripts/env_params.py", line 4, in <module> from install_params import deploy_dir File: "/var/lib/ambari-agent/cache/common-services/REST_API/1.1.0-SNAPSHOT/package/scripts/install_params.py", line 6, in <module> code, hdp_version = call("hdp-select versions | tail -1") File: 
"/usr/lib/python2.6/site-packages/resource_management/core/shell.py", line 72, in inner result = function(command, **kwargs) File: "/usr/lib/python2.6/site-packages/resource_management/core/shell.py", line 115, in call tries=tries, try_sleep=try_sleep, timeout_kill_strategy=timeout_kill_strategy) File: "/usr/lib/python2.6/site-packages/resource_management/core/shell.py", line 150, in _call_wrapper result = _call(command, **kwargs_copy) File: "/usr/lib/python2.6/site-packages/resource_management/core/shell.py", line 223, in _call preexec_fn=preexec_fn) File: "/usr/lib/python2.6/site-packages/ambari_agent/main.py", line 65, in sp_locked_init with lock: File: "/usr/lib64/python2.7/threading.py", line 173, in acquire rc = self.__block.acquire(blocking) # ThreadID: 139794130659072 File: "/usr/lib64/python2.7/threading.py", line 784, in __bootstrap self.__bootstrap_inner() File: "/usr/lib64/python2.7/threading.py", line 811, in __bootstrap_inner self.run() File: "/usr/lib64/python2.7/threading.py", line 764, in run self.__target(*self.__args, **self.__kwargs) File: "/usr/lib/python2.6/site-packages/ambari_agent/apscheduler/threadpool.py", line 95, in _run_jobs func(*args, **kwargs) File: "/usr/lib/python2.6/site-packages/ambari_agent/apscheduler/scheduler.py", line 512, in _run_job retval = job.func(*job.args, **job.kwargs) File: "/usr/lib/python2.6/site-packages/ambari_agent/AlertSchedulerHandler.py", line 155, in <lambda> return lambda: alert_def.collect() File: "/usr/lib/python2.6/site-packages/ambari_agent/alerts/base_alert.py", line 112, in collect res = self._collect() File: "/usr/lib/python2.6/site-packages/ambari_agent/alerts/metric_alert.py", line 97, in _collect jmx_property_values, http_code = self._load_jmx(alert_uri.is_ssl_enabled, host, port, self.metric_info) File: "/usr/lib/python2.6/site-packages/ambari_agent/alerts/metric_alert.py", line 212, in _load_jmx connection_timeout=self.curl_connection_timeout, kinit_timer_ms = self.kinit_timeout) File: 
"/usr/lib/python2.6/site-packages/resource_management/libraries/functions/curl_krb_request.py", line 94, in curl_krb_request import uuid # ThreadID: 139794122266368 File: "/usr/lib64/python2.7/threading.py", line 784, in __bootstrap self.__bootstrap_inner() File: "/usr/lib64/python2.7/threading.py", line 811, in __bootstrap_inner self.run() File: "/usr/lib64/python2.7/threading.py", line 764, in run self.__target(*self.__args, **self.__kwargs) File: "/usr/lib/python2.6/site-packages/ambari_agent/apscheduler/threadpool.py", line 95, in _run_jobs func(*args, **kwargs) File: "/usr/lib/python2.6/site-packages/ambari_agent/apscheduler/scheduler.py", line 512, in _run_job retval = job.func(*job.args, **job.kwargs) File: "/usr/lib/python2.6/site-packages/ambari_agent/AlertSchedulerHandler.py", line 155, in <lambda> return lambda: alert_def.collect() File: "/usr/lib/python2.6/site-packages/ambari_agent/alerts/base_alert.py", line 112, in collect res = self._collect() File: "/usr/lib/python2.6/site-packages/ambari_agent/alerts/web_alert.py", line 102, in _collect web_response = self._make_web_request(url) File: "/usr/lib/python2.6/site-packages/ambari_agent/alerts/web_alert.py", line 201, in _make_web_request connection_timeout=self.curl_connection_timeout, kinit_timer_ms = self.kinit_timeout) File: "/usr/lib/python2.6/site-packages/resource_management/libraries/functions/curl_krb_request.py", line 122, in curl_krb_request kinit_lock.acquire() File: "/usr/lib64/python2.7/threading.py", line 173, in acquire rc = self.__block.acquire(blocking) *** STACKTRACE - END ***
Created 02-09-2018 05:25 PM
You are right, there is a deadlock between threads 139795204400896 and 139794642351872 in your dump.
One of the locks is the Ambari-specific lock, added as an attempt to work around threading issues we have run into in the subprocess module. The other is the interpreter's import lock, which is held while a module is being imported. It turns out that os.fork also tries to acquire the import lock.
Thread 139795204400896 calls subprocess.Popen from a module being imported. Lock order is: import, subprocess.
Thread 139794642351872 simply calls subprocess.Popen, which calls os.fork. Lock order is: subprocess, import.
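To make the lock-order inversion concrete, here is a minimal sketch that reproduces the same pattern with two plain locks. It is not the agent's code: the lock names, thread names, and the function demo_lock_order_inversion are made up for illustration, and timeouts are used so the demo reports the deadlock instead of hanging.

```python
import threading


def demo_lock_order_inversion(timeout=0.5):
    """Return the (sorted) names of threads that could not get their second lock."""
    # Hypothetical stand-ins: these are NOT the real interpreter or Ambari locks.
    import_lock = threading.Lock()      # plays the role of the import lock
    subprocess_lock = threading.Lock()  # plays the role of the custom Popen lock

    both_hold_first = threading.Barrier(2)  # both threads own their first lock
    both_tried = threading.Barrier(2)       # both threads finished their attempt
    stuck = []

    def worker(name, first, second):
        with first:
            both_hold_first.wait()
            # A real deadlock would block here forever; the timeout lets us observe it.
            if second.acquire(timeout=timeout):
                second.release()
            else:
                stuck.append(name)
            # Keep holding `first` until the other thread has also given up.
            both_tried.wait()

    threads = [
        # Order: import, subprocess (Popen called from a module being imported)
        threading.Thread(target=worker,
                         args=("importing_thread", import_lock, subprocess_lock)),
        # Order: subprocess, import (plain Popen, then fork wants the import lock)
        threading.Thread(target=worker,
                         args=("plain_popen_thread", subprocess_lock, import_lock)),
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return sorted(stuck)


if __name__ == "__main__":
    print(demo_lock_order_inversion())
```

Both threads end up in the returned list because each one holds the lock the other is waiting for. Without the timeouts, neither would ever proceed, which matches what the thread dump shows.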
We are now dropping the custom lock in favor of the subprocess32 module, which is a backport of the Python 3.2 subprocess implementation and is recommended even in the Python 2 docs.
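For anyone with custom scripts that spawn processes, a common way to prefer subprocess32 when it is available looks roughly like the sketch below. This is an illustration only, not the actual Ambari change; run_command is a hypothetical helper name, and the fallback branch assumes the standard subprocess module is acceptable when the backport is not installed.

```python
import sys

try:
    # Python 2 with the backport installed (the PyPI package is named subprocess32)
    import subprocess32 as subprocess
except ImportError:
    # Python 3, or the backport is not installed: use the standard module
    import subprocess


def run_command(argv):
    """Run argv without a shell and return (exit_code, output_text)."""
    proc = subprocess.Popen(argv,
                            stdout=subprocess.PIPE,
                            stderr=subprocess.STDOUT)
    out, _ = proc.communicate()
    return proc.returncode, out.decode("utf-8", "replace")


if __name__ == "__main__":
    # Use the current interpreter as a portable test command.
    print(run_command([sys.executable, "-c", "print('ok')"]))
```

The rest of the code keeps referring to the name subprocess, so the only thing that changes is which implementation sits behind it.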
Thanks for reporting this, and sorry for taking so long to answer.
Created 04-08-2018 05:48 AM
I am running into this issue as well. I tracked down the GitHub PR with the proposed change to subprocess32 (https://github.com/apache/ambari/pull/313). Any idea when this fix will be released?
Thanks,
Emil
Created 04-09-2018 10:30 AM
Emil, there is a very simple workaround.
On each node edit /usr/lib/python2.6/site-packages/ambari_agent/main.py and comment out the line:
fix_subprocess_popen()
I have been running without it with no issues.
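If you have many nodes, the edit can be scripted. The helper below is only a sketch: disable_call is a hypothetical name, and the sample source text in the usage is invented, not the real contents of main.py. It replaces the call with pass rather than just commenting it out, so the file stays valid even if the call happens to be the only statement in a block.

```python
import re


def disable_call(source, call="fix_subprocess_popen()"):
    """Replace lines consisting only of `call` with `pass`, keeping the call as a comment."""
    pattern = re.compile(
        r"^([ \t]*)" + re.escape(call) + r"[ \t]*$",  # the call alone on its line
        re.MULTILINE,
    )
    # Preserve the original indentation captured in group 1.
    return pattern.sub(lambda m: m.group(1) + "pass  # " + call, source)


if __name__ == "__main__":
    # Invented sample text for illustration; not the real main.py.
    sample = "def main():\n    setup()\n    fix_subprocess_popen()\n    run()\n"
    print(disable_call(sample))
```

Keep a backup of the original file before writing the result back. The change only takes effect once the agent is restarted.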
Created 04-09-2018 05:06 PM
Thank you @Gonzalo Herreros. I will give this a try.