Member since
11-01-2017
3
Posts
0
Kudos Received
1
Solution
My Accepted Solutions
Title | Views | Posted |
---|---|---|
5016 | 11-13-2017 12:40 AM |
11-13-2017
12:40 AM
Thank you very much for your reply. I check all the processes. But i didn't found something wrong. Finally, i checked the source code of the cloudera-scm-agent(/usr/lib64/cmf/agent/build/env/lib/python2.6/site-packages/cmf-5.10.0-py2.6.egg/cmf/agent.py), and i found this def supervisor_check(self):
"""
Check if supervisord process is still alive. If not, restart the agent; which will
attempt to restart supervisord via our normal startup path.
"""
if self.supervisor_psutil_process is None or self.supervisor_psutil_process.is_running():
return
LOG.warning("Supervisor failed (pid %s). Restarting agent.", self.supervisor_pid)
LOG.info("Restart agent command:%s", str(exec_args))
try:
os.execv(exec_args[0], exec_args)
except Exception, e:
LOG.exception("Failed agent re-exec: %s", e)
LOG.error("Agent exiting.")
sys.exit(-1) Then I knew the cloudera-scm-agent restarted is because the agent will check whether the process of supervisor is running, if it's check failed, it will restart agent. But I found that process the supervisor always running normal.So why agent check failed? The first thing is to find out how cloudera-scm-agent check.I added some code to agent.py, to see more information about supervisor when agent checked it. LIke this: def supervisor_check(self):
"""
Check if supervisord process is still alive. If not, restart the agent; which will
attempt to restart supervisord via our normal startup path.
"""
"""Here is the added code start"""
LOG.info("Supervisor process info:%s", str(self.supervisor_psutil_process.as_dict(attrs=['pid', 'name', 'username'])))
LOG.info("Supervisor process:%s", str(self.supervisor_psutil_process))
LOG.info("Supervisor process status:%s", self.supervisor_psutil_process.status())
LOG.info("Supervisor process is_running:%s", self.supervisor_psutil_process.is_running())
LOG.info("Restart agent command:%s", str(exec_args))
"""Here is the added code stop"""
if self.supervisor_psutil_process is None or self.supervisor_psutil_process.is_running():
return
LOG.warning("Supervisor failed (pid %s). Restarting agent.", self.supervisor_pid)
LOG.info("Restart agent command:%s", str(exec_args))
try:
os.execv(exec_args[0], exec_args)
except Exception, e:
LOG.exception("Failed agent re-exec: %s", e)
LOG.error("Agent exiting.")
sys.exit(-1) The I restart the cloudera-scm-agent, and check the output of cloudera-scm-agent.log. I found this So check failed is because the self.supervisor_psutil_process.is_running() return false. The I checked the psutil source code, here it is def _init(self, pid, _ignore_nsp=False):
if pid is None:
pid = os.getpid()
else:
if not _PY3 and not isinstance(pid, (int, long)):
raise TypeError('pid must be an integer (got %r)' % pid)
if pid < 0:
raise ValueError('pid must be a positive integer (got %s)'
% pid)
self._pid = pid
self._name = None
self._exe = None
self._create_time = None
self._gone = False
self._hash = None
# used for caching on Windows only (on POSIX ppid may change)
self._ppid = None
# platform-specific modules define an _psplatform.Process
# implementation class
self._proc = _psplatform.Process(pid)
self._last_sys_cpu_times = None
self._last_proc_cpu_times = None
# cache creation time for later use in is_running() method
try:
self.create_time()
except AccessDenied:
# we should never get here as AFAIK we're able to get
# process creation time on all platforms even as a
# limited user
pass
except NoSuchProcess:
if not _ignore_nsp:
msg = 'no process found with pid %s' % pid
raise NoSuchProcess(pid, None, msg)
else:
self._gone = True
# This pair is supposed to indentify a Process instance
# univocally over time (the PID alone is not enough as
# it might refer to a process whose PID has been reused).
# This will be used later in __eq__() and is_running().
self._ident = (self.pid, self._create_time)
def __eq__(self, other):
# Test for equality with another Process object based
# on PID and creation time.
if not isinstance(other, Process):
return NotImplemented
return self._ident == other._ident
def is_running(self):
"""Return whether this process is running.
It also checks if PID has been reused by another process in
which case return False.
"""
if self._gone:
return False
try:
# Checking if PID is alive is not enough as the PID might
# have been reused by another process: we also want to
# check process identity.
# Process identity / uniqueness over time is greanted by
# (PID + creation time) and that is verified in __eq__.
return self == Process(self.pid)
except NoSuchProcess:
self._gone = True
return False So the check failed is because the process._create_time is changed,but the supervisor is running normal. Then i guess it's the system time changed cause this, because we set the crontab mission to sync the system time every minute. We deleted the crontab mission, and the problem is fixed. And then we use the ntpd to sync system time, the problem didn't show again.
... View more
11-03-2017
07:06 PM
Here is the supervisor log. I don‘t know much about the supervisor. So please help me to check it. Thanks very much.
... View more
11-01-2017
02:44 AM
Here is the question, cloudera-scm-agent process will auto restart about every twenty minutes. I checked the cloudera-scm-agent.log file, I found this: [01/Nov/2017 17:09:06 +0000] 48384 Metadata-Plugin navigator_plugin INFO Refreshing Metadata Plugin for cloudera-mgmt-SERVICEMONITOR with pipelines []
[01/Nov/2017 17:09:06 +0000] 48384 Metadata-Plugin navigator_plugin_pipeline INFO Stopping Navigator Plugin Pipeline '' for cloudera-mgmt-SERVICEMONITOR (log dir: None)
[01/Nov/2017 17:09:06 +0000] 48384 Metadata-Plugin navigator_plugin INFO Refreshing Metadata Plugin for zookeeper-server with pipelines []
[01/Nov/2017 17:09:06 +0000] 48384 Metadata-Plugin navigator_plugin_pipeline INFO Stopping Navigator Plugin Pipeline '' for zookeeper-server (log dir: None)
[01/Nov/2017 17:09:06 +0000] 48384 Metadata-Plugin navigator_plugin INFO Refreshing Metadata Plugin for cloudera-mgmt-ACTIVITYMONITOR with pipelines []
[01/Nov/2017 17:09:06 +0000] 48384 Metadata-Plugin navigator_plugin_pipeline INFO Stopping Navigator Plugin Pipeline '' for cloudera-mgmt-ACTIVITYMONITOR (log dir: None)
[01/Nov/2017 17:09:06 +0000] 48384 Metadata-Plugin navigator_plugin INFO Refreshing Metadata Plugin for cloudera-mgmt-HOSTMONITOR with pipelines []
[01/Nov/2017 17:09:06 +0000] 48384 Metadata-Plugin navigator_plugin_pipeline INFO Stopping Navigator Plugin Pipeline '' for cloudera-mgmt-HOSTMONITOR (log dir: None)
[01/Nov/2017 17:09:06 +0000] 48384 Metadata-Plugin navigator_plugin INFO Refreshing Metadata Plugin for cloudera-mgmt-EVENTSERVER with pipelines []
[01/Nov/2017 17:09:06 +0000] 48384 Metadata-Plugin navigator_plugin_pipeline INFO Stopping Navigator Plugin Pipeline '' for cloudera-mgmt-EVENTSERVER (log dir: None)
[01/Nov/2017 17:09:06 +0000] 48384 Metadata-Plugin navigator_plugin INFO Refreshing Metadata Plugin for cloudera-mgmt-ALERTPUBLISHER with pipelines []
[01/Nov/2017 17:09:06 +0000] 48384 Metadata-Plugin navigator_plugin_pipeline INFO Stopping Navigator Plugin Pipeline '' for cloudera-mgmt-ALERTPUBLISHER (log dir: None)
[01/Nov/2017 17:09:22 +0000] 48384 DnsResolutionMonitor throttling_logger INFO Using java location: '/usr/java/jdk1.7.0_67-cloudera/bin/java'.
[01/Nov/2017 17:11:05 +0000] 48384 MainThread agent WARNING Supervisor failed (pid 93786). Restarting agent.
[01/Nov/2017 17:11:06 +0000] 49600 MainThread agent INFO SCM Agent Version: 5.10.0
[01/Nov/2017 17:11:06 +0000] 49600 MainThread agent INFO Agent Protocol Version: 4
[01/Nov/2017 17:11:06 +0000] 49600 MainThread agent INFO Using Host ID: 4bc2ad11-36d7-468b-9167-8ec5e0ffbe9f
[01/Nov/2017 17:11:06 +0000] 49600 MainThread agent INFO Using directory: /var/run/cloudera-scm-agent
[01/Nov/2017 17:11:06 +0000] 49600 MainThread agent INFO Using supervisor binary path: /usr/lib64/cmf/agent/build/env/bin/supervisord
[01/Nov/2017 17:11:06 +0000] 49600 MainThread agent INFO Neither verify_cert_file nor verify_cert_dir are configured. Not performing validation of server certificates in HTTPS communication. These options can be configured in this agent's config.ini file to enable certificate validation.
[01/Nov/2017 17:11:06 +0000] 49600 MainThread agent INFO Agent Logging Level: INFO
[01/Nov/2017 17:11:06 +0000] 49600 MainThread agent INFO No command line vars
[01/Nov/2017 17:11:06 +0000] 49600 MainThread agent INFO Found database jar: /usr/share/java/mysql-connector-java.jar
[01/Nov/2017 17:11:06 +0000] 49600 MainThread agent INFO Missing database jar: /usr/share/java/oracle-connector-java.jar (normal, if you're not using this database type)
[01/Nov/2017 17:11:06 +0000] 49600 MainThread agent INFO Found database jar: /usr/share/cmf/lib/postgresql-9.0-801.jdbc4.jar
[01/Nov/2017 17:11:06 +0000] 49600 MainThread agent INFO Agent starting as pid 49600 user root(0) group root(0).
[01/Nov/2017 17:11:08 +0000] 49600 MainThread agent INFO Re-using pre-existing directory: /var/run/cloudera-scm-agent/cgroups
[01/Nov/2017 17:11:08 +0000] 49600 MainThread cgroups INFO Found cgroups subsystem: cpu
[01/Nov/2017 17:11:08 +0000] 49600 MainThread cgroups INFO Found cgroups subsystem: cpuacct
[01/Nov/2017 17:11:08 +0000] 49600 MainThread cgroups INFO Found cgroups subsystem: memory
[01/Nov/2017 17:11:08 +0000] 49600 MainThread cgroups INFO Found cgroups subsystem: blkio
[01/Nov/2017 17:11:08 +0000] 49600 MainThread cgroups INFO Reusing /var/run/cloudera-scm-agent/cgroups/memory
[01/Nov/2017 17:11:08 +0000] 49600 MainThread cgroups INFO Reusing /var/run/cloudera-scm-agent/cgroups/cpu
[01/Nov/2017 17:11:08 +0000] 49600 MainThread cgroups INFO Reusing /var/run/cloudera-scm-agent/cgroups/cpuacct
[01/Nov/2017 17:11:08 +0000] 49600 MainThread cgroups INFO Reusing /var/run/cloudera-scm-agent/cgroups/blkio
[01/Nov/2017 17:11:08 +0000] 49600 MainThread agent INFO Found cgroups capabilities: {'has_memory': True, 'default_memory_limit_in_bytes': -1, 'default_memory_soft_limit_in_bytes': -1, 'writable_cgroup_dot_procs': True, 'default_cpu_rt_runtime_us': 950000, 'has_cpu': True, 'default_blkio_weight': 1000, 'default_cpu_shares': 1024, 'has_cpuacct': True, 'has_blkio': True}
[01/Nov/2017 17:11:08 +0000] 49600 MainThread agent INFO Setting up supervisord event monitor.
[01/Nov/2017 17:11:08 +0000] 49600 MainThread filesystem_map INFO Monitored nodev filesystem types: ['nfs', 'nfs4', 'tmpfs']
[01/Nov/2017 17:11:08 +0000] 49600 MainThread filesystem_map INFO Using timeout of 2.000000
[01/Nov/2017 17:11:08 +0000] 49600 MainThread filesystem_map INFO Using join timeout of 0.100000
[01/Nov/2017 17:11:08 +0000] 49600 MainThread filesystem_map INFO Using tolerance of 60.000000 That message is printed about every twenty minutes. I want to know what cause this and how to fix it. If you have any idea,please tell me. thanks very much.
... View more
Labels:
- Labels:
-
Cloudera Manager