Cloudera-scm-agent auto-restarts every twenty minutes

New Contributor

Here is the question: the cloudera-scm-agent process auto-restarts about every twenty minutes.

I checked the cloudera-scm-agent.log file and found this:

[01/Nov/2017 17:09:06 +0000] 48384 Metadata-Plugin navigator_plugin INFO     Refreshing Metadata Plugin for cloudera-mgmt-SERVICEMONITOR with pipelines []
[01/Nov/2017 17:09:06 +0000] 48384 Metadata-Plugin navigator_plugin_pipeline INFO     Stopping Navigator Plugin Pipeline '' for cloudera-mgmt-SERVICEMONITOR (log dir: None) 
[01/Nov/2017 17:09:06 +0000] 48384 Metadata-Plugin navigator_plugin INFO     Refreshing Metadata Plugin for zookeeper-server with pipelines []
[01/Nov/2017 17:09:06 +0000] 48384 Metadata-Plugin navigator_plugin_pipeline INFO     Stopping Navigator Plugin Pipeline '' for zookeeper-server (log dir: None) 
[01/Nov/2017 17:09:06 +0000] 48384 Metadata-Plugin navigator_plugin INFO     Refreshing Metadata Plugin for cloudera-mgmt-ACTIVITYMONITOR with pipelines []
[01/Nov/2017 17:09:06 +0000] 48384 Metadata-Plugin navigator_plugin_pipeline INFO     Stopping Navigator Plugin Pipeline '' for cloudera-mgmt-ACTIVITYMONITOR (log dir: None) 
[01/Nov/2017 17:09:06 +0000] 48384 Metadata-Plugin navigator_plugin INFO     Refreshing Metadata Plugin for cloudera-mgmt-HOSTMONITOR with pipelines []
[01/Nov/2017 17:09:06 +0000] 48384 Metadata-Plugin navigator_plugin_pipeline INFO     Stopping Navigator Plugin Pipeline '' for cloudera-mgmt-HOSTMONITOR (log dir: None) 
[01/Nov/2017 17:09:06 +0000] 48384 Metadata-Plugin navigator_plugin INFO     Refreshing Metadata Plugin for cloudera-mgmt-EVENTSERVER with pipelines []
[01/Nov/2017 17:09:06 +0000] 48384 Metadata-Plugin navigator_plugin_pipeline INFO     Stopping Navigator Plugin Pipeline '' for cloudera-mgmt-EVENTSERVER (log dir: None) 
[01/Nov/2017 17:09:06 +0000] 48384 Metadata-Plugin navigator_plugin INFO     Refreshing Metadata Plugin for cloudera-mgmt-ALERTPUBLISHER with pipelines []
[01/Nov/2017 17:09:06 +0000] 48384 Metadata-Plugin navigator_plugin_pipeline INFO     Stopping Navigator Plugin Pipeline '' for cloudera-mgmt-ALERTPUBLISHER (log dir: None) 
[01/Nov/2017 17:09:22 +0000] 48384 DnsResolutionMonitor throttling_logger INFO     Using java location: '/usr/java/jdk1.7.0_67-cloudera/bin/java'.
[01/Nov/2017 17:11:05 +0000] 48384 MainThread agent        WARNING  Supervisor failed (pid 93786).  Restarting agent.
[01/Nov/2017 17:11:06 +0000] 49600 MainThread agent        INFO     SCM Agent Version: 5.10.0
[01/Nov/2017 17:11:06 +0000] 49600 MainThread agent        INFO     Agent Protocol Version: 4
[01/Nov/2017 17:11:06 +0000] 49600 MainThread agent        INFO     Using Host ID: 4bc2ad11-36d7-468b-9167-8ec5e0ffbe9f
[01/Nov/2017 17:11:06 +0000] 49600 MainThread agent        INFO     Using directory: /var/run/cloudera-scm-agent
[01/Nov/2017 17:11:06 +0000] 49600 MainThread agent        INFO     Using supervisor binary path: /usr/lib64/cmf/agent/build/env/bin/supervisord
[01/Nov/2017 17:11:06 +0000] 49600 MainThread agent        INFO     Neither verify_cert_file nor verify_cert_dir are configured. Not performing validation of server certificates in HTTPS communication. These options can be configured in this agent's config.ini file to enable certificate validation. 
[01/Nov/2017 17:11:06 +0000] 49600 MainThread agent        INFO     Agent Logging Level: INFO
[01/Nov/2017 17:11:06 +0000] 49600 MainThread agent        INFO     No command line vars
[01/Nov/2017 17:11:06 +0000] 49600 MainThread agent        INFO     Found database jar: /usr/share/java/mysql-connector-java.jar
[01/Nov/2017 17:11:06 +0000] 49600 MainThread agent        INFO     Missing database jar: /usr/share/java/oracle-connector-java.jar (normal, if you're not using this database type) 
[01/Nov/2017 17:11:06 +0000] 49600 MainThread agent        INFO     Found database jar: /usr/share/cmf/lib/postgresql-9.0-801.jdbc4.jar
[01/Nov/2017 17:11:06 +0000] 49600 MainThread agent        INFO     Agent starting as pid 49600 user root(0) group root(0).
[01/Nov/2017 17:11:08 +0000] 49600 MainThread agent        INFO     Re-using pre-existing directory: /var/run/cloudera-scm-agent/cgroups
[01/Nov/2017 17:11:08 +0000] 49600 MainThread cgroups      INFO     Found cgroups subsystem: cpu
[01/Nov/2017 17:11:08 +0000] 49600 MainThread cgroups      INFO     Found cgroups subsystem: cpuacct
[01/Nov/2017 17:11:08 +0000] 49600 MainThread cgroups      INFO     Found cgroups subsystem: memory
[01/Nov/2017 17:11:08 +0000] 49600 MainThread cgroups      INFO     Found cgroups subsystem: blkio 
[01/Nov/2017 17:11:08 +0000] 49600 MainThread cgroups      INFO     Reusing /var/run/cloudera-scm-agent/cgroups/memory
[01/Nov/2017 17:11:08 +0000] 49600 MainThread cgroups      INFO     Reusing /var/run/cloudera-scm-agent/cgroups/cpu
[01/Nov/2017 17:11:08 +0000] 49600 MainThread cgroups      INFO     Reusing /var/run/cloudera-scm-agent/cgroups/cpuacct
[01/Nov/2017 17:11:08 +0000] 49600 MainThread cgroups      INFO     Reusing /var/run/cloudera-scm-agent/cgroups/blkio
[01/Nov/2017 17:11:08 +0000] 49600 MainThread agent        INFO     Found cgroups capabilities: {'has_memory': True, 'default_memory_limit_in_bytes': -1, 'default_memory_soft_limit_in_bytes': -1, 'writable_cgroup_dot_procs': True, 'default_cpu_rt_runtime_us': 950000, 'has_cpu': True, 'default_blkio_weight': 1000, 'default_cpu_shares': 1024, 'has_cpuacct': True, 'has_blkio': True}
[01/Nov/2017 17:11:08 +0000] 49600 MainThread agent        INFO     Setting up supervisord event monitor.
[01/Nov/2017 17:11:08 +0000] 49600 MainThread filesystem_map INFO     Monitored nodev filesystem types: ['nfs', 'nfs4', 'tmpfs']
[01/Nov/2017 17:11:08 +0000] 49600 MainThread filesystem_map INFO     Using timeout of 2.000000
[01/Nov/2017 17:11:08 +0000] 49600 MainThread filesystem_map INFO     Using join timeout of 0.100000
[01/Nov/2017 17:11:08 +0000] 49600 MainThread filesystem_map INFO     Using tolerance of 60.000000

That message is printed about every twenty minutes.

I want to know what causes this and how to fix it.

If you have any ideas, please tell me.

Thanks very much.
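
For reference, a small script along these lines (just a sketch; the log path and timestamp format are taken from the excerpt above, adjust them for your host) can confirm how regular the restarts are:

from datetime import datetime
import re

# Sketch: collect the timestamps of the "Restarting agent." warnings from the
# agent log and print the gap between consecutive restarts.
LOG_PATH = "/var/log/cloudera-scm-agent/cloudera-scm-agent.log"
STAMP = re.compile(r"\[(\d{2}/\w{3}/\d{4} \d{2}:\d{2}:\d{2})")

restarts = []
with open(LOG_PATH) as fh:
    for line in fh:
        if "Restarting agent" in line:
            match = STAMP.search(line)
            if match:
                restarts.append(
                    datetime.strptime(match.group(1), "%d/%b/%Y %H:%M:%S"))

for previous, current in zip(restarts, restarts[1:]):
    print("%s -> %s (gap: %s)" % (previous, current, current - previous))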


5 REPLIES

Super Collaborator

Did you check the supervisor log?

New Contributor

(attached screenshot: supervisor_log.png)

Here is the supervisor log. I don't know much about supervisor, so please help me check it. Thanks very much.

Super Collaborator

Is the log shown correlated to an observed restart of the agent?

If yes, I would investigate this "flood" service that seems to restart constantly.

A possible cause of a never-ending restart loop: out of memory > the agent kills the service > the agent restarts the service > out of memory > repeat.
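
A quick way to check for this on the host (a rough sketch; dmesg wording varies by kernel and distribution) is to scan the kernel log for oom-killer events around the restart times:

import subprocess

# Sketch: look for oom-killer activity in the kernel ring buffer. If the
# agent's supervised services were being OOM-killed, "Out of memory" /
# "oom-killer" lines would show up here.
dmesg = subprocess.check_output(["dmesg"]).decode("utf-8", "replace")
for line in dmesg.splitlines():
    if "oom-killer" in line or "Out of memory" in line:
        print(line)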

 

regards,

Mathieu

 

 

New Contributor

Thank you very much for your reply.

I checked all the processes, but I didn't find anything wrong.

Finally, I checked the source code of the cloudera-scm-agent (/usr/lib64/cmf/agent/build/env/lib/python2.6/site-packages/cmf-5.10.0-py2.6.egg/cmf/agent.py), and I found this:

 

 

  def supervisor_check(self):
    """
    Check if supervisord process is still alive.  If not, restart the agent; which will
    attempt to restart supervisord via our normal startup path.
    """
    if self.supervisor_psutil_process is None or self.supervisor_psutil_process.is_running():
      return

    LOG.warning("Supervisor failed (pid %s).  Restarting agent.", self.supervisor_pid)
    LOG.info("Restart agent command:%s", str(exec_args))
    try:
      os.execv(exec_args[0], exec_args)
    except Exception, e:
      LOG.exception("Failed agent re-exec: %s", e)
      LOG.error("Agent exiting.")
      sys.exit(-1)

Then I knew the cloudera-scm-agent restarts because the agent checks whether the supervisord process is still running; if that check fails, it restarts the agent.

But I found that the supervisord process was always running normally. So why did the agent's check fail? The first thing was to find out how cloudera-scm-agent performs the check, so I added some code to agent.py to log more information about supervisord at check time, like this:

  def supervisor_check(self):
    """
    Check if supervisord process is still alive.  If not, restart the agent; which will
    attempt to restart supervisord via our normal startup path.
    """

    """Here is the added code start"""
    LOG.info("Supervisor process info:%s", str(self.supervisor_psutil_process.as_dict(attrs=['pid', 'name', 'username'])))
    LOG.info("Supervisor process:%s", str(self.supervisor_psutil_process))
    LOG.info("Supervisor process status:%s", self.supervisor_psutil_process.status())
    LOG.info("Supervisor process is_running:%s", self.supervisor_psutil_process.is_running())
    LOG.info("Restart agent command:%s", str(exec_args))
    """Here is the added code stop"""
    if self.supervisor_psutil_process is None or self.supervisor_psutil_process.is_running():
      return

    LOG.warning("Supervisor failed (pid %s).  Restarting agent.", self.supervisor_pid)
    LOG.info("Restart agent command:%s", str(exec_args))
    try:
      os.execv(exec_args[0], exec_args)
    except Exception, e:
      LOG.exception("Failed agent re-exec: %s", e)
      LOG.error("Agent exiting.")
      sys.exit(-1)

Then I restarted the cloudera-scm-agent and checked the output of cloudera-scm-agent.log.

I found this:

(attached screenshot: angent_restart.png)

So the check failed because

self.supervisor_psutil_process.is_running()

returned False. Then I checked the psutil source code; here it is:

    def _init(self, pid, _ignore_nsp=False):
        if pid is None:
            pid = os.getpid()
        else:
            if not _PY3 and not isinstance(pid, (int, long)):
                raise TypeError('pid must be an integer (got %r)' % pid)
            if pid < 0:
                raise ValueError('pid must be a positive integer (got %s)'
                                 % pid)
        self._pid = pid
        self._name = None
        self._exe = None
        self._create_time = None
        self._gone = False
        self._hash = None
        # used for caching on Windows only (on POSIX ppid may change)
        self._ppid = None
        # platform-specific modules define an _psplatform.Process
        # implementation class
        self._proc = _psplatform.Process(pid)
        self._last_sys_cpu_times = None
        self._last_proc_cpu_times = None
        # cache creation time for later use in is_running() method
        try:
            self.create_time()
        except AccessDenied:
            # we should never get here as AFAIK we're able to get
            # process creation time on all platforms even as a
            # limited user
            pass
        except NoSuchProcess:
            if not _ignore_nsp:
                msg = 'no process found with pid %s' % pid
                raise NoSuchProcess(pid, None, msg)
            else:
                self._gone = True
        # This pair is supposed to identify a Process instance
        # univocally over time (the PID alone is not enough as
        # it might refer to a process whose PID has been reused).
        # This will be used later in __eq__() and is_running().
        self._ident = (self.pid, self._create_time)
 
    def __eq__(self, other):
        # Test for equality with another Process object based
        # on PID and creation time.
        if not isinstance(other, Process):
            return NotImplemented
        return self._ident == other._ident

    def is_running(self):
        """Return whether this process is running.
        It also checks if PID has been reused by another process in
        which case return False.
        """
        if self._gone:
            return False
        try:
            # Checking if PID is alive is not enough as the PID might
            # have been reused by another process: we also want to
            # check process identity.
            # Process identity / uniqueness over time is guaranteed by
            # (PID + creation time) and that is verified in __eq__.
            return self == Process(self.pid)
        except NoSuchProcess:
            self._gone = True
            return False
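
The key detail is that psutil ties a process's identity to the (pid, create_time) pair. Here is a minimal sketch of that same check (assuming a reasonably recent psutil is installed; on Linux, create_time is derived from the boot time in /proc/stat, so it can shift if the wall clock is stepped):

import os
import psutil

# Cache the identity pair, roughly the way the agent does when it builds its
# Process handle for supervisord.
proc = psutil.Process(os.getpid())
cached_ident = (proc.pid, proc.create_time())

# is_running() builds a fresh Process for the same pid and compares the
# cached pair against the freshly computed one; while the clock is stable
# this prints True True.
fresh = psutil.Process(proc.pid)
print(proc.is_running(), cached_ident == (fresh.pid, fresh.create_time()))

# If the wall clock is stepped between the two reads (for example by a cron
# job that runs ntpdate every minute), the recomputed create_time can differ,
# the pair no longer matches, and is_running() returns False even though the
# process never exited.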

So the check failed because the process's _create_time had changed, even though supervisord was running normally.

Then I guessed that a change in the system time caused this, because we had a crontab job syncing the system time every minute. We deleted the crontab job and the problem was fixed. We then switched to ntpd to sync the system time, and the problem hasn't shown up again.
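
For anyone who wants to confirm whether the clock is being stepped (as a cron'd ntpdate does) rather than slewed by ntpd, a rough check is to compare the wall clock against the monotonic clock over an interval (a sketch; it needs Python 3 for time.monotonic, and the one-second threshold is arbitrary):

import time

# Sketch: a stepped wall clock shows up as a jump relative to the monotonic
# clock, while a slewing ntpd keeps the two in step.
wall_start, mono_start = time.time(), time.monotonic()
time.sleep(300)  # watch for five minutes
drift = (time.time() - wall_start) - (time.monotonic() - mono_start)
if abs(drift) > 1.0:
    print("wall clock was stepped by about %.1f seconds" % drift)
else:
    print("no significant clock step observed (drift %.3f s)" % drift)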

Super Collaborator

Great!

Nice debug