
Failed to connect to previous supervisor

Explorer

I chose to install CDH through the automated installer in Cloudera Manager. The download completes, but the installation cannot proceed because of this error:

 

Installation failed. Failed to receive heartbeat from agent.
Ensure that the host's hostname is configured properly.
Ensure that port 7182 is accessible on the Cloudera Manager Server (check firewall rules).
Ensure that ports 9000 and 9001 are not in use on the host being added.
Check agent logs in /var/log/cloudera-scm-agent/ on the host being added. (Some of the logs can be found in the installation details).
If Use TLS Encryption for Agents is enabled in Cloudera Manager (Administration -> Settings -> Security), ensure that /etc/cloudera-scm-agent/config.ini has use_tls=1 on the host being added. Restart the corresponding agent and click the Retry link here.
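The port checks in the list above can be scripted. The following is only a minimal sketch, assuming Python 3 is available on the host being added; the helper names `port_in_use` and `can_connect` are hypothetical and not part of Cloudera Manager.

```python
import socket


def port_in_use(port, host="127.0.0.1"):
    """Return True if something is already listening on host:port.

    Assumption: checking the loopback address is enough; a listener bound
    only to another interface would not be detected by this sketch.
    """
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        sock.settimeout(1.0)
        # connect_ex returns 0 when the connection succeeds, i.e. a listener exists.
        return sock.connect_ex((host, port)) == 0


def can_connect(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port can be opened."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and name-resolution failures.
        return False
```

On the host being added, `port_in_use(9000)` and `port_in_use(9001)` should both be False, and `can_connect(<Cloudera Manager Server host>, 7182)` should be True.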

However, when I checked the details, I saw "Failed to connect to previous supervisor" in the error details:

Installation script completed successfully.
all done
closing logging file descriptor
>>[18/Jul/2017 01:34:30 +0000] 4355 MainThread agent INFO Re-using pre-existing directory: /run/cloudera-scm-agent/supervisor
>>[18/Jul/2017 01:34:30 +0000] 4355 MainThread agent INFO Re-using pre-existing directory: /run/cloudera-scm-agent/flood
>>[18/Jul/2017 01:34:30 +0000] 4355 MainThread agent INFO Re-using pre-existing directory: /run/cloudera-scm-agent/supervisor/include
>>[18/Jul/2017 01:34:30 +0000] 4355 MainThread agent ERROR Failed to connect to previous supervisor.
>>Traceback (most recent call last):
>> File "/usr/lib64/cmf/agent/build/env/lib/python2.7/site-packages/cmf-5.12.0-py2.7.egg/cmf/agent.py", line 2109, in find_or_start_supervisor
>> self.configure_supervisor_clients()
>> File "/usr/lib64/cmf/agent/build/env/lib/python2.7/site-packages/cmf-5.12.0-py2.7.egg/cmf/agent.py", line 2290, in configure_supervisor_clients
>> supervisor_options.realize(args=["-c", os.path.join(self.supervisor_dir, "supervisord.conf")])
>> File "/usr/lib64/cmf/agent/build/env/lib/python2.7/site-packages/supervisor-3.0-py2.7.egg/supervisor/options.py", line 1599, in realize
>> Options.realize(self, *arg, **kw)
>> File "/usr/lib64/cmf/agent/build/env/lib/python2.7/site-packages/supervisor-3.0-py2.7.egg/supervisor/options.py", line 333, in realize
>> self.process_config()
>> File "/usr/lib64/cmf/agent/build/env/lib/python2.7/site-packages/supervisor-3.0-py2.7.egg/supervisor/options.py", line 341, in process_config
>> self.process_config_file(do_usage)
>> File "/usr/lib64/cmf/agent/build/env/lib/python2.7/site-packages/supervisor-3.0-py2.7.egg/supervisor/options.py", line 376, in process_config_file
>> self.usage(str(msg))
>> File "/usr/lib64/cmf/agent/build/env/lib/python2.7/site-packages/supervisor-3.0-py2.7.egg/supervisor/options.py", line 164, in usage
>> self.exit(2)
>>SystemExit: 2
>>[18/Jul/2017 01:34:30 +0000] 4355 Dummy-1 daemonize WARNING Stopping daemon.
>>[18/Jul/2017 01:34:30 +0000] 4355 Dummy-1 agent INFO Stopping agent...
>>[18/Jul/2017 01:34:30 +0000] 4355 Dummy-1 agent INFO No extant cgroups; unmounting any cgroup roots
>>[18/Jul/2017 01:39:14 +0000] 5611 MainThread agent INFO SCM Agent Version: 5.12.0
>>[18/Jul/2017 01:39:14 +0000] 5611 MainThread agent INFO Agent Protocol Version: 4
>>[18/Jul/2017 01:39:14 +0000] 5611 MainThread agent INFO Using Host ID: b9e306ab-b527-4667-9f3e-b6acad9f5224
>>[18/Jul/2017 01:39:14 +0000] 5611 MainThread agent INFO Using directory: /run/cloudera-scm-agent
>>[18/Jul/2017 01:39:14 +0000] 5611 MainThread agent INFO Using supervisor binary path: /usr/lib64/cmf/agent/build/env/bin/supervisord
>>[18/Jul/2017 01:39:14 +0000] 5611 MainThread agent INFO Neither verify_cert_file nor verify_cert_dir are configured. Not performing validation of server certificates in HTTPS communication. These options can be configured in this agent's config.ini file to enable certificate validation.
>>[18/Jul/2017 01:39:14 +0000] 5611 MainThread agent INFO Agent Logging Level: INFO
>>[18/Jul/2017 01:39:14 +0000] 5611 MainThread agent INFO No command line vars
>>[18/Jul/2017 01:39:14 +0000] 5611 MainThread agent INFO Missing database jar: /usr/share/java/mysql-connector-java.jar (normal, if you're not using this database type)
>>[18/Jul/2017 01:39:14 +0000] 5611 MainThread agent INFO Missing database jar: /usr/share/java/oracle-connector-java.jar (normal, if you're not using this database type)
>>[18/Jul/2017 01:39:14 +0000] 5611 MainThread agent INFO Found database jar: /usr/share/cmf/lib/postgresql-9.0-801.jdbc4.jar
>>[18/Jul/2017 01:39:14 +0000] 5611 MainThread agent INFO Agent starting as pid 5611 user root(0) group root(0).


This is my current setup.

CentOS 7.2
Installing CDH 5.11.1 or 5.12 using Cloudera Manager.
4 nodes

/etc/hosts

192.168.0.101 node1.cirro.com node1
192.168.0.102 node2.cirro.com node2
192.168.0.103 node3.cirro.com node3
192.168.0.104 node4.cirro.com node4

/etc/sysconfig/network

NETWORKING=yes
HOSTNAME=myservers*.cirro.com
NOZEROCONF=yes

/etc/ssh/sshd_config

PermitRootLogin yes
PasswordAuthentication yes

hostname has also been set per node to reflect /etc/sysconfig/network.

sestatus = disabled
firewalld = inactive
ntpd = active (running)
httpd = active (running)
vm.swappiness = 10
user = passwordless sudo
/etc/rc.local has been set

Can anyone help me with this? I've been stuck on it for 2 weeks now. I've run out of options after searching online. Any help would be really appreciated!

1 ACCEPTED SOLUTION

Explorer

This is what I did instead: I followed Path B and downloaded version 5.11.1. That solved all of my problems.


29 REPLIES

Explorer

I am encountering the same problem: the agent frequently goes down and gives this error:

Failed to connect to newly launched supervisor. Agent will exit

 

When I checked for running supervisor processes with the command you mentioned, I found the following:

 

root 17693 0.0 0.0 58344 14988 ? Ss 2017 1:46 /usr/lib/cmf/agent/build/env/bin/python /usr/lib/cmf/agent/build/env/bin/supervisord
root 30637 0.0 0.0 10472 936 pts/13 R+ 19:43 0:00 grep --color=auto supervisor

 

Could you help me further?

Master Guru

@Amir,

 

Please provide us with your agent log showing the error.  There are many reasons why the agent would not be able to connect to the supervisor, so we need to see the agent log information to determine what the cause may be.

 

Thanks,

 

Ben

Expert Contributor

Here is the log of an agent that fails to start during a Cloudera 5.15 install on Ubuntu 16.04 on WSL:

 Installation failed. Failed to receive heartbeat from agent. Agent log below:

 

[28/Jun/2018 13:06:08 +0000] 2849 MainThread agent ERROR Failed to connect to previous supervisor.
Traceback (most recent call last):
File "/usr/lib/cmf/agent/build/env/lib/python2.7/site-packages/cmf-5.15.0-py2.7.egg/cmf/agent.py", line 2136, in find_or_start_supervisor
self.configure_supervisor_clients()
File "/usr/lib/cmf/agent/build/env/lib/python2.7/site-packages/cmf-5.15.0-py2.7.egg/cmf/agent.py", line 2317, in configure_supervisor_clients
supervisor_options.realize(args=["-c", os.path.join(self.supervisor_dir, "supervisord.conf")])
File "/usr/lib/cmf/agent/build/env/lib/python2.7/site-packages/supervisor-3.0-py2.7.egg/supervisor/options.py", line 1599, in realize
Options.realize(self, *arg, **kw)
File "/usr/lib/cmf/agent/build/env/lib/python2.7/site-packages/supervisor-3.0-py2.7.egg/supervisor/options.py", line 333, in realize
self.process_config()
File "/usr/lib/cmf/agent/build/env/lib/python2.7/site-packages/supervisor-3.0-py2.7.egg/supervisor/options.py", line 341, in process_config
self.process_config_file(do_usage)
File "/usr/lib/cmf/agent/build/env/lib/python2.7/site-packages/supervisor-3.0-py2.7.egg/supervisor/options.py", line 376, in process_config_file
self.usage(str(msg))
File "/usr/lib/cmf/agent/build/env/lib/python2.7/site-packages/supervisor-3.0-py2.7.egg/supervisor/options.py", line 164, in usage
self.exit(2)
SystemExit: 2
[28/Jun/2018 13:06:08 +0000] 2849 Dummy-1 daemonize WARNING Stopping daemon.
[28/Jun/2018 13:06:08 +0000] 2849 Dummy-1 agent INFO Stopping agent...
[28/Jun/2018 13:06:08 +0000] 2849 Dummy-1 agent INFO No extant cgroups; unmounting any cgroup roots
[28/Jun/2018 13:09:50 +0000] 4154 MainThread __init__ INFO Agent UUID file was last modified at 2018-06-27 17:12:05.557885
[28/Jun/2018 13:09:50 +0000] 4154 MainThread agent INFO ================================================================================
[28/Jun/2018 13:09:50 +0000] 4154 MainThread agent INFO SCM Agent Version: 5.15.0
[28/Jun/2018 13:09:50 +0000] 4154 MainThread agent INFO Agent Protocol Version: 4
[28/Jun/2018 13:09:50 +0000] 4154 MainThread agent INFO Using Host ID: e018bb4f-e9a3-4855-b906-1f8a9aeb82d8
[28/Jun/2018 13:09:50 +0000] 4154 MainThread agent INFO Using directory: /run/cloudera-scm-agent
[28/Jun/2018 13:09:50 +0000] 4154 MainThread agent INFO Using supervisor binary path: /usr/lib/cmf/agent/build/env/bin/supervisord
[28/Jun/2018 13:09:50 +0000] 4154 MainThread agent INFO Neither verify_cert_file nor verify_cert_dir are configured. Not performing validation of server certificates in HTTPS communication. These options can be configured in this agent's config.ini file to enable certificate validation.
[28/Jun/2018 13:09:50 +0000] 4154 MainThread agent INFO Agent Logging Level: INFO
[28/Jun/2018 13:09:50 +0000] 4154 MainThread agent INFO No command line vars
[28/Jun/2018 13:09:50 +0000] 4154 MainThread agent INFO Missing database jar: /usr/share/java/mysql-connector-java.jar (normal, if you're not using this database type)
[28/Jun/2018 13:09:50 +0000] 4154 MainThread agent INFO Missing database jar: /usr/share/java/oracle-connector-java.jar (normal, if you're not using this database type)
[28/Jun/2018 13:09:50 +0000] 4154 MainThread agent INFO Found database jar: /usr/share/cmf/lib/postgresql-42.1.4.jre7.jar
[28/Jun/2018 13:09:50 +0000] 4154 MainThread agent INFO Agent starting as pid 4154 user root(0) group root(0).
[28/Jun/2018 13:09:50 +0000] 4154 MainThread agent INFO Re-using pre-existing directory: /run/cloudera-scm-agent/cgroups
[28/Jun/2018 13:09:50 +0000] 4154 MainThread agent INFO Found cgroups capabilities: {'has_memory': False, 'default_memory_limit_in_bytes': -1, 'default_memory_soft_limit_in_bytes': -1, 'writable_cgroup_dot_procs': False, 'default_cpu_rt_runtime_us': -1, 'has_cpu': False, 'default_blkio_weight': -1, 'default_cpu_shares': -1, 'has_cpuacct': False, 'has_blkio': False}
[28/Jun/2018 13:09:50 +0000] 4154 MainThread agent INFO Setting up supervisord event monitor.
[28/Jun/2018 13:09:50 +0000] 4154 MainThread filesystem_map INFO Monitored nodev filesystem types: ['nfs', 'nfs4', 'tmpfs']
[28/Jun/2018 13:09:50 +0000] 4154 MainThread filesystem_map INFO Using timeout of 2.000000
[28/Jun/2018 13:09:50 +0000] 4154 MainThread filesystem_map INFO Using join timeout of 0.100000
[28/Jun/2018 13:09:50 +0000] 4154 MainThread filesystem_map INFO Using tolerance of 60.000000
[28/Jun/2018 13:09:50 +0000] 4154 MainThread filesystem_map INFO Local filesystem types whitelist: ['ext2', 'ext3', 'ext4', 'xfs']
[28/Jun/2018 13:09:50 +0000] 4154 MainThread filesystem_map ERROR Error reading partition info
Traceback (most recent call last):
File "/usr/lib/cmf/agent/build/env/lib/python2.7/site-packages/cmf-5.15.0-py2.7.egg/cmf/monitor/host/filesystem_map.py", line 92, in refresh
for p in self.get_all_mounted_partitions():
File "/usr/lib/cmf/agent/build/env/lib/python2.7/site-packages/cmf-5.15.0-py2.7.egg/cmf/monitor/host/filesystem_map.py", line 124, in get_all_mounted_partitions
return psutil.disk_partitions(all=True)
File "/usr/lib/cmf/agent/build/env/lib/python2.7/site-packages/psutil-2.1.3-py2.7-linux-x86_64.egg/psutil/__init__.py", line 1705, in disk_partitions
return _psplatform.disk_partitions(all)
File "/usr/lib/cmf/agent/build/env/lib/python2.7/site-packages/psutil-2.1.3-py2.7-linux-x86_64.egg/psutil/_pslinux.py", line 712, in disk_partitions
partitions = cext.disk_partitions()
OSError: [Errno 2] No such file or directory: '/etc/mtab'
[28/Jun/2018 13:09:50 +0000] 4154 MainThread kt_renewer INFO Agent wide credential cache set to /run/cloudera-scm-agent/krb5cc_cm_agent_0
[28/Jun/2018 13:09:50 +0000] 4154 MainThread agent INFO Using metrics_url_timeout_seconds of 30.000000
[28/Jun/2018 13:09:50 +0000] 4154 MainThread agent INFO Using task_metrics_timeout_seconds of 5.000000
[28/Jun/2018 13:09:50 +0000] 4154 MainThread agent INFO Using max_collection_wait_seconds of 10.000000
[28/Jun/2018 13:09:50 +0000] 4154 MainThread metrics INFO Importing tasktracker metric schema from file /usr/lib/cmf/agent/build/env/lib/python2.7/site-packages/cmf-5.15.0-py2.7.egg/cmf/monitor/tasktracker/schema.json
[28/Jun/2018 13:09:50 +0000] 4154 MainThread disk_devices WARNING Unable to read disk devices to monitor; disk metrics will not be collected: [Errno 2] No such file or directory: '/proc/partitions'
[28/Jun/2018 13:09:50 +0000] 4154 MainThread disk_devices WARNING Unable to read disk device statistics for disk monitoring: [Errno 2] No such file or directory: '/proc/diskstats'
[28/Jun/2018 13:09:50 +0000] 4154 MainThread throttling_logger WARNING File '/proc/sys/kernel/random/entropy_avail' couldn't be opened for entropy count collection, error=2
[28/Jun/2018 13:09:50 +0000] 4154 MainThread ntp_monitor INFO Using timeout of 2.000000
[28/Jun/2018 13:09:50 +0000] 4154 MainThread dns_names INFO Using timeout of 30.000000
[28/Jun/2018 13:09:50 +0000] 4154 MainThread __init__ INFO Created DNS monitor.
[28/Jun/2018 13:09:50 +0000] 4154 MainThread stacks_collection_manager INFO Using max_uncompressed_file_size_bytes: 5242880
[28/Jun/2018 13:09:50 +0000] 4154 MainThread __init__ INFO Importing metric schema from file /usr/lib/cmf/agent/build/env/lib/python2.7/site-packages/cmf-5.15.0-py2.7.egg/cmf/monitor/schema.json
[28/Jun/2018 13:09:51 +0000] 4154 MainThread agent INFO Supervised processes will add the following to their environment (in addition to the supervisor's env): {'CDH_PARQUET_HOME': '/usr/lib/parquet', 'JSVC_HOME': '/usr/libexec/bigtop-utils', 'CMF_PACKAGE_DIR': '/usr/lib/cmf/service', 'CDH_HADOOP_BIN': '/usr/bin/hadoop', 'MGMT_HOME': '/usr/share/cmf', 'CDH_IMPALA_HOME': '/usr/lib/impala', 'CDH_YARN_HOME': '/usr/lib/hadoop-yarn', 'CDH_HDFS_HOME': '/usr/lib/hadoop-hdfs', 'PATH': '/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/usr/games:/usr/local/games', 'CDH_HUE_PLUGINS_HOME': '/usr/lib/hadoop', 'CM_STATUS_CODES': u'STATUS_NONE HDFS_DFS_DIR_NOT_EMPTY HBASE_TABLE_DISABLED HBASE_TABLE_ENABLED JOBTRACKER_IN_STANDBY_MODE YARN_RM_IN_STANDBY_MODE', 'KEYTRUSTEE_KP_HOME': '/usr/share/keytrustee-keyprovider', 'CLOUDERA_ORACLE_CONNECTOR_JAR': '/usr/share/java/oracle-connector-java.jar', 'CDH_SQOOP2_HOME': '/usr/lib/sqoop2', 'KEYTRUSTEE_SERVER_HOME': '/usr/lib/keytrustee-server', 'CDH_MR2_HOME': '/usr/lib/hadoop-mapreduce', 'HIVE_DEFAULT_XML': '/etc/hive/conf.dist/hive-default.xml', 'CLOUDERA_POSTGRESQL_JDBC_JAR': '/usr/share/cmf/lib/postgresql-42.1.4.jre7.jar', 'CDH_KMS_HOME': '/usr/lib/hadoop-kms', 'CDH_HBASE_HOME': '/usr/lib/hbase', 'CDH_SQOOP_HOME': '/usr/lib/sqoop', 'WEBHCAT_DEFAULT_XML': '/etc/hive-webhcat/conf.dist/webhcat-default.xml', 'CDH_OOZIE_HOME': '/usr/lib/oozie', 'CDH_ZOOKEEPER_HOME': '/usr/lib/zookeeper', 'CDH_HUE_HOME': '/usr/lib/hue', 'CLOUDERA_MYSQL_CONNECTOR_JAR': '/usr/share/java/mysql-connector-java.jar', 'CDH_HBASE_INDEXER_HOME': '/usr/lib/hbase-solr', 'CDH_MR1_HOME': '/usr/lib/hadoop-0.20-mapreduce', 'CDH_SOLR_HOME': '/usr/lib/solr', 'CDH_PIG_HOME': '/usr/lib/pig', 'CDH_SENTRY_HOME': '/usr/lib/sentry', 'CDH_CRUNCH_HOME': '/usr/lib/crunch', 'CDH_LLAMA_HOME': '/usr/lib/llama/', 'CDH_HTTPFS_HOME': '/usr/lib/hadoop-httpfs', 'CDH_HADOOP_HOME': '/usr/lib/hadoop', 'CDH_HIVE_HOME': '/usr/lib/hive', 'ORACLE_HOME': '/usr/share/oracle/instantclient', 'CDH_HCAT_HOME': '/usr/lib/hive-hcatalog', 'CDH_KAFKA_HOME': '/usr/lib/kafka', 'CDH_SPARK_HOME': '/usr/lib/spark', 'TOMCAT_HOME': '/usr/lib/bigtop-tomcat', 'CDH_FLUME_HOME': '/usr/lib/flume-ng'}
[28/Jun/2018 13:09:51 +0000] 4154 MainThread agent INFO To override these variables, use /etc/cloudera-scm-agent/config.ini. Environment variables for CDH locations are not used when CDH is installed from parcels.
[28/Jun/2018 13:09:51 +0000] 4154 MainThread agent INFO Re-using pre-existing directory: /run/cloudera-scm-agent/process
[28/Jun/2018 13:09:51 +0000] 4154 MainThread agent INFO Re-using pre-existing directory: /run/cloudera-scm-agent/supervisor
[28/Jun/2018 13:09:51 +0000] 4154 MainThread agent INFO Re-using pre-existing directory: /run/cloudera-scm-agent/flood
[28/Jun/2018 13:09:51 +0000] 4154 MainThread agent INFO Re-using pre-existing directory: /run/cloudera-scm-agent/supervisor/include
[28/Jun/2018 13:09:51 +0000] 4154 MainThread agent ERROR Failed to connect to previous supervisor.
Traceback (most recent call last):
File "/usr/lib/cmf/agent/build/env/lib/python2.7/site-packages/cmf-5.15.0-py2.7.egg/cmf/agent.py", line 2136, in find_or_start_supervisor
self.configure_supervisor_clients()
File "/usr/lib/cmf/agent/build/env/lib/python2.7/site-packages/cmf-5.15.0-py2.7.egg/cmf/agent.py", line 2317, in configure_supervisor_clients
supervisor_options.realize(args=["-c", os.path.join(self.supervisor_dir, "supervisord.conf")])
File "/usr/lib/cmf/agent/build/env/lib/python2.7/site-packages/supervisor-3.0-py2.7.egg/supervisor/options.py", line 1599, in realize
Options.realize(self, *arg, **kw)
File "/usr/lib/cmf/agent/build/env/lib/python2.7/site-packages/supervisor-3.0-py2.7.egg/supervisor/options.py", line 333, in realize
self.process_config()
File "/usr/lib/cmf/agent/build/env/lib/python2.7/site-packages/supervisor-3.0-py2.7.egg/supervisor/options.py", line 341, in process_config
self.process_config_file(do_usage)
File "/usr/lib/cmf/agent/build/env/lib/python2.7/site-packages/supervisor-3.0-py2.7.egg/supervisor/options.py", line 376, in process_config_file
self.usage(str(msg))
File "/usr/lib/cmf/agent/build/env/lib/python2.7/site-packages/supervisor-3.0-py2.7.egg/supervisor/options.py", line 164, in usage
self.exit(2)
SystemExit: 2
[28/Jun/2018 13:09:51 +0000] 4154 Dummy-1 daemonize WARNING Stopping daemon.
[28/Jun/2018 13:09:51 +0000] 4154 Dummy-1 agent INFO Stopping agent...
[28/Jun/2018 13:09:51 +0000] 4154 Dummy-1 agent INFO No extant cgroups; unmounting any cgroup roots

 

Master Guru

@ebeb,

 

The stack trace shows that the failure occurred while the agent, acting as a client of the supervisor, tried to read supervisord.conf. Further information is likely in

/var/log/cloudera-scm-agent/supervisord.log or supervisord.out

 

I suggest checking them for clues about the cause.

 

Also, try connecting with a command line utility to see if that gives any more error information:

 

# /usr/lib64/cmf/agent/build/env/bin/supervisorctl -c /var/run/cloudera-scm-agent/supervisor/supervisord.conf

 

 

Expert Contributor

Thanks so much 🙂

Instead of /usr/lib64 I found /usr/lib. After that I ran:

 

/usr/lib/cmf/agent/build/env/bin# ./supervisorctl -c /var/run/cloudera-scm-agent/supervisor/supervisord.conf
Error: could not find config file /var/run/cloudera-scm-agent/supervisor/supervisord.conf
For help, use ./supervisorctl -h

 

Then I ran $ service supervisor status, which reported an unrecognized service.

So I installed supervisor using the command: $ apt-get install supervisor

and started the service:

$ service supervisor start
supervisord is running

 

Now if I run the command below I get a prompt. I tried to re-run install but it still fails with: Failed to receive heartbeat from agent.

 

/usr/lib/cmf/agent/build/env/bin# ./supervisorctl -c /etc/supervisor/supervisord.conf
supervisor>

 

Master Guru

@ebeb,

 

The Cloudera Manager agents use their own supervisor so installing and running the supervisord as a separate service will not help.

 

At this stage, it may actually be reasonable to kill the supervisor, since something is clearly wrong if supervisord.conf does not exist.

 

NOTE:  The following will kill all child processes of the supervisor (including any hadoop processes that are running).

It will also clean out the /var/run/cloudera-scm-agent directory and recreate files from scratch.

 

(1)

Try stopping the agent in a way that will kill the supervisor and any running agent processes:

 

# service cloudera-scm-agent hard_stop_confirmed

 

(2)

 

run:

 

# ps aux |grep supervisord

 

If you see a supervisord process, kill it.

Make sure no supervisord processes are running.

 

(3)

 

Run:

 

# service cloudera-scm-agent clean_start

 

After this, check to see if the agent is heartbeating.

 

I don't often recommend these steps, as there are usually better ways to isolate the root cause, but something very bad seems to have happened to the supervisor and/or its configuration file.
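The check in step (2) can also be done without the grep process showing up in its own results. A minimal sketch, assuming Python 3.7 or later and a `ps` that accepts `aux`; the function names are hypothetical, and the code only reports PIDs without killing anything.

```python
import subprocess


def supervisord_pids(ps_output):
    """Return the PIDs of supervisord processes found in `ps aux` output."""
    pids = []
    for line in ps_output.splitlines():
        fields = line.split(None, 10)  # the 11th field is the full command line
        # Skip short lines and the header row, whose PID column is not a number.
        if len(fields) < 11 or not fields[1].isdigit():
            continue
        command = fields[10]
        # Ignore grep itself, whose command line also contains "supervisord".
        if "supervisord" in command and not command.startswith("grep"):
            pids.append(int(fields[1]))
    return pids


def running_supervisord_pids():
    """Run `ps aux` and return the PIDs of any supervisord processes."""
    result = subprocess.run(["ps", "aux"], capture_output=True, text=True, check=True)
    return supervisord_pids(result.stdout)
```

An empty list from `running_supervisord_pids()` corresponds to the state wanted before step (3).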

 

 

Expert Contributor

Thanks, here are the results now:

 

/var/run# service cloudera-scm-agent hard_stop_confirmed
cloudera-scm-agent is already stopped
supervisord is already stopped

 

# ps aux |grep supervisord
root 55 0.0 0.0 63792 1104 tty2 S 09:42 0:00 grep --color=auto supervisord

 

/var# service cloudera-scm-agent clean_start
Starting cloudera-scm-agent: * cloudera-scm-agent started

 

/var/run/cloudera-scm-agent/supervisor# service cloudera-scm-agent status
Checking for service cloudera-scm-agent * cloudera-scm-agent is not running

 

/var/run/cloudera-scm-agent/supervisor# ll
total 4
drwxr-x--x 0 root root 512 Jun 29 09:56 ./
drwxr-x--x 0 root root 512 Jun 29 09:56 ../
drwxr-x--x 0 root root 512 Jun 29 09:56 include/

 

There are two errors in the /var/log/cloudera-scm-agent.log:

 

[29/Jun/2018 09:56:14 +0000] 187 MainThread filesystem_map INFO Local filesystem types whitelist: ['ext2', 'ext3', 'ext4', 'xfs']
[29/Jun/2018 09:56:14 +0000] 187 MainThread filesystem_map ERROR Error reading partition info
Traceback (most recent call last):
File "/usr/lib/cmf/agent/build/env/lib/python2.7/site-packages/cmf-5.15.0-py2.7.egg/cmf/monitor/host/filesystem_map.py", line 92, in refresh
for p in self.get_all_mounted_partitions():
File "/usr/lib/cmf/agent/build/env/lib/python2.7/site-packages/cmf-5.15.0-py2.7.egg/cmf/monitor/host/filesystem_map.py", line 124, in get_all_mounted_partitions
return psutil.disk_partitions(all=True)
File "/usr/lib/cmf/agent/build/env/lib/python2.7/site-packages/psutil-2.1.3-py2.7-linux-x86_64.egg/psutil/__init__.py", line 1705, in disk_partitions
return _psplatform.disk_partitions(all)
File "/usr/lib/cmf/agent/build/env/lib/python2.7/site-packages/psutil-2.1.3-py2.7-linux-x86_64.egg/psutil/_pslinux.py", line 712, in disk_partitions
partitions = cext.disk_partitions()
OSError: [Errno 2] No such file or directory: '/etc/mtab'
[29/Jun/2018 09:56:14 +0000] 187 MainThread kt_renewer INFO Agent wide credential cache set to /run/cloudera-scm-agent/krb5cc_cm_agent_0

 

 

[29/Jun/2018 09:56:14 +0000] 187 MainThread agent ERROR Failed to connect to previous supervisor.
Traceback (most recent call last):
File "/usr/lib/cmf/agent/build/env/lib/python2.7/site-packages/cmf-5.15.0-py2.7.egg/cmf/agent.py", line 2136, in find_or_start_supervisor
self.configure_supervisor_clients()
File "/usr/lib/cmf/agent/build/env/lib/python2.7/site-packages/cmf-5.15.0-py2.7.egg/cmf/agent.py", line 2317, in configure_supervisor_clients
supervisor_options.realize(args=["-c", os.path.join(self.supervisor_dir, "supervisord.conf")])
File "/usr/lib/cmf/agent/build/env/lib/python2.7/site-packages/supervisor-3.0-py2.7.egg/supervisor/options.py", line 1599, in realize
Options.realize(self, *arg, **kw)
File "/usr/lib/cmf/agent/build/env/lib/python2.7/site-packages/supervisor-3.0-py2.7.egg/supervisor/options.py", line 333, in realize
self.process_config()
File "/usr/lib/cmf/agent/build/env/lib/python2.7/site-packages/supervisor-3.0-py2.7.egg/supervisor/options.py", line 341, in process_config
self.process_config_file(do_usage)
File "/usr/lib/cmf/agent/build/env/lib/python2.7/site-packages/supervisor-3.0-py2.7.egg/supervisor/options.py", line 376, in process_config_file
self.usage(str(msg))
File "/usr/lib/cmf/agent/build/env/lib/python2.7/site-packages/supervisor-3.0-py2.7.egg/supervisor/options.py", line 164, in usage
self.exit(2)
SystemExit: 2
[29/Jun/2018 09:56:14 +0000] 187 Dummy-1 daemonize WARNING Stopping daemon.
[29/Jun/2018 09:56:14 +0000] 187 Dummy-1 agent INFO Stopping agent...
[29/Jun/2018 09:56:14 +0000] 187 Dummy-1 agent INFO No extant cgroups; unmounting any cgroup roots

 

 

Any thoughts?

Master Guru

@ebeb,

 

Actually, I just realized something that is very important. The supervisor stack trace you got is normal if no supervisor is running and no supervisord.conf file exists.  I just tested and I see exactly the same stack trace if I delete my supervisord.conf.

 

The next thing that happens after this exception is that the agent attempts to start a supervisord process.

The first step is to run "mount_tmpfs".  I feel like there must be something going wrong with that codepath because we don't see any other lines after that.

 

I went back and looked at your agent errors and one seems very relevant:

 

OSError: [Errno 2] No such file or directory: '/etc/mtab'

 

It seems your /etc/mtab has gone missing.  I just tested by removing the symbolic link and got exactly the same problem you are seeing.

 

RESOLUTION:

 

recreate /etc/mtab

 

use:

 

# ln -s /proc/self/mounts /etc/mtab

 

NOTE:  you might check first to see if the contents of /proc/self/mounts look right.

 

Hope this does the trick!
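To see what state /etc/mtab is in before recreating the link, a check along these lines could be used. It is only a sketch, assuming Python 3 on the affected host; `mtab_status` and its return labels are made up for illustration.

```python
import os


def mtab_status(mtab="/etc/mtab", target="/proc/self/mounts"):
    """Describe the mtab path as 'missing', 'dangling', 'symlink-ok',
    'symlink-other' or 'regular-file'."""
    if os.path.islink(mtab):
        if not os.path.exists(mtab):
            return "dangling"  # the link exists but its target does not
        # Compare resolved paths so that relative links also count as a match.
        if os.path.realpath(mtab) == os.path.realpath(target):
            return "symlink-ok"
        return "symlink-other"
    if os.path.exists(mtab):
        return "regular-file"
    return "missing"
```

A result of 'missing' or 'dangling' matches the situation described above, where the ln -s command recreates the link.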

 

 

 

Expert Contributor

Yes, that's great info! Once I recreated the link with # ln -s /proc/self/mounts /etc/mtab, the agent started running.

 

/var/log/cloudera-scm-server# service cloudera-scm-agent status
Checking for service cloudera-scm-agent * cloudera-scm-agent is running

 

I think that takes care of this issue. I really appreciate the help 🙂

 

Expert Contributor

One other thing. There seem to have been some issues with the Ubuntu OS; after switching over to CentOS 7.5, the CDH 5.15 install ran without many issues. I have a question, though: the install screens include a DataNode configuration value:

 

DataNode Data Directory

dfs.data.dir, dfs.datanode.data.dir

 

Comma-delimited list of directories on the local file system where the DataNode stores HDFS block data. Typical values are /data/N/dfs/dn for N = 1, 2, 3.... These directories should be mounted using the noatime option, and the disks should be configured using JBOD. RAID is not recommended.

In JBOD mode, say a server has 20 hard disks, so there are 20 mount points, one per disk. I think we need to set this value to the comma-delimited list /data/1/dfs/dn, /data/2/dfs/dn, /data/3/dfs/dn, ..., /data/20/dfs/dn. Now, what happens if the DataNodes have different numbers of JBOD disks, say 20 disks in some and 10 in others? Since dfs.data.dir is a global variable, how are the 20 data directories allocated on DataNodes with only 10 JBOD disks? The variable has no hostname to indicate a different number of disks on different hosts. Also, if new DataNodes with a different number of disks are added in the future, how is this specified when adding them?
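For reference, the comma-delimited value described above can be generated rather than typed by hand. This is a small sketch following the /data/N/dfs/dn pattern from the help text; `datanode_dirs` is a hypothetical helper, and it does not address how hosts with different disk counts are configured.

```python
def datanode_dirs(disk_count, template="/data/{n}/dfs/dn"):
    """Build the comma-delimited DataNode directory list for disks 1..disk_count."""
    if disk_count < 1:
        raise ValueError("disk_count must be at least 1")
    # One directory per disk, joined with commas and no spaces.
    return ",".join(template.format(n=n) for n in range(1, disk_count + 1))
```

Calling `datanode_dirs(20)` yields the 20-entry list for a 20-disk host, and `datanode_dirs(10)` the shorter one for a 10-disk host.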

Thanks!