So I am upgrading another cluster from 14.04 to 16.04 and managed to sail past most of the issues I had on my other cluster. HOWEVER... I now have a chicken-and-egg problem: one of my hosts has no heartbeat because the cloudera-scm-agent process fails to start, with /var/log/cloudera-scm-agent/supervisord.log showing the datetime error. I know this can be fixed by re-running the upgrade wizard across all hosts; however, the upgrade wizard will not run on a host without a heartbeat, which this host lacks because it needs its parcels updated... >.< (In retrospect, I don't really understand how running the upgrade wizard fixes the datetime error, unless the lost heartbeat is caused by something else entirely, but I can't find information on that.)
Setup: Cloudera Express, parcel installation. Cloudera Manager server 5.15.0, agents 5.14.2. Right now, all services and roles are stopped. I have HA set up, and the problem server is (of course) the primary NameNode, which also hosts all of the Cloudera Management Services.
Is there another way to update the parcels on a host without a heartbeat? My only remaining idea is to try to move all roles off: move everything over to the failover node -27 (see below, the first listed with 3 roles), undo the HA setup, decommission the -28 host (the second one, which was the primary NameNode when I stopped everything), and then try to recommission it. I'm not sure that will work, because I've already tried a simple decommission/recommission on -28 without reworking the roles and it didn't do anything.
I have also checked that /etc/apt/sources.list.d/ is up to date and congruent with the other hosts (it seems that even with a parcel installation, the setup still adds its own repository entries). However, apt-get install --reinstall cloudera-scm-agent didn't help. (All the hosts are currently on 16.04 and thoroughly apt update/dist-upgraded.)
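For reference, this is roughly how I compared the repo entries and reinstalled the agent (hostnames b10-27/b10-28 are from this cluster; the exact commands are a sketch):

```shell
# Compare the Cloudera repo entry on the bad node against a known-good host
ssh b10-27 cat /etc/apt/sources.list.d/cloudera-manager.list > /tmp/good.list
ssh b10-28 cat /etc/apt/sources.list.d/cloudera-manager.list > /tmp/bad.list
diff /tmp/good.list /tmp/bad.list && echo "repo entries match"

# Then force a reinstall of the agent from whatever the repo provides
ssh b10-28 'apt-get update && apt-get install --reinstall -y cloudera-scm-agent'
```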
Are there any other approaches I could try?
Also, in case it's useful:
root@b10-28:/var/run/cloudera-scm-agent# service cloudera-scm-agent status
● cloudera-scm-agent.service - LSB: Cloudera SCM Agent
Loaded: loaded (/etc/init.d/cloudera-scm-agent; bad; vendor preset: enabled)
Active: active (exited) since Thu 2018-06-28 16:17:20 PDT; 18h ago
Process: 1101 ExecStart=/etc/init.d/cloudera-scm-agent start (code=exited, status=0/SUCCESS)
Jun 28 16:17:14 b10-28 systemd: Starting LSB: Cloudera SCM Agent...
Jun 28 16:17:14 b10-28 su: PAM unable to dlopen(pam_ck_connector.so): /lib/security/pam_ck_conne
Jun 28 16:17:14 b10-28 su: PAM adding faulty module: pam_ck_connector.so
Jun 28 16:17:14 b10-28 su: Successful su for root by root
Jun 28 16:17:14 b10-28 su: + ??? root:root
Jun 28 16:17:14 b10-28 su: pam_unix(su:session): session opened for user root by (uid=0)
Jun 28 16:17:20 b10-28 su: pam_unix(su:session): session closed for user root
Jun 28 16:17:20 b10-28 cloudera-scm-agent: Starting cloudera-scm-agent: * cloudera-scm-agent st
Jun 28 16:17:20 b10-28 systemd: Started LSB: Cloudera SCM Agent.
Jun 29 10:29:09 b10-28 systemd: Started LSB: Cloudera SCM Agent.
root@b10-28:/var/run/cloudera-scm-agent# tail /var/log/supervisor/supervisord.log
2018-06-28 16:11:11,054 CRIT Supervisor running as root (no user in config file)
2018-06-28 16:11:11,054 WARN No file matches via include "/etc/supervisor/conf.d/*.conf"
2018-06-28 16:11:11,074 INFO RPC interface 'supervisor' initialized
2018-06-28 16:11:11,074 CRIT Server 'unix_http_server' running without any HTTP authentication checking
2018-06-28 16:11:11,074 INFO supervisord started with pid 4048
2018-06-28 16:17:17,712 CRIT Supervisor running as root (no user in config file)
2018-06-28 16:17:17,713 WARN No file matches via include "/etc/supervisor/conf.d/*.conf"
2018-06-28 16:17:17,801 INFO RPC interface 'supervisor' initialized
2018-06-28 16:17:17,801 CRIT Server 'unix_http_server' running without any HTTP authentication checking
2018-06-28 16:17:17,801 INFO supervisord started with pid 1091
root@b10-28:/var/run/cloudera-scm-agent# tail /var/log/cloudera-scm-agent/supervisord.out
Traceback (most recent call last):
File "/usr/lib/cmf/agent/build/env/bin/supervisord", line 8, in <module>
from pkg_resources import load_entry_point
File "/usr/lib/cmf/agent/build/env/lib/python2.7/site-packages/pkg_resources/__init__.py", line 36, in <module>
File "/usr/lib/python2.7/plistlib.py", line 62, in <module>
ImportError: No module named datetime
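A quick way to confirm where the import failure lives (the env path is taken from the traceback above; the fallback message is just illustrative):

```shell
# The system Python should import datetime without trouble (it is a built-in module)
python2.7 -c 'import datetime' && echo "system python OK"

# The agent ships its own virtualenv; if that env was built against a
# different Python build, imports of built-in modules like datetime can break
/usr/lib/cmf/agent/build/env/bin/python -c 'import datetime' \
  || echo "agent env broken"
```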
The problem here is likely a mismatch between the Cloudera Manager distribution you are using and the version of Ubuntu.
# dpkg -l | grep cloudera
Make sure you have the "xenial" packages. If not, you will need to upgrade the Cloudera Manager agent and daemons packages on that host.
It sounds as if you have multiple nodes, so if you have upgraded the OS to 16.04 on all hosts, compare the packages.
Since the packages are built for the operating system on which they are intended to run, you also need to make sure the expected default version of Python exists on that host. For Ubuntu 16.04 I think it is 2.7.11.
At the least it is worth checking and verifying.
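A sketch of that comparison, run on each host (Ubuntu uses dpkg rather than rpm; the output format assumes a standard dpkg listing):

```shell
# List installed Cloudera packages with their versions;
# lines starting with "ii" are installed packages
dpkg -l 'cloudera*' 2>/dev/null | awk '/^ii/ {print $2, $3}'

# Confirm the default Python the packages expect is present
python2.7 --version
```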
Ugh, solved this. I compared the contents of /etc/apt/sources.list.d/cloudera-manager.list and my bad node was one version off from the rest. Changed it to match, ran apt update / apt dist-upgrade, and rebooted. Now updating all the parcels and such.
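Roughly what the fix came down to (the version strings are examples based on the CM server/agent versions above; substitute whatever your other hosts actually have):

```shell
# The bad node's repo entry still pointed at the old agent version;
# rewrite it to match the rest of the cluster
sudo sed -i 's/cm5\.14\.2/cm5.15.0/g' /etc/apt/sources.list.d/cloudera-manager.list

sudo apt update
sudo apt dist-upgrade -y
sudo reboot
```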