The "you are a Hue admin..." error is pretty generic and means that something went wrong when trying to access HDFS. I suggest checking your /var/log/runcpserver.log file for error messages and stack traces logged at the time the UI error appears. That will help us start investigating the cause. We won't know what will solve your problem until we know what is causing it.
After migrating to CM 6.2.0, we are having a similar problem for the access of users in Hue.
We pinned it down to the /var/run/hue/hue_krb5_ccache file disappearing, which produces the error
GSSError: (('Unspecified GSS failure. Minor code may provide more information', 851968), ('No Kerberos credentials available (default cache: /var/run/hue/hue_krb5_ccache)', -1765328243))
in /var/log/hue/error.log on the Hue servers (our configuration has one load balancer and four backends).
We have been forced to regenerate the keytabs for the ticket renewers and restart Hue, just to make the system work for a few hours, until the Kerberos cache disappears again.
We are currently trying to monitor what is actually accessing or removing the ccache.
What "system" are your users logging out of? If it is your OS, it sounds like a kdestroy is being run. Your users' cache should not be tied to Hue's cache, which is intended only for the Hue process on that host. Hue does not execute a "kdestroy" command, so it is very unlikely that Hue is removing the credentials cache.
Happy to help, but I think we need some more details of what you are observing to understand the problem.
I meant people logging in to or out of the Hue WebUI, authenticated using the desktop.auth.backend.PamBackend
(and pam login using an LDAP backend)
The system used to work regularly with no issue until we migrated from CM6.1 to CM6.2 a few weeks ago.
We activated "auditd" to trace the access to the ccache file, and the audit log shows access to the ccache when users "log out" from the Hue WebUI
time->Thu Jul 18 15:31:05 2019
type=PROCTITLE msg=audit(1563456665.989:2959): proctitle=707974686F6E322E37002F6F70742F636C6F75646572612F70617263656C732F4344482D362E322E302D312E636468362E322E302E70302E3936373337332F6C69622F6875652F6275696C642F656E762F62696E2F6875650072756E6368657272797079736572766572
type=PATH msg=audit(1563456665.989:2959): item=1 name="/var/run/hue/hue_krb5_ccache" inode=1169 dev=00:17 mode=0100600 ouid=982 ogid=981 rdev=00:00 nametype=DELETE cap_fp=0000000000000000 cap_fi=0000000000000000 cap_fe=0 cap_fver=0
type=PATH msg=audit(1563456665.989:2959): item=0 name="/var/run/hue/" inode=1150 dev=00:17 mode=040755 ouid=982 ogid=981 rdev=00:00 nametype=PARENT cap_fp=0000000000000000 cap_fi=0000000000000000 cap_fe=0 cap_fver=0
type=CWD msg=audit(1563456665.989:2959): cwd="/run/cloudera-scm-agent/process/11096-hue-HUE_SERVER"
type=SYSCALL msg=audit(1563456665.989:2959): arch=c000003e syscall=87 success=yes exit=0 a0=7f9a9c33f820 a1=2 a2=1 a3=0 items=2 ppid=19438 pid=19449 auid=4294967295 uid=982 gid=981 euid=982 suid=982 fsuid=982 egid=981 sgid=981 fsgid=981 tty=(none) ses=4294967295 comm="python2.7" exe="/opt/cloudera/parcels/CDH-6.2.0-1.cdh6.2.0.p0.967373/lib/hue/build/env/bin/python2.7" key="hue_krb5_ccache
It seems that the process deleting the file is
\_ python2.7 /opt/cloudera/parcels/CDH-6.2.0-1.cdh6.2.0.p0.967373/lib/hue/build/env/bin/hue runcherrypyserver
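For reference, the proctitle field in the audit record is the hex-encoded command line of the process, with arguments separated by NUL bytes, and syscall=87 on arch=c000003e (x86_64) is unlink. A minimal sketch for decoding the field follows; the function name decode_proctitle is only illustrative and not part of any audit tooling.

```python
def decode_proctitle(hex_value):
    """Decode an auditd PROCTITLE hex string into a list of arguments."""
    raw = bytes.fromhex(hex_value)
    # The kernel records argv with NUL bytes between the arguments.
    return [part.decode("utf-8", errors="replace")
            for part in raw.split(b"\x00") if part]


# proctitle value copied from the audit record above, split for readability
proctitle = (
    "707974686F6E322E3700"
    "2F6F70742F636C6F75646572612F70617263656C732F"
    "4344482D362E322E302D312E636468362E322E302E70302E393637333733"
    "2F6C69622F6875652F6275696C642F656E762F62696E2F68756500"
    "72756E6368657272797079736572766572"
)

for arg in decode_proctitle(proctitle):
    print(arg)
```

The decoded arguments are the same python2.7 ... hue runcherrypyserver command line shown in the process listing, which confirms that the unlink came from the Hue server process itself.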
We tried to better pin down the specific activity from the source code, but with no success.
What other details do you think you need? We could include more renewer log errors, but those only say that the file no longer exists. From the Cloudera Manager point of view, everything is green and working correctly, because the processes are all active.
WOW! I see what you mean!
How did you track this down to happening when a user logs out of Hue?
I have to admit that PAM auth is used less than most other forms of authentication, so it is possible that no one has hit some sort of issue or bug.
If you have the runcpserver.log that shows time when a user is logging out, I'd be interested in seeing it in case it yields any clues.
In the meantime, I have a cluster I haven't upgraded to CDH 6.2 yet, so I'll set up PAM auth in 6.1.1 and then upgrade just to see. Might be a couple days, but I'll let you know if I think of anything in the meantime.
@mmmunafo, I upgraded to 6.2 and PAM using the default "login" module doesn't cause the credentials cache to be deleted. Any other ideas of how these two things could be tied in? Perhaps the PAM module is doing something special, but it is the PID of the Hue process that is in the audit...
Thank you for your answer.
We (me and my colleagues) were able to reproduce the issue, pinning it down to PAM.
First of all, we are using Ubuntu 18.04 and CM 6.2.0, and this might be related to the error, since we upgraded from Ubuntu 16.04 and CM 6.1.0 to the current configuration in a single step.
The problem seems to be tied to the way a library inside Hue uses PAM.
The error is triggered by the PAM login authentication.
Informally, as far as we understand, when Hue starts, it initializes its own Kerberos credential cache in
/var/run/hue/hue_krb5_ccache, as defined by the ccache_path Hue configuration parameter and mirrored in the KRB5CCNAME environment variable. This is also the ccache known to the ticket renewer.
What seems to be happening is that, after the Hue server initialization, KRB5CCNAME is left in the process environment and is inherited by the Python pam.authenticate function (called at line 329 of backend.py).
When any user logs in from the WebUI, the PAM login rule passes that environment to its auth/krb5 component, which tries to reuse the /var/run/hue/hue_krb5_ccache file as if it belonged to the user.
Since that file holds the hue tickets, this operation fails: PAM/krb5 invalidates the file, deleting it, and
then falls back to the next possible file location (usually /tmp/krb5_pam_XXXXXX).
The /var/run/hue/hue_krb5_ccache file has now been deleted, and the ticket renewer no longer finds any ticket to renew.
The user is logged in, but the Hue server has lost the tickets it needs to administer all the other services.
Probably, while KRB5CCNAME must be set for Hue to interact correctly with the other components of the Cloudera system, it should be removed from os.environ while interacting with the PAM subsystem, which should instead rely on the local OS environment.
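As an untested sketch of that idea, not Hue's actual code: a small context manager can hide KRB5CCNAME for the duration of the PAM call and restore it afterwards. The names without_env and pam_authenticate_isolated are made up for illustration, and authenticate_fn stands in for whatever PAM call the backend really makes.

```python
import os
from contextlib import contextmanager


@contextmanager
def without_env(name):
    """Temporarily remove one variable from the process environment."""
    saved = os.environ.get(name)
    if saved is not None:
        # Deleting from os.environ also unsets the variable for C libraries.
        del os.environ[name]
    try:
        yield
    finally:
        if saved is not None:
            os.environ[name] = saved


def pam_authenticate_isolated(authenticate_fn, username, password):
    # authenticate_fn is a placeholder (assumption) for the real PAM call.
    with without_env("KRB5CCNAME"):
        return authenticate_fn(username, password)
```

One caveat: os.environ is process-wide, so in a multi-threaded server any other thread making a Kerberos call during that window would not see KRB5CCNAME either. A lock around both code paths, or running the PAM check in a separate process, might be needed.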
We are still trying to find a local workaround for the issue, but we hope this info can help fix the bug.
Thanks for the clear description. I did notice, when perusing Hue source, that KRB5CCNAME was set and was curious why. It was the only thing I could think of that would be influencing PAM behavior.
Very cool... let me see if I can figure out why we are adding KRB5CCNAME and if we can get rid of it
Not looking good... workarounds may become complex.
Hue uses requests-kerberos and its readme says:
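In order to use this library, there must already be a Kerberos Ticket-Granting Ticket(TGT) cached in a Kerberos credential cache. Whether a TGT is available can be easily determined by running the klist command. If no TGT is available, then it first must be obtained by running the kinit command, or pointing the $KRB5CCNAME to a credential cache with a valid TGT.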
That's why Hue sets KRB5CCNAME.
I tried not setting it (commented it out of settings.py), and the Kerberos functions looked for the credentials cache in /tmp.
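That matches the usual MIT Kerberos lookup order: KRB5CCNAME is used if it is set, and otherwise the conventional default is a per-user file under /tmp. A rough sketch of that lookup, assuming no default_ccache_name is configured in krb5.conf (the function name effective_ccache is hypothetical):

```python
import os


def effective_ccache(environ=None, uid=None):
    """Return the credential cache a Kerberos library would likely use.

    Assumption: krb5.conf does not override default_ccache_name, so the
    fallback is the conventional FILE:/tmp/krb5cc_<uid>.
    """
    environ = os.environ if environ is None else environ
    uid = os.getuid() if uid is None else uid
    name = environ.get("KRB5CCNAME")
    if name:
        return name
    return "FILE:/tmp/krb5cc_%d" % uid


# With the variable set, as Hue does for its own process:
print(effective_ccache({"KRB5CCNAME": "/var/run/hue/hue_krb5_ccache"}, uid=982))
# Without it, the lookup falls back to /tmp:
print(effective_ccache({}, uid=982))
```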
What is most surprising about this is that I don't think there were any changes in Hue that would influence PAM auth behavior. KRB5CCNAME has always been set by Hue for the hue process.
Since you upgraded OS version too, I wonder if the pam module behavior has changed.
Is there any chance you can use SPNEGO or LDAP auth directly?