I read the documentation about permissions in HDFS, and it says:
1) By default, Hadoop determines a user's groups from the output of the Linux "groups" command
2) "supergroup" is the default superuser group
My root directory looks like this:
drwxr-xr-x - hdfs supergroup 0 2015-01-27 23:08 /
So I would assume that only hdfs and users belonging to supergroup would be able to create directories under it. But there is no "supergroup" group on any of my boxes! There is only a "hadoop" group, which contains hdfs, yarn and mapred.
And basically any user that I create can execute hdfs commands, such as hdfs dfs -mkdir /blabla, and do whatever they want. The files created have that user as the owner and supergroup as the group, even though the user belongs to neither "supergroup" nor "hadoop".
How does it work, then? And is there a simple way to prevent this and make it work as in the docs, i.e. make Hadoop respect Linux permissions (the only access to the cluster is through a box managed by us anyway, so this would be enough)?
Well, this is embarrassing - I just saw in Cloudera Manager for my cluster that security (dfs.permissions) is set to false... That explains everything.
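To make the behavior above concrete, here is a minimal sketch of the permission model as I understand it. It is a toy model, not HDFS code: the names Inode, may_write and create_child are made up for illustration, the superuser is assumed to be "hdfs" (in reality it is whichever user started the NameNode), and only a single directory is checked rather than the whole path.

```python
from dataclasses import dataclass


@dataclass
class Inode:
    owner: str
    group: str
    mode: int  # POSIX-style permission bits, e.g. 0o755


def may_write(user, user_groups, inode, permissions_enabled=True,
              superuser="hdfs", supergroup="supergroup"):
    """Toy write check on one directory (hypothetical helper, not an HDFS API)."""
    if not permissions_enabled:
        return True  # checking switched off: nobody is refused
    if user == superuser or supergroup in user_groups:
        return True  # superusers bypass permission checks
    if user == inode.owner:
        return bool(inode.mode & 0o200)  # owner write bit
    if inode.group in user_groups:
        return bool(inode.mode & 0o020)  # group write bit
    return bool(inode.mode & 0o002)      # other write bit


def create_child(user, parent):
    """New entries get the creating user as owner and the parent's group."""
    # 0o755 is a placeholder; the real mode depends on the client's umask.
    return Inode(owner=user, group=parent.group, mode=0o755)


# The root directory from the listing: drwxr-xr-x hdfs supergroup /
root = Inode(owner="hdfs", group="supergroup", mode=0o755)
```

With permissions_enabled=False every call to may_write returns True, which matches any user being able to mkdir under /. The group of a new entry is copied from the parent directory rather than looked up in Linux, which would explain why "supergroup" shows up on files even though no such Linux group exists.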
HOWEVER: the reason I was confused is that I couldn't see this property set in any of the conf files (grep dfs.permissions /etc/hadoop/conf/*.xml). And according to the documentation, the default value is true. Could anyone please let me know where this property gets overridden?
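For what it's worth, an empty grep only shows that the files under /etc/hadoop/conf do not set the property; it says nothing about the configuration the NameNode itself was started with. A small sketch of the same lookup in Python, where lookup_property and the SAMPLE content are made up for illustration:

```python
import xml.etree.ElementTree as ET


def lookup_property(xml_text, name, default=None):
    """Return the value of a Hadoop-style <property>, or default if unset."""
    root = ET.fromstring(xml_text)
    for prop in root.iter("property"):
        if prop.findtext("name", "").strip() == name:
            return prop.findtext("value", "").strip()
    return default


# Hypothetical hdfs-site.xml content that does not mention dfs.permissions.
SAMPLE = """<configuration>
  <property>
    <name>dfs.replication</name>
    <value>3</value>
  </property>
</configuration>"""

# The property is absent here, so the caller falls back to the documented default.
value = lookup_property(SAMPLE, "dfs.permissions", default="true")
```

To check a real file, read its contents and pass them in as xml_text. Run against the file the NameNode actually uses, the same lookup should return the overriding value instead of the default.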
Administrators are sometimes surprised that modifying /etc/hadoop/conf and then restarting HDFS has no effect.
Oh yes. OK, I think I understand now where the server-side configuration comes from. I can find it in CM, although I still have a bit of a problem finding it on the filesystem. When I go to my NameNode I see this:
root@node9:/var/run/cloudera-scm-agent/process# ls -qlrt | grep NAME
drwxr-x--x 3 hdfs hdfs 420 Sep 28 12:40 4035-hdfs-NAMENODE
drwxr-x--x 3 hdfs hdfs 420 Sep 28 13:17 4091-hdfs-NAMENODE-refresh
drwxr-x--x 3 hdfs hdfs 420 Sep 28 13:18 4092-hdfs-NAMENODE-monitor-decommissioning
drwxr-x--x 3 hdfs hdfs 420 Sep 28 13:24 4097-hdfs-NAMENODE-refresh
drwxr-x--x 3 hdfs hdfs 420 Sep 28 13:25 4098-hdfs-NAMENODE-monitor-decommissioning
drwxr-x--x 3 hdfs hdfs 420 Sep 28 13:25 4100-hdfs-NAMENODE-refresh
drwxr-x--x 3 hdfs hdfs 420 Sep 28 13:30 4149-hdfs-NAMENODE-refresh
drwxr-x--x 3 hdfs hdfs 420 Sep 28 13:30 4150-hdfs-NAMENODE-monitor-decommissioning
drwxr-x--x 3 hdfs hdfs 420 Sep 28 13:30 4152-hdfs-NAMENODE-refresh
drwxr-x--x 3 hdfs hdfs 420 Sep 28 13:46 4167-hdfs-NAMENODE-createdir
drwxr-x--x 3 hdfs hdfs 420 Sep 28 13:50 4185-hdfs-NAMENODE
drwxr-x--x 3 hdfs hdfs 420 Jan 26 16:19 4785-hdfs-NAMENODE-refresh
drwxr-x--x 3 hdfs hdfs 420 Jan 26 16:19 4787-hdfs-NAMENODE-monitor-decommissioning
So, which of these contains the hdfs-site.xml of my currently running NameNode?
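One way to narrow it down, as a rough sketch only: assume that directories named exactly <number>-hdfs-NAMENODE (no -refresh, -createdir or similar suffix) belong to starts of the role itself, and that a higher number means a later start. That matches the timestamps in the listing (4035 at 12:40, 4185 at 13:50), but it is an assumption and not something confirmed in this thread. The helper name latest_role_dir is made up.

```python
import re


def latest_role_dir(names, role="hdfs-NAMENODE"):
    """Pick the highest-numbered directory named exactly '<number>-<role>'."""
    pattern = re.compile(r"^(\d+)-" + re.escape(role) + r"$")
    best = None
    for name in names:
        match = pattern.match(name)
        if match:
            number = int(match.group(1))
            if best is None or number > best[0]:
                best = (number, name)
    return best[1] if best else None


# A few directory names taken from the listing above.
names = [
    "4035-hdfs-NAMENODE",
    "4091-hdfs-NAMENODE-refresh",
    "4167-hdfs-NAMENODE-createdir",
    "4185-hdfs-NAMENODE",
    "4785-hdfs-NAMENODE-refresh",
    "4787-hdfs-NAMENODE-monitor-decommissioning",
]
candidate = latest_role_dir(names)
```

For this listing the candidate is 4185-hdfs-NAMENODE. It would still be worth confirming against the running NameNode process before trusting the hdfs-site.xml found there.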
But this still doesn't answer the original question - why does dfs.permissions.superusergroup default to supergroup when CM doesn't create that group in Linux?
In our case, we also found discrepancies between the default Hadoop users/groups that were actually created and those described in the documentation (Guide to Special Users in the Hadoop Environment). For example, the hdfs user is not assigned to the hdfs group, and the mapred group is not created at all. We are running CDH 5.2.0 on Debian.
I've done my share of research and still find the HDFS user/group permission mechanism confusing. For a plain Linux CDH installation without Kerberos (the majority, I believe), HDFS relies on the Unix user/group permission mechanism but interprets it in its own way. Thus the confusing and unintuitive behaviors:
Perhaps Cloudera can write a more understandable adaptation of the Apache HDFS document.