10-01-2017 03:10 PM
The reducer part files in Hive that are written by Apache Crunch pipelines have the
correct username, but the group name is marked as '*supergroup*', a group to which
the user does not belong.
How does Apache Crunch derive group membership? Also, is there a way to
specify the group to be used for those part files in Apache Crunch?
10-02-2017 02:24 PM
In general, if your environment is Kerberized and you kinit with hdfs.keytab, any folder you create will show hdfs as the owner and supergroup as the group. The supergroup name comes from your setting in Cloudera Manager -> HDFS -> Configuration -> dfs.permissions.superusergroup
drwxr-xr-x - hdfs supergroup 0 2016-10-02 11:45 /user
But the folder underneath it should have your userid/batchid as both owner and group.
drwxr-xr-x - abc123 abc123 0 2017-08-11 16:38 /user/abc123
But in some cases the userid/batchid folder has supergroup as its group, and then files created inside that folder might belong to supergroup as well, because in HDFS a new file takes the group of its parent directory. Please work with your admin to change the group of the parent folder, or change the owner and group of the file as follows; it may help:
sudo -u hdfs hdfs dfs -chown abc123:abc123 /user/abc123/file1.txt
10-09-2017 03:51 AM
On a closer look, the issue was traced to the tmp folders of the Crunch pipelines. By default, Crunch uses the /tmp directory to store intermediate and final output, and only at the end copies the output to the actual destination. The group of /tmp is supergroup, so the part files are created with that group, and the group stays the same when they are copied to the actual destination.
The solution we are planning is to change the tmp directory location to another directory that has the correct group owner.
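A rough sketch of that change, in case it helps someone else. The path /user/abc123/crunch-tmp, the jar name and the class name are made up for illustration. `crunch.tmp.dir` is, as far as I know, the property Crunch reads for its temp location, and the `-D` form only takes effect if the pipeline driver is launched through ToolRunner, so please verify both against your Crunch version.

```shell
# Check which group /tmp carries (the source of the wrong group on the part files)
hdfs dfs -ls -d /tmp

# Create a temp directory under a parent that already has the right group;
# new directories in HDFS take the group of their parent directory.
# /user/abc123/crunch-tmp is a hypothetical path.
hdfs dfs -mkdir -p /user/abc123/crunch-tmp
hdfs dfs -ls -d /user/abc123/crunch-tmp

# Point Crunch at the new temp directory when launching the pipeline.
# my-pipeline.jar and com.example.MyPipeline are placeholders, and this
# assumes the driver parses generic options via ToolRunner.
hadoop jar my-pipeline.jar com.example.MyPipeline \
  -Dcrunch.tmp.dir=/user/abc123/crunch-tmp
```

If the driver does not use ToolRunner, the same property can be set on the Hadoop Configuration object that is passed to the pipeline.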