HDP Sandbox v3.0.1 on VirtualBox v6.1.6
I'm running the following HiveQL as user maria_dev from the command line using hive -f codefile.sql:
create database if not exists db_cceng
comment 'CC database by E. Gall'
location '/user/maria_dev/db_cceng';
However, I get the following error message:
0: jdbc:hive2://sandbox-hdp.hortonworks.com:2> create database if not exists db_cceng
. . . . . . . . . . . . . . . . . . . . . . .> comment 'CC database by E. Gall'
. . . . . . . . . . . . . . . . . . . . . . .> location '/user/maria_dev/db_cceng';
Error: Error while compiling statement: FAILED: HiveAccessControlException Permission denied: user [hive] does not have [ALL] privilege on [hdfs://sandbox-hdp.hortonworks.com:8020/user/maria_dev/db_cceng] (state=42000,code=40000)
Closing: 0: jdbc:hive2://sandbox-hdp.hortonworks.com:2181/default;password=hive;serviceDiscoveryMode=zooKeeper;user=hive;zooKeeperNamespace=hiveserver2
The directory permissions for HDFS subfolder /user/maria_dev are:
drwxrwxr-x - maria_dev hdfs 0 2020-05-21 13:08 /user/maria_dev
If I run the above code on the HDP Sandbox v2.6.5, it works fine.
I'm unsure why the message refers to user [hive] - is it trying to execute the code as user hive instead of maria_dev? If so, are there any extra environment variables I need to set to indicate that the HiveQL file should run as user maria_dev?
After further reading, here are the ideas I explored - but still no solution.
1) Whether Hive needs to be started as an end user other than hive – I'm guessing this shouldn't be required. See: https://docs.cloudera.com/HDPDocuments/HDP3/HDP-3.0.1/starting-hive/content/hive_start_hive_as_autho...
2) Checking that the ACL settings for subfolders created within hdfs:/user/maria_dev are appropriate to allow user hive full access. However, I confirmed that:
[root@sandbox-hdp ~]# hdfs dfs -getfacl -R /user/maria_dev
# file: /user/maria_dev
# owner: maria_dev
# group: hdfs
3) A suitable resource-based authorisation policy needs to be configured in Ranger. I'm guessing this would only be relevant after the db is created. Plus, the current settings allow user hive to access any database in Hive.
So, I'm not sure what else needs to be checked.
In the interim, I've just removed the LOCATION clause in my CREATE DATABASE statement to allow the db to be created in the managed area, so I can get on with my learning. However, I'd appreciate any guidance on how folks create dbs in specific, external locations using Hive 3.0.1 or above. I'm using HDP Sandbox to learn about these products, so any pointers to good books/blogs for Hive 3+ config & use would be helpful (I'm using O'Reilly's Programming Hive by Edward Capriolo, Dean Wampler, Jason Rutherglen, which is good but published in 2012).
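For reference, the interim workaround statement looks like this (with no LOCATION clause, Hive places the database under the default managed warehouse path, which in HDP 3 is typically under hdfs:/warehouse/tablespace/):

```sql
-- Interim workaround: omit LOCATION so the database is created in the
-- managed warehouse area rather than /user/maria_dev/db_cceng.
CREATE DATABASE IF NOT EXISTS db_cceng
COMMENT 'CC database by E. Gall';
```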
Solution: Check ACLs for HDFS folders & set access for user account hive if required.
It seems that Hive 3.1.0 executes HiveQL commands as user hive even if you launch them from the Linux command line with Beeline as another user (e.g. maria_dev). Thus, the hive account needs access to any HDFS resources required (i.e. folders & files). Access is also governed by HDFS access control lists (ACLs); sometimes just setting the right group access via chmod isn't enough.
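If I understand the docs correctly, this matches HiveServer2's impersonation setting: with hive.server2.enable.doAs=false (the HDP 3 default), queries execute as the hive service user rather than the connecting end user. The setting lives in hive-site.xml (shown for context only; Cloudera's guidance, as I read it, is to leave doAs off and manage access via Ranger):

```xml
<!-- hive-site.xml: with doAs disabled (HDP 3 default), HiveServer2
     runs all queries as the hive service user, not the end user. -->
<property>
  <name>hive.server2.enable.doAs</name>
  <value>false</value>
</property>
```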
Use the following to check ACL permissions:
hdfs dfs -getfacl <some_hdfs_path>
Use the following to set ACL permissions:
hdfs dfs -setfacl -m user:hive:rwx <some_hdfs_path>
However, sometimes this still didn't work as expected. For example, after setting the following default ACL to ensure any future files get the right ACLs, I'd find they'd randomly fail to be set:
hdfs dfs -setfacl -R -m default:user:hive:rwx <some_hdfs_path>
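Given those intermittent failures, it's worth scripting a sanity check of the getfacl output after each setfacl, rather than trusting it. A minimal Python sketch (the parsing assumes the standard getfacl entry format; the sample path and output are illustrative, and note a restrictive mask:: line can still reduce effective permissions even when the user entry shows rwx):

```python
# Sketch: check whether a user has an explicit rwx entry in the output of
# `hdfs dfs -getfacl <path>`. Sample output below is illustrative.

def user_has_rwx(getfacl_output: str, user: str) -> bool:
    """Return True if `user` has an explicit user:<user>:rwx ACL entry."""
    for line in getfacl_output.splitlines():
        line = line.strip()
        # Access entries look like "user:hive:rwx"; this deliberately skips
        # comment lines ("# owner: ...") and default entries ("default:user:...").
        if line.startswith(f"user:{user}:"):
            perms = line.split(":")[2]
            return perms == "rwx"
    return False

sample = """\
# file: /user/maria_dev/db_cceng
# owner: maria_dev
# group: hdfs
user::rwx
user:hive:rwx
group::r-x
mask::rwx
other::r-x
"""
print(user_has_rwx(sample, "hive"))  # prints: True
```

In practice one could feed this the output of `subprocess.run(["hdfs", "dfs", "-getfacl", path], ...)` and retry the setfacl when the check fails.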
Then I started getting several other issues while trying to learn Hive with DAS & Spark SQL with Zeppelin.
In Data Analytics Studio, I would be able to create my training database and add tables to it, but was never able to view any tables via the DAS UI (unless I ran the show tables command).
Also, YARN resources were hitting 100% while executing simple queries with a small dataframe, just before services seemed to crash (the remote Zeppelin, Ambari & Shell-in-a-Box sessions all disconnected even though the VM still appeared to be running in VirtualBox). Then all services failed to restart this morning.
Thus, I deleted the VM & re-installed HDP Sandbox 3.0.1.
Straight after re-installing and re-creating my training database, I was able to view all tables created in it and Spark SQL is working fine.
While re-creating the VM is fine in this instance, if anyone can direct me to reading material on how folks handle such problems in live production environments, I'd appreciate it. I'm guessing re-installing the VM is a last resort...