Our Hive is configured with LDAP authentication and doAs disabled (as described in scenario 1 of this lovely article: http://hortonworks.com/blog/best-practices-for-hive-authorization-using-apache-ranger-in-hdp-2-2/ ).
This means that all jobs submitted to HiveServer2 run as user "hive". In Tez View it is plain to see that every job is submitted by "hive". To see the real end user who authenticated, we use Audit in Ranger. The problem is that if all jobs are submitted as user "hive", queue mapping is irrelevant. Does this mean that everyone who wants to use queue mapping needs to enable doAs? Or am I missing something? Adi
When doAs is set to false, queries execute as the Hive user and not the end user. When multiple queries run as the Hive user, they can share resources. Otherwise, YARN does not allow resources to be shared across different users. When the Hive user executes all of the queries, a Tez session that was opened for one query and is still holding onto resources can use those resources for the next query without re-allocation.
@Neeraj Sabharwal Thank you for the referrals. I read them, but I still don't quite understand. I have configured queues according to the best practices, which means allocating resources based on departments (hence users) and leaving a default queue with minimal resources so that jobs that are not configured get minimal resources. (Quote from the second URL you posted: "The "default" queue should be left with minimal capacity to ensure jobs that aren't configured, get only minimal resources"). But because "hive" is the user for all jobs (due to doAs being disabled), all of the jobs end up in the default queue.
Unfortunately that is the case. If you want queue mapping you need to set doAs to true. Hive queries can still be placed into queues with the set tez.queue ... parameter, but it's a manual step.
When hive impersonation is set to false (as it is in your case), all user jobs submitted to HiveServer2 run as the 'hive' user and fall into whichever YARN queue the hive user has been mapped to.
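To make the effect concrete, here is a minimal Python sketch of how a first-match-wins user/group queue mapping behaves. It is loosely modeled on the Capacity Scheduler's `yarn.scheduler.capacity.queue-mappings` syntax (`u:user:queue`, `g:group:queue`). The user names, groups, and queue names are made up for illustration, and `resolve_queue` is a hypothetical helper, not YARN's actual code.

```python
def resolve_queue(mappings, user, groups, default="default"):
    """Return the queue for `user` using first-match-wins mapping rules."""
    for rule in mappings.split(","):
        kind, name, queue = rule.strip().split(":")
        if kind == "u" and name == user:
            return queue
        if kind == "g" and name in groups:
            return queue
    # No rule matched: the job lands in the fallback queue.
    return default


# Hypothetical mapping: one user rule and one group rule.
MAPPINGS = "u:alice:marketing,g:finance:finance"

# doAs=true: YARN sees the real end user, so the rules apply.
print(resolve_queue(MAPPINGS, "alice", ["staff"]))    # marketing
print(resolve_queue(MAPPINGS, "bob", ["finance"]))    # finance

# doAs=false: YARN only ever sees "hive", whoever ran the query.
print(resolve_queue(MAPPINGS, "hive", ["hadoop"]))    # default
```

With impersonation off, the only way to steer jobs through the mapping itself would be a rule for the "hive" user, which sends every end user's queries to the same queue.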
As you've seen, we recommend this approach for:
The drawback here is that end users aren't correctly mapped to YARN queues - which poses a problem for managing a multi-tenant cluster. (I've previously posted this as a feature suggestion here - hoping if there is enough interest, we can look at building a permanent fix similar to the one that currently exists in Fair Scheduler)
Until a permanent fix is developed, here are your options:
The Workaround (taken from Guilherme's answer here)
Here are the steps, and how the workaround using hooks works:
1- on hiveserver2 hosts:
mkdir /usr/hdp/current/hive-client/auxlib/
wget https://github.com/gbraccialli/HiveUtils/raw/master/target/HiveUtils-1.0-SNAPSHOT-jar-with-dependenc... -O /usr/hdp/current/hive-client/auxlib/HiveUtils-1.0-SNAPSHOT-jar-with-dependencies.jar
2- on ambari - hive - Custom hiveserver2-site:
3- restart hiveserver2
4- open beeline. This hook is just an example: it assigns as the queue name the default group of the user used to connect to Hive (check the output of this Linux command on your HiveServer2 host: groups USERNAME). You can implement your own logic in a custom hook like this one: https://github.com/gbraccialli/HiveUtils/blob/master/src/main/java/com/github/gbraccialli/hive/hooks...
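To preview which queue the example hook would likely pick for a user, the "default group" rule from step 4 can be sketched in Python. This is only an approximation of the idea described above, not the hook's actual code: `queue_from_groups_output` is a hypothetical helper, it assumes the first group printed by `groups USERNAME` is the user's default group, and the "default" fallback is my own assumption.

```python
def queue_from_groups_output(output, fallback="default"):
    """Pick the first group listed by `groups USERNAME` as the queue name."""
    # Linux prints "USERNAME : group1 group2 ..."; keep only the group part.
    # Without a colon, the whole string is treated as the group list.
    _, _, group_part = output.rpartition(":")
    groups = group_part.split()
    return groups[0] if groups else fallback


# Made-up output of `groups adi` on a HiveServer2 host.
print(queue_from_groups_output("adi : analysts hadoop"))  # analysts
```

A YARN queue with that name would presumably have to exist already for the job to be accepted.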
I followed the steps above, but am getting this when I try to `show databases`. I am running HDP 2.6.2 Kerberized.
> show databases;
Error: Error while compiling statement: FAILED: ClassNotFoundException com.github.gbraccialli.hive.hooks.UserGroupQueueHook (state=42000,code=40000)
> show tables;
Error: Error while compiling statement: FAILED: ClassNotFoundException com.github.gbraccialli.hive.hooks.UserGroupQueueHook (state=42000,code=40000)
I had to create the ./auxlib directory where I copied the JAR file to. I must be missing a step somewhere.
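One thing that can be ruled out quickly is a bad download. A minimal Python sketch, assuming the JAR sits at the path used in step 1, checks whether the file is a readable archive and actually contains the hook class. `jar_contains_class` is a hypothetical helper, and it only inspects the file; it says nothing about whether HiveServer2 picked up the auxlib directory.

```python
import zipfile


def jar_contains_class(jar_path, class_name):
    """Return True if the jar has a .class entry for the fully qualified name."""
    # com.example.Foo is stored inside the jar as com/example/Foo.class
    entry = class_name.replace(".", "/") + ".class"
    with zipfile.ZipFile(jar_path) as jar:
        return entry in jar.namelist()
```

Calling it with "/usr/hdp/current/hive-client/auxlib/HiveUtils-1.0-SNAPSHOT-jar-with-dependencies.jar" and "com.github.gbraccialli.hive.hooks.UserGroupQueueHook" should return True for a good JAR. A `zipfile.BadZipFile` error would suggest the wget step did not fetch a valid archive.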
Thank you all
I've decided to enable doAs because I do want to see who submits jobs, so everything is fine. Btw, in Tez View there is a "queue" column and it always shows "default", regardless of whether the job is using a specific queue. Does anyone have a clue what that column is for?