
Capacity scheduler queue mapping while doAs disabled


Expert Contributor

Hello

Our Hive is configured with LDAP auth & doAs disabled (as described in scenario 1 in this lovely article http://hortonworks.com/blog/best-practices-for-hive-authorization-using-apache-ranger-in-hdp-2-2/ )

This means that all jobs submitted through HiveServer2 run as the user "hive". In the Tez view it is plain to see that every job is submitted by "hive"; to see the real end user who authenticated, we use the audit in Ranger. The problem is that if all jobs run as user "hive", queue mapping is irrelevant. Does this mean that everyone who wants to use queue mapping needs to enable doAs? Or am I missing something here? Adi

7 REPLIES

Re: Capacity scheduler queue mapping while doAs disabled

@Adi Jabkowsky

See this https://docs.hortonworks.com/HDPDocuments/HDP2/HDP-2.3.2/bk_performance_tuning/content/section_set_u...

When doAs is set to false, queries execute as the Hive user rather than the end user. When multiple queries run as the Hive user, they can share resources; otherwise, YARN does not allow resources to be shared across different users. When the Hive user executes all of the queries, a Tez session that was opened for one query and is still holding resources can reuse those resources for the next query without re-allocation.
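For reference, the relevant hiveserver2-site properties look roughly like this (the values below are only illustrative, not a recommendation for your cluster):

hive.server2.enable.doAs=false
hive.server2.tez.initialize.default.sessions=true
hive.server2.tez.default.queues=default
hive.server2.tez.sessions.per.default.queue=2

With doAs disabled, the pre-initialized Tez sessions are started as the hive user and reused across queries, which is where the resource-sharing benefit comes from.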


Re: Capacity scheduler queue mapping while doAs disabled

Expert Contributor

@Neeraj Sabharwal Thank you for the references. I read them, but I still don't quite understand. I have configured queues according to the best practices, which means allocating resources per department (hence per user) and leaving a default queue with minimal resources so that jobs that are not configured get only minimal resources. (Quote from the second URL you posted: "The "default" queue should be left with minimal capacity to ensure jobs that aren't configured, get only minimal resources".) But because "hive" is the user for all jobs (due to doAs being disabled), all of the jobs end up in the default queue.
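For illustration, the pattern in my capacity-scheduler config is along these lines (queue names and percentages are placeholders, not my exact values):

yarn.scheduler.capacity.root.queues=default,marketing,finance
yarn.scheduler.capacity.root.default.capacity=5
yarn.scheduler.capacity.root.marketing.capacity=50
yarn.scheduler.capacity.root.finance.capacity=45
yarn.scheduler.capacity.queue-mappings=g:marketing:marketing,g:finance:finance

The queue-mapping rules (u:user:queue or g:group:queue) are matched against the submitting user, and with doAs disabled that user is always "hive", so none of the department mappings ever apply.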

Re: Capacity scheduler queue mapping while doAs disabled

Unfortunately that is the case. If you want queue mapping you need to set doAs to true. Hive queries can still be directed to queues with the set tez.queue.name parameter, though, but that is a manual step.
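For example, per session in beeline (the queue and table names below are just placeholders):

set tez.queue.name=marketing;
-- every query in this session now runs in the "marketing" queue
select count(*) from my_table;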

Re: Capacity scheduler queue mapping while doAs disabled

Rising Star

@Adi Jabkowsky

Following the answers provided by @Benjamin Leonhardi & @Neeraj Sabharwal - here is some further clarification and a possible workaround for you.

When hive impersonation is set to false (as it is in your case), all user jobs submitted to HiveServer2 run as the 'hive' user and fall into whichever YARN queue the hive user has been mapped to.

As you've seen, we recommend this approach for:

  1. Column level security via Ranger - to properly enforce column level security in Hive, users must not be able to access the HDFS directories that store the Hive table data (otherwise this would open a loophole that users could work around).
  2. Improving Hive performance when using Tez (as documented in the Hortonworks performance tuning docs)

The drawback here is that end users aren't correctly mapped to YARN queues - which poses a problem for managing a multi-tenant cluster. (I've previously posted this as a feature suggestion here, hoping that if there is enough interest we can look at building a permanent fix similar to the one that currently exists in the Fair Scheduler.)

Until a permanent fix is developed, here are your options:

  1. If you don't need column level security in Hive (and are happy with table/database level security), you can set doAs to true in Hive and manage all security via HDFS policies on the directories that hold the Hive data.
  2. If you want to keep doAs disabled (for the benefits listed above), @Guilherme Braccialli has come up with a workaround that uses a Hive hook to automatically map end users to the correct queue when a query is run.

The Workaround (taken from Guilherme's answer here)

Here are the steps and how the hook-based workaround works:

1- on hiveserver2 hosts:

mkdir /usr/hdp/current/hive-client/auxlib/
wget https://github.com/gbraccialli/HiveUtils/raw/master/target/HiveUtils-1.0-SNAPSHOT-jar-with-dependencies.jar -O /usr/hdp/current/hive-client/auxlib/HiveUtils-1.0-SNAPSHOT-jar-with-dependencies.jar

2- in Ambari > Hive > Custom hiveserver2-site, add:

hive.semantic.analyzer.hook=com.github.gbraccialli.hive.hooks.UserGroupQueueHook

3- restart hiveserver2

4- open beeline. This hook is just an example: it sets the queue name to the default group of the user that connected to Hive (check the output of the Linux command on your hiveserver2 host: groups USERNAME). You can implement your own logic in a custom hook like this one: https://github.com/gbraccialli/HiveUtils/blob/master/src/main/java/com/github/gbraccialli/hive/hooks/UserGroupQueueHook.java
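For reference, a minimal sketch of what a hook along these lines could look like (this is not Guilherme's actual code; the package and class names are made up, and it assumes Hive's AbstractSemanticAnalyzerHook API plus Hadoop's UserGroupInformation group lookup):

package com.example.hive.hooks;

import org.apache.hadoop.hive.ql.parse.ASTNode;
import org.apache.hadoop.hive.ql.parse.AbstractSemanticAnalyzerHook;
import org.apache.hadoop.hive.ql.parse.HiveSemanticAnalyzerHookContext;
import org.apache.hadoop.hive.ql.parse.SemanticException;
import org.apache.hadoop.security.UserGroupInformation;

// Illustrative only: maps the authenticated end user's primary OS group to a Tez queue.
public class GroupQueueHook extends AbstractSemanticAnalyzerHook {

  @Override
  public ASTNode preAnalyze(HiveSemanticAnalyzerHookContext context, ASTNode ast)
      throws SemanticException {
    // The end user who authenticated to HiveServer2, not the "hive" service user.
    String user = context.getUserName();
    if (user != null) {
      // Resolves groups the same way "groups USERNAME" does on the HiveServer2 host.
      String[] groups = UserGroupInformation.createRemoteUser(user).getGroupNames();
      if (groups.length > 0) {
        // Route this query's Tez DAG to the queue named after the primary group.
        context.getConf().set("tez.queue.name", groups[0]);
      }
    }
    return ast;
  }
}

Compile something like this into a jar, drop it in the auxlib directory as in step 1, and point hive.semantic.analyzer.hook at your class instead.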


Re: Capacity scheduler queue mapping while doAs disabled

New Contributor

I followed the steps above, but am getting this when I try to `show databases`. I am running HDP 2.6.2 Kerberized.

> show databases;
Error: Error while compiling statement: FAILED: ClassNotFoundException com.github.gbraccialli.hive.hooks.UserGroupQueueHook (state=42000,code=40000)
> show tables;
Error: Error while compiling statement: FAILED: ClassNotFoundException com.github.gbraccialli.hive.hooks.UserGroupQueueHook (state=42000,code=40000)

I had to create the ./auxlib directory myself before copying the JAR file into it. I must be missing a step somewhere.

Re: Capacity scheduler queue mapping while doAs disabled

Expert Contributor

Thank you all

I've decided to enable doAs because I do want to see who submits jobs, so everything is fine. By the way, in the Tez View there is a "queue" column and it always shows "default", regardless of whether the job is using a specific queue. Does anyone have a clue what that column is for?

Adi
