Hive impersonation and Sentry

Expert Contributor

HiveServer2 impersonation must be turned off. HiveServer2 impersonation lets users execute queries and access HDFS files as the connected user rather than as the super user. Access policies are applied at the file level using the HDFS permissions specified in ACLs (access control lists). Enabling HiveServer2 impersonation bypasses Sentry from the end-to-end authorization process. Specifically, although Sentry enforces access control policies on tables and views within the Hive warehouse, it does not control access to the HDFS files that underlie the tables. This means that users without Sentry permissions to tables in the warehouse may nonetheless be able to bypass Sentry authorization checks and execute jobs and queries against tables in the warehouse as long as they have permissions on the HDFS files supporting the table.

The above text is from the documentation. I just wonder why "Enabling HiveServer2 impersonation bypasses Sentry from the end-to-end authorization process". Can anyone give some advice? Thanks.

7 REPLIES

Super Guru
The point of Sentry is to allow only users with specific permissions to access certain things. To do that, Sentry needs to manage everything itself rather than leaving it to end users.

This is why the Hive warehouse directory needs to be owned by "hive:hive" with permissions 771, so that no end user can modify anything that Hive and Sentry control.

With impersonation enabled, files are created as the end user and owned by that user. The "hive" user is then unable to manage those files and directories, and the end user has direct access to them. That breaks what Sentry is designed to do.
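To make the ownership and mode argument concrete, here is a minimal sketch of an owner/group/other permission check. It is a simplification of the HDFS permission model, assuming no ACLs and no superuser, and the function name and the user "alice" are made up for illustration.

```python
import stat


def allowed(user, user_groups, owner, group, mode, want):
    """Simplified owner/group/other check (no ACLs, no superuser).

    `want` is a string made of the letters "r", "w" and "x".
    """
    if user == owner:
        bits = (mode >> 6) & 0o7      # owner permission bits
    elif group in user_groups:
        bits = (mode >> 3) & 0o7      # group permission bits
    else:
        bits = mode & 0o7             # permission bits for everyone else
    needed = 0
    for ch in want:
        needed |= {"r": 4, "w": 2, "x": 1}[ch]
    return (bits & needed) == needed


WAREHOUSE_MODE = 0o771  # owner rwx, group rwx, others x only

# Show the mode the way a directory listing would render it.
print(stat.filemode(stat.S_IFDIR | WAREHOUSE_MODE))  # drwxrwx--x

# The "hive" user owns the directory and can do everything.
print(allowed("hive", {"hive"}, "hive", "hive", WAREHOUSE_MODE, "rwx"))

# "alice" (hypothetical, not in the hive group) can neither list nor write.
print(allowed("alice", {"analysts"}, "hive", "hive", WAREHOUSE_MODE, "r"))
print(allowed("alice", {"analysts"}, "hive", "hive", WAREHOUSE_MODE, "w"))
```

If a file under the warehouse were instead owned by "alice", as happens with impersonation, the first branch would apply to her and the mode bits alone would decide her access, without Sentry being consulted.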

Hope the above makes sense.

Cheers

New Contributor
But this makes things tougher from the admin's perspective: jobs submitted from Hue run as the "hive" user on YARN.
Also, most users create external tables for their work and store them under their own HDFS paths, so setting the path ownership to user:usergrp prevents the non-impersonated "hive" user (from Hue) from writing to that path.

So do we have to set ACLs for everyone every time?
And will the ownership of every subdirectory change?
What if the user is running Beeline? Do we still change the path ownership to hive:hive?

Super Guru
Hi Sona,

>>> But this makes things tougher from the admin's perspective: jobs submitted from Hue run as the "hive" user on YARN.

The jobs are submitted to the queue that is configured in the cluster, so resources can still be controlled based on the end users, not the "hive" user.

>>> Also, most users create external tables for their work and store them under their own HDFS paths, so setting the path ownership to user:usergrp prevents the non-impersonated "hive" user (from Hue) from writing to that path.

All HDFS paths that store data for Hive databases/tables should be owned by "hive", and permissions for end users should be handled through Sentry HDFS sync: you grant permissions to end users in Sentry, and the ACLs are synced to HDFS. So everything is managed by Hive/Sentry, and Hive/Sentry gives permissions to end users.

>>> So do we have to set ACLs for everyone every time?
You can set it up at the database level, so there is no need to set it for every table.

>>> And will the ownership of every subdirectory change?
Yes

>>> What if the user is running Beeline? Do we still change the path ownership to hive:hive?

After enabling Sentry, you should have switched to Beeline already. Hive CLI is deprecated and will not work properly in a Sentry-enabled environment.

Hope the above helps.

Cheers
Eric

New Contributor

Hi Eric ,

Thanks for the reply,
(1) In the resource pool, submission access control is set by group name. When a user from the group submits a job through Hue, YARN shows "hive" as the user who submitted the job; only after the job has completed can I see who actually submitted it. Also, if it is a Spark job or another huge job, I am unable to alert the user before killing the job, which makes it very tough to monitor. So how can I clearly see who submitted the job when it shows "hive" everywhere?

(2) Hive databases are stored in /user/hive/warehouse/db*, yet users are creating tables as *external tables* in their own HDFS path /Project/Alpha/Table/.., and in that path users are divided by *dev*sit*prd etc. Besides external tables, other files are also stored at the same path. So are you suggesting that I leave the ownership as hive:hive everywhere and let the Sentry role define who can access what?


Regards,
Sona


I have the same case, where the queries run as the "hive" user. Is there any way to detect which user ran a query from Hue?

We have created a generic user that is shared among a group of analysts, but we are not able to detect who ran a query. Is there any way to identify who ran the query from the Hue sessions or the IP address of the client?

Super Guru
Hi @vinodnerella,

As @Sona mentioned, after the job finishes you can find out which user ran it through the YARN applications list page in Cloudera Manager. While the job is running, you can find the actual user by checking the YARN application's configuration setting called "hive.sentry.subject.name".

If you go through the RM (ResourceManager), click the "Configuration" link on the left side of the job details page. If you go through Hue, open the job details page, go to the "Metadata" tab and search for "hive.sentry.subject.name". This setting stores the original user who submitted the job, because impersonation is turned off once Sentry is enabled.

Of course, this only works in a Sentry-enabled environment.
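If you would rather not click through the UI, the same lookup can be scripted. The sketch below assumes you have already saved the job's configuration in the usual Hadoop configuration XML layout (a list of property elements, each with a name and a value); how you obtain that file is not covered here. The helper name and the sample content, including the user "alice" and the queue name, are made up for illustration.

```python
import xml.etree.ElementTree as ET

SUBJECT_KEY = "hive.sentry.subject.name"


def find_property(conf_xml, key):
    """Return the value of `key` from Hadoop-style configuration XML, or None."""
    root = ET.fromstring(conf_xml)
    for prop in root.iter("property"):
        if prop.findtext("name") == key:
            return prop.findtext("value")
    return None


# Invented sample for illustration, not real cluster output.
SAMPLE = """<configuration>
  <property>
    <name>mapreduce.job.queuename</name>
    <value>root.analysts</value>
  </property>
  <property>
    <name>hive.sentry.subject.name</name>
    <value>alice</value>
  </property>
</configuration>"""

# Prints the original submitter recorded in the sample configuration.
print(find_property(SAMPLE, SUBJECT_KEY))
```

The function returns None when the property is absent, which is what you would expect for jobs from a cluster where Sentry is not enabled.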

Cheers
Eric

Super Guru
@Sona,

Sorry I missed your question in May.

For (1), please refer to my previous update.

For (2), yes: all paths that store Hive databases/tables should be managed by Hive/Sentry, so those paths should be listed under the Sentry Synchronization Path Prefixes setting and need to be owned by "hive:hive". The idea of Sentry is to have everything managed by "hive" so that no one can make direct modifications without going through Hive/Sentry.

Cheers
Eric