Sentry + Hive + Kerberos resource management

Rising Star

Hi

 

I have enabled Sentry to work with HiveServer2 with Kerberos authentication. Therefore, impersonation on HiveServer2 is turned off.
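(For reference, I believe this is the standard HiveServer2 setting that controls it; a sketch of the relevant hive-site.xml entry in my setup:)

    <!-- Impersonation disabled, as Sentry requires; every query
         submitted through HS2 then executes as the 'hive' user. -->
    <property>
      <name>hive.server2.enable.doAs</name>
      <value>false</value>
    </property>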

Now all queries are run as 'hive' from the Hue Hive UI and from the Oozie Hive action.

 

How does resource management (YARN resource pools) work in this case? I want jobs to go into the right pool, but right now all Hive jobs are going into the root.hive pool.
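(For example, I would expect to be able to steer a query into a specific pool from beeline with something like the following; the pool and table names are placeholders:)

    -- Route the MapReduce jobs Hive generates to an explicit pool.
    -- 'root.myteam' and 'my_table' are placeholder names.
    SET mapreduce.job.queuename=root.myteam;
    SELECT COUNT(*) FROM my_table;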

 

The same thing happens with Impala when using Llama: all Impala jobs go into the root.llama pool.

 

Thank you

Ben


17 REPLIES

Hi Ben,

The Hive+Sentry issue sounds like an issue that was fixed in CDH 5.2.1 and CDH 5.3.0+. What version of CDH are you using? Are you seeing the same problem for jobs launched from the command line (as a user other than hive, of course), or only ones launched through Hue and Oozie?

I'm not sure if we had any releases where Llama was known to have this issue.

Thanks,
Darren

Rising Star

Sorry for the late response, Darren.

 

I'm using CDH 5.4.1.

 

This doesn't happen from the command line. If I'm authenticated as ben in the shell environment, then the job gets submitted as ben.

 

In the Hue+Oozie environment, if I submit a workflow job, the Oozie job launcher gets submitted as the authenticated user ben. However, the actual Hive job gets submitted as the hive user.

 

Thank you.

Ben

Rising Star

What's the issue-tracking URL for the 5.2.1 fix? I can't find it on Google 😞

When you ran from the command line, did you use "hive" or "beeline"? I forgot to clarify that you should test using the "beeline" client so it goes through HS2 and fully integrates with Sentry. This is also more similar to how Hue works (it talks to HS2).
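For example, something like this (the host, port, and Kerberos principal below are placeholders for your environment):

    # Connect through HS2 with Kerberos; replace the host, port, and
    # principal with your cluster's actual values.
    beeline -u "jdbc:hive2://hs2-host.example.com:10000/default;principal=hive/hs2-host.example.com@EXAMPLE.COM"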

When using Sentry, you are supposed to disable impersonation for HS2, which means that all jobs will be submitted as user "hive". When looking up permissions in Sentry and/or deciding which YARN pool to run in, however, it should use the submitting user, not "hive". So it isn't necessarily a problem that the hive job is submitted as the hive user.

Thanks,
Darren

Rising Star

I tested with both hive and beeline, and running from the command line works as intended: jobs get assigned to the correct user/group queues.

 

Can you explain why it's OK for Hive jobs to get submitted as the 'hive' user?

 

We have four different teams using Cloudera, and it gets difficult to manage resources if all Hive jobs go to the "root.hive" queue. And since the "root.hive" queue has limited resources allocated, most Hive jobs will fail.

 

This is our job history:

 

Application ID | User | Name | Type | Queue | Start Time | Finish Time | State | Final Status
application_1436195699910_0031 | hive | INSERT INTO TABLE ... (Stage-1) | MAPREDUCE | root.hive | Mon Jul 6 15:44:38 -0500 2015 | Mon Jul 6 15:45:11 -0500 2015 | FINISHED | SUCCEEDED
application_1436195699910_0030 | ben | oozie:launcher:T=hive2:W=JobName:A=hive2-6df2:ID=0000004-150706101622653-oozie-oozi-W | MAPREDUCE | root.infra | Mon Jul 6 15:44:22 -0500 2015 | Mon Jul 6 15:45:21 -0500 2015 | FINISHED | SUCCEEDED

 

Other workflow actions such as Sqoop/Pig run in the correct user/group queues.

 

I think this is a problem with our cluster configuration; please point us in the right direction 🙂

 

Thank you for your help.

Ben

Super Collaborator

The fact that the job runs as the hive user is correct: impersonation is turned off when you turn on Sentry (at least, that is what should have been done), so the hive user is the user that executes the job.

However, the end user should still be used to determine which queue the application is submitted to (if you use the FairScheduler). This does require some configuration on your side. There is a Knowledge Base article in our support portal on how to set that up for CM and non-CM clusters; search for "Hive FairScheduler".

 

 

I remember already providing the steps for CM on the forum before:

 

  1. Log in to Cloudera Manager
  2. Navigate to Cluster > YARN > Instances > ResourceManager > Processes
  3. Click the fair-scheduler.xml link; this will open a new tab or window
  4. Copy the contents into a new file called fair-scheduler.xml
  5. On the HiveServer2 host, create a new directory to store the XML file (for example, /etc/hive/fsxml)
    Note: This file should not be placed in the standard Hive configuration directory, since that directory is managed by Cloudera Manager and the file could be removed when changing other configuration settings.
  6. Upload the fair-scheduler.xml file to the directory created above
  7. In Cloudera Manager navigate to Cluster > Hive > Service-Wide > Advanced > Hive Service Advanced Configuration Snippet (Safety Valve) for hive-site.xml and add the following property:
    <property>
      <name>yarn.scheduler.fair.allocation.file</name>
      <value>/etc/hive/fsxml/fair-scheduler.xml</value>
    </property>
  8. Save changes
  9. Restart the Hive Service

 NOTE: you must have the following rule as the first rule in the placement policy:

<rule name="specified" />
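For illustration, a minimal placement policy honoring that note might look like this (everything after the "specified" rule is just an example, not a prescription):

    <!-- Sketch of a queuePlacementPolicy for fair-scheduler.xml.
         "specified" must come first so that a queue explicitly
         requested for the job wins; the later rules are illustrative. -->
    <queuePlacementPolicy>
      <rule name="specified" />
      <rule name="primaryGroup" create="false" />
      <rule name="default" queue="root.default" />
    </queuePlacementPolicy>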

 

Wilfred

Rising Star

Tara!

Thank you very much for your help. Now I understand that the job runs as the hive user but goes to the designated queue. After following your steps it worked 🙂

 

Initially I had changed the placement rules on the resource pools and did not have the "specified" rule as the first rule.

 

Do I need to replace the local /etc/hive/fsxml/fair-scheduler.xml every time I make changes to the "Dynamic Resource Pools"? I'm using a CM-managed cluster.

 

Best,

Ben

Until this bug is fixed, yes, you'll need to replace /etc/hive/fsxml/fair-scheduler.xml every time you change YARN's copy of fair-scheduler.xml.
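One rough way to script that refresh (a sketch only; the host name and the CM agent process path are assumptions to verify on your cluster):

    # Pull the most recently generated fair-scheduler.xml from the
    # ResourceManager host's CM agent process directory to the HS2 host.
    # rm-host.example.com and the glob below are placeholders/assumptions.
    ssh rm-host.example.com \
      'cat "$(ls -t /var/run/cloudera-scm-agent/process/*RESOURCEMANAGER*/fair-scheduler.xml | head -1)"' \
      > /etc/hive/fsxml/fair-scheduler.xml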

Thanks,
Darren

New Contributor

Hi, we have a similar issue and are wondering if the steps listed above are the resolution.

 

We have our cluster Kerberized and we also deployed Sentry; as part of the Hive setup we disabled impersonation, so all Hive queries are executed by the hive user.
We configured dynamic resource pools, setting up three queues: HighPriority, LowPriority, and Default.
Everybody can submit jobs to the Default queue; that is working as expected.
The HighPriority and LowPriority queues are managed by membership in two different AD groups.

I assigned a test user to both groups so it could submit jobs to both queues (HighPriority, LowPriority). When I submitted a job, we got the following error message:

ERROR : Job Submission failed with exception 'java.io.IOException(Failed to run job : User hive cannot submit applications to queue root.HighPriority)'
java.io.IOException: Failed to run job : User hive cannot submit applications to queue root.HighPriority

This is correct, because the hive user is not a member of either of those groups.
I modified the submission access control to add the hive user to the pool, and this time the job completed. However, that breaks the access control model we are trying to implement, because now all Hive users can make use of both pools even though they don't belong to any of the AD groups that are supposed to control who can submit jobs to the pool.
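Roughly, the change corresponds to something like this in the allocation file (a sketch from memory; 'hipri_admins' is a placeholder for our AD group, and aclSubmitApps uses the "users groups" format):

    <!-- Sketch of the modified pool ACL; adding the 'hive' user is
         what opens the pool to all Hive queries. -->
    <queue name="HighPriority">
      <aclSubmitApps>hive hipri_admins</aclSubmitApps>
    </queue>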

Is there a way to control which users can submit to specific resource pools in Hive and leverage the AD groups created for this purpose?