Connecting hive - Beeline vs hive?

Contributor

I am new to this, so I just want to understand how this works. If I connect to Hive using Beeline on the command prompt and then look at the available databases (show databases), I see a different set from what I get if I connect by typing hive directly on the edge node.

Can someone please explain why these are different?
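
For context, this is roughly what I am doing (the HiveServer2 host, port and user below are placeholders for my actual values):

    # Option 1: Beeline, which connects to HiveServer2 over JDBC
    beeline -u "jdbc:hive2://hs2-host.example.com:10000/default" -n myuser -e "show databases;"

    # Option 2: the Hive CLI, started directly on the edge node
    hive -e "show databases;"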


10 REPLIES


@Khera Hive CLI is the legacy client, while Beeline is the new client that will replace it. One of the main differences is that the Beeline JDBC client connects to HS2 (HiveServer2), while the Hive CLI does not.

There are several aspects to this, but I would highlight security: the Hive CLI is not subject to HS2 authorization, while Beeline goes through the HS2 authorization layer.

Ultimately, the Hive CLI is going to be deprecated in favor of Beeline. Read more here: https://cwiki.apache.org/confluence/display/Hive/Replacing+the+Implementation+of+Hive+CLI+Using+Beel...

Regards,

Felix

Contributor

Thanks for the reply. Shouldn't the Hive CLI and Beeline return the same set of available databases? I just want to understand the possible reasons I am getting different sets.


@Khera The differences you see are most probably caused by the different authorization methods used. The Hive CLI relies only on HDFS authorization (POSIX permissions), so it will list all the databases the calling user has read access to. Beeline, on the other hand, is subject to HS2 authorization grants in addition to HDFS authorization.
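
As a rough illustration, what the Hive CLI can see is basically driven by what the calling user can read under the warehouse directory (the path below is the usual HDP default and may be different on your cluster):

    # Databases live as directories under the Hive warehouse; the POSIX permissions
    # (user/group/other) on these directories are what the Hive CLI is subject to
    hdfs dfs -ls /apps/hive/warehouse

    # Permissions on one specific database directory (name is illustrative)
    hdfs dfs -ls -d /apps/hive/warehouse/somedb.db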

Contributor

Thanks Felix Albani. Just to understand better: I get that there are different permissions for the users, and since Hive CLI and Beeline authorization works differently, I see different results. However, through Sqoop I am still able to use databases that are not listed when I use Beeline. So I assume Sqoop also relies on HDFS authorization, is that correct? I also understand from the comments that we should avoid using the Hive CLI, but is there any way, from a security standpoint, to block it, like not allowing users to run the hive command directly?

Thanks

Super Collaborator

Hello,

You'll find some useful information on: https://docs.hortonworks.com/HDPDocuments/HDP2/HDP-2.4.2/bk_dataintegration/content/beeline-vs-hive-...

Essentially, Beeline uses the JDBC/Thrift protocol (or alternatively HTTP) to communicate with HiveServer2.
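
For example, both of the following point Beeline at the same HiveServer2, just over different transports (the host and ports are placeholders; the actual ports and httpPath depend on how HS2 is configured):

    # Binary Thrift transport (the default, commonly port 10000)
    beeline -u "jdbc:hive2://hs2-host.example.com:10000/default"

    # HTTP transport (commonly port 10001 with httpPath=cliservice)
    beeline -u "jdbc:hive2://hs2-host.example.com:10001/default;transportMode=http;httpPath=cliservice"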

HiveServer2 then handles the Hive logic (finding the table definition in the metastore, reading the data from HDFS, etc.).

On the other hand, the Hive shell accesses the Hive metastore and HDFS data directly, bypassing HiveServer2.
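
You can see this on the edge node: the Hive shell reads the metastore location straight from the client configuration and then talks to the metastore and HDFS itself (the path and value below are illustrative of a typical HDP layout):

    # The hive CLI picks up the metastore Thrift URI from hive-site.xml
    grep -A 1 "hive.metastore.uris" /etc/hive/conf/hive-site.xml
    #   <name>hive.metastore.uris</name>
    #   <value>thrift://metastore-host.example.com:9083</value>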

The biggest impact this difference has in your situation is security.

Hive security is implemented in HiveServer2, so the Hive shell bypasses any Hive access policies you might have set on specific databases using Ranger or SQL-based authorization (only HDFS policies apply in this case).

Master Guru

@Khera

For all intents and purposes, stop using the Hive CLI. It will soon be deprecated in favor of Beeline (Jira here). It does not integrate with Ranger, so it bypasses the security policies. I recommend you only use Beeline.

Contributor

Thanks for sharing this.

Contributor

@Sunile Manjee Is there any way we can block users from running the Hive CLI, or is it just a matter of practice? We are being pushed to use Beeline, but is access to the Hive CLI acceptable from a security standpoint, or are we missing something? And with Sqoop, we are able to perform operations on databases that are not even listed when we use Beeline. How does this work?

Thanks

Super Collaborator

Well, this is the basis of security in Hadoop.

In a nutshell, the following separate authorization policies apply:

  • beeline -> Hive policies in Ranger (or SQL based authorization)
  • hive cli and sqoop -> HDFS policies in Ranger (or HDFS POSIX permissions)

And authorization would be meaningless without authentication (Kerberos), as anyone could impersonate any other user, including the admin (hdfs) user.
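
For completeness, this is roughly what the Beeline side looks like once Kerberos is enabled (the realm, host and principal below are placeholders):

    # Authenticate first, then pass the HiveServer2 service principal in the JDBC URL
    kinit myuser@EXAMPLE.COM
    beeline -u "jdbc:hive2://hs2-host.example.com:10000/default;principal=hive/_HOST@EXAMPLE.COM"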

You can think of HiveServer2 and Beeline as similar to how a "normal" database operates: a process plus a user owning that process and all the files that process writes; in this case hive is the user owning all the files under /apps/hive/warehouse.

But in Hadoop, other users can also write those files directly, via Pig, Sqoop, the Hive CLI, etc., bypassing the HiveServer2 "database service".
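
That is also why Sqoop can work with databases Beeline does not even list: a classic sqoop import writes to HDFS and the metastore as the calling user, never going through HiveServer2. A sketch (the connection string, credentials, table and database names are all placeholders):

    sqoop import \
      --connect jdbc:mysql://db-host.example.com/sales \
      --username etl_user -P \
      --table orders \
      --hive-import --hive-database some_db_not_listed_in_beeline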

So the only way to prevent that is with HDFS permissions, for example by not allowing the user running Sqoop or the Hive CLI to access some of the Hive database folders. But that would be meaningless without Kerberos, as anyone could become the hdfs user (and you cannot really block the Hive CLI either, since anyone with shell access can run the hdfs command and still read those database files).
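
A sketch of that approach (the paths, database and group names are illustrative, and it assumes HiveServer2 runs queries as the hive service user, i.e. hive.server2.enable.doAs=false, so that locking the directories down to hive does not break HS2 itself):

    # Restrict a sensitive database directory to the hive service user only,
    # so hive CLI / sqoop / plain hdfs access by other users is denied
    hdfs dfs -chown -R hive:hadoop /apps/hive/warehouse/sensitive_db.db
    hdfs dfs -chmod -R 700 /apps/hive/warehouse/sensitive_db.db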

You can also think in terms of network access and the types of users involved.

For example, the users running sqoop or hdfs commands are typically data engineers/scientists, or a scheduled Oozie service user, who would normally have access to most of the data and have shell access to the edge or other nodes.

Other users, the ones that normally consume the data (for example analysts using Tableau), would not have shell access and would only be able to reach the HiveServer2 port, so enforcing the permissions is easier in their case.

By default there is no Hive authentication, but with this specific access pattern you could configure LDAP authentication just for HiveServer2 (or Knox), without needing Kerberos, as these types of users cannot reach the cluster through anything other than the HiveServer2 port anyway.
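
A sketch of that last option, as hive-site.xml properties (the property names are standard HiveServer2 settings, shown here as name = value pairs; the values are placeholders for your LDAP server):

    hive.server2.authentication = LDAP
    hive.server2.authentication.ldap.url = ldap://ldap.example.com:389
    hive.server2.authentication.ldap.baseDN = ou=people,dc=example,dc=com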