Created 08-09-2016 02:31 PM
I am new to this, so I just want to understand how it works. If I connect to Hive using beeline from the command prompt and then list the available databases (show databases), I see a different set from what I get if I connect by typing hive directly on the edge node.
Can someone please explain why these are different?
Created 08-09-2016 02:46 PM
Hello,
You'll find some useful information on: https://docs.hortonworks.com/HDPDocuments/HDP2/HDP-2.4.2/bk_dataintegration/content/beeline-vs-hive-...
Essentially beeline would use the JDBC/Thrift (or alternatively HTTP) protocol to communicate with HiveServer2.
HiveServer2 then handles the hive logic (finding the table definition in the metastore, reading the data from HDFS, etc).
On the other hand, the hive shell accesses the Hive metastore and HDFS data directly, bypassing HiveServer2.
The big influence that this difference can have in your situation is security.
Hive security is implemented in HiveServer2, thus Hive shell bypasses any Hive access policies you might have set on specific databases using Ranger or SQL based authorization (only HDFS policies apply in this case).
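To make the connection path concrete, here is a minimal Python sketch that assembles a beeline command line for HiveServer2. The host name, user name, and the helper `beeline_command` are hypothetical; the port 10000 and the `httpPath=cliservice` value are the usual defaults but should be checked against your own cluster configuration.

```python
import shlex


def beeline_command(host, port=10000, database="default",
                    user=None, query="SHOW DATABASES", http=False):
    """Build an argv list for a beeline call against HiveServer2.

    All parameter values here are illustrative assumptions.
    """
    url = f"jdbc:hive2://{host}:{port}/{database}"
    if http:
        # HTTP transport instead of binary Thrift; httpPath must match
        # the server-side setting (cliservice is the common default).
        url += ";transportMode=http;httpPath=cliservice"
    argv = ["beeline", "-u", url]
    if user:
        argv += ["-n", user]
    argv += ["-e", query]
    return argv


# Hypothetical host and user, for illustration only.
cmd = beeline_command("hs2.example.com", user="analyst1")
print(shlex.join(cmd))
```

Running plain `hive` on the edge node involves no such URL at all, which is one way to see that it never goes through HiveServer2.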
Created 08-09-2016 02:42 PM
@Khera Hive CLI is the legacy client, while beeline is the new client that will replace it. One of the main differences is that beeline is a JDBC client that connects to HS2 (HiveServer2), while Hive CLI does not.
There are several aspects to this, but I would like to highlight security: Hive CLI is not subject to HS2 authorization, while beeline is subject to the HS2 authorization layer.
Ultimately, Hive CLI is going to be deprecated in favor of beeline. Read more here: https://cwiki.apache.org/confluence/display/Hive/Replacing+the+Implementation+of+Hive+CLI+Using+Beel...
Regards,
Felix
Created 08-09-2016 02:45 PM
Thanks for the reply. Shouldn't hive cli and beeline return the same set of available databases? I just want to understand the possible reasons why I am getting a different set.
Created 08-09-2016 02:48 PM
@Khera The differences you see are most probably caused by the different authorization methods used. Hive CLI relies only on HDFS authorization (POSIX permissions), so it will list all the databases that the calling user has read permission on. Beeline, in contrast, is subject to HS2 authorization grants in addition to HDFS authorization.
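The effect described above can be pictured with a small toy model in Python. Everything in it is made up for illustration (database names, permission sets, function names), and it simplifies the beeline side to the HS2 grants alone; it is a sketch of the idea, not of how Hive actually evaluates policies.

```python
# Toy model only: all names and permission data are hypothetical.
all_dbs = {"default", "sales", "staging", "finance"}

# Databases whose warehouse directories the user can read in HDFS.
hdfs_readable = {"default", "sales", "staging"}

# Databases the user is granted through HS2 policies (Ranger or SQL-based).
hs2_granted = {"default", "finance"}


def visible_via_hive_cli(dbs, readable):
    """Hive CLI path: only HDFS permissions decide what is listed."""
    return sorted(db for db in dbs if db in readable)


def visible_via_beeline(dbs, granted):
    """Beeline path: HS2 authorization decides what is listed."""
    return sorted(db for db in dbs if db in granted)


print(visible_via_hive_cli(all_dbs, hdfs_readable))  # ['default', 'sales', 'staging']
print(visible_via_beeline(all_dbs, hs2_granted))     # ['default', 'finance']
```

Because the two checks are independent, neither list has to be a subset of the other.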
Created 08-10-2016 08:44 PM
Thanks Felix Albani. Just to make sure I understand: users have different permissions, and because Hive CLI and beeline are authorized differently, I see different results. However, through Sqoop I am still able to use the databases that are not listed when I use beeline. So I assume Sqoop also relies on HDFS authorization, is that correct? I also understand from the comments that we should avoid using Hive CLI, but is there any way, from a security standpoint, to block it, for example by not allowing users to run the hive command directly?
Thanks
Created 08-09-2016 02:54 PM
Created 08-09-2016 04:20 PM
Thanks for sharing this.
Created 08-10-2016 08:48 PM
Sunile Manjee Is there any way we can block users from using Hive CLI, or is it just a matter of practice? We are encouraged to use beeline, but is access to Hive CLI normal from a security standpoint, or are we missing something? And if we use Sqoop, we are able to perform operations on databases that are not even listed when we use beeline. How does this work?
Thanks
Created 08-11-2016 08:22 PM
Well, this is the basis of security in Hadoop.
In a nutshell, the following separate authorization policies apply:
Authorization is meaningless without authentication (Kerberos), because otherwise anyone can impersonate any other user, including the admin hdfs user.
You can think of HiveServer2 and beeline as similar to how a "normal" database operates: a process, plus a user that owns that process and all the files the process writes. In this case hive is the user owning all files under /apps/hive/warehouse.
But in Hadoop other users can also write those files, via Pig, Sqoop, Hive CLI, etc, bypassing the HiveServer2 "database service".
So the only way to prevent that is by using HDFS permissions, for example by not allowing the user who runs Sqoop or Hive CLI to access some of the Hive database folders. That would be meaningless without Kerberos, though, as anyone can become the hdfs user. You also cannot really block Hive CLI, since anyone with shell access can still run the hdfs command to read those database files.
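As a rough illustration of how such HDFS permissions decide access, the Python sketch below applies the POSIX-style owner/group/other check to a directory mode. The function name, users, groups, and mode are hypothetical, and the sketch ignores ACLs and the HDFS superuser, who bypasses these checks.

```python
import stat


def can_list_dir(mode, owner, group, user, user_groups):
    """Return True if `user` may list and enter a directory.

    Only one permission class applies: owner, else group, else other.
    Listing and traversing a directory needs both read and execute bits.
    """
    if user == owner:
        needed = stat.S_IRUSR | stat.S_IXUSR
    elif group in user_groups:
        needed = stat.S_IRGRP | stat.S_IXGRP
    else:
        needed = stat.S_IROTH | stat.S_IXOTH
    return (mode & needed) == needed


# Hypothetical database folder owned by hive:hadoop with mode 750.
mode = 0o750
print(can_list_dir(mode, "hive", "hadoop", "etl_svc", {"hadoop"}))     # True
print(can_list_dir(mode, "hive", "hadoop", "analyst1", {"analysts"}))  # False
```

With a mode like this, a user outside the owning group cannot read the folder through Sqoop, Hive CLI, or the hdfs command, whichever client is used.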
You can also think in terms of network access and type of users.
For example, the users running sqoop or hdfs commands are data engineers, data scientists, or a scheduled oozie service user; they would normally have access to most of the data and have shell access to the edge node or other nodes.
Other users, those who normally consume the data (for example analysts using Tableau), would not have shell access and would only have access to the HiveServer2 port, so enforcing permissions is easier in their case.
By default there is no Hive authentication, but with this specific access pattern you could configure LDAP authentication for HiveServer2 (or Knox) alone, without needing Kerberos, as these types of users cannot reach the cluster other than through the HiveServer2 port anyway.