Created 09-21-2021 08:22 PM
Is it possible for Hive Metastore to support multi-tenancy?
I'm trying to build a multi-tenant feature in Hive Metastore. It means that there will be multiple user and each user can create databases, tables, etc. Also, the database or tables created by that user cannot be accessed by another users.
I've thought maybe by providing an authentication to Hive will give a multi-tenancy feature. But after tried it with Kerberos, it seems that different users cannot create database with the same name, for example when user1 have a database named dbA then user2 cannot create database named dbA. I've not tried using authorization tools like Ranger, but it looks like it will not give me multi-tenancy feature (different user can create database with the same name).
There are several solutions that I can think of, that is:
- Each user have their own Metastore. It seems inefficient to have a lot of hive metastore, but it is the most straight forward solution to me.
- Include username in database name. Database name will have this format: <username>_<database-name>. But I'm not so sure about this one.
I've been searching about this topic but found very little information. The official Hive documentation doesn't even mention about it, or maybe I missed it (?). Or are there any other method/solution to implement multi-tenant feature in Hive Metastore?
Thank you
Created 09-24-2021 01:45 PM
@mailsmail If you are planning to user a single metastore, then you could not create two databases by same name. So, as you pointed out, users could create separate schemas, and if required they could create tables by same names under those schemas:
user1:
create database user1db1 [LOCATION]
create table user1db1.tbl1 ...
user2:
create database user2db1 [LOCATION]
create table user2db1.tbl1 ...
Created on 09-25-2021 06:42 AM - edited 09-25-2021 06:43 AM
@smruti thanks for the reply. So, recently i found this Jira issue about adding catalog feature to Hive https://issues.apache.org/jira/browse/HIVE-18755 . It seems that with catalog, it is possible to create a database with the same name under different catalog. It's a new feature, I think. Added to the newest Hive (version 3). But, I can't find any doc for how to use it. Looking to the thrift file, there's a function to manage the catalog (create, update, delete, and get). But there's no function the get all databases under that catalog. Maybe because it's a new feature, so it's still under development. Or, maybe I'm misunderstood the purpose of this feature, or I missed the docs.
I also want to ask, does any authentication and authorization tools like Kerberos or Ranger do not affect the creation of a database with the same name even though the user is different?
Created 09-25-2021 11:57 AM
@mailsmail I am afraid authorisation tools might not be able to help us here. HIVE-18755 is talking about having two separate catalogs named hive and spark, under Hive metastore db for Hive and Spark services respectively.
Here, you could either create databases with different names, or have multiple metastore instances, pointing them to different HMS databases. Then, you will need to have individual HS2 instance(with help of Config groups) connecting to separate Metastores.
Created 09-25-2021 05:33 AM
You can have below approach:
1. create a separate schema in metastore db named user2
2. Add a hivemetastore pointing to same MySQL but different db user2 on hostA
3. Add a separate hiveserver2 on hostA
4. Create a config group and add the hostA
5. Override metastore uri for hiveserver2
This is one way you can achieve multitenancy.Your databases in hive would remain independent. Please let us know if you have additional quries.
Created 09-25-2021 06:58 AM
@asish Ah, I see. That means I don't have to create/run new MySQL instance for each metastore.
Is it possible for one hiveserver2 to connect to multiple metastore, so all user will use the same endpoint? or do I have to run another hiveserver2 for each metastore?
Created 09-25-2021 07:09 AM
@mailsmail This is not tested yet. You can create a POC environment and test.
I don't have to create/run new MySQL instance for each metastore.
Ans: A separatevschema is required. Avoid using hive/metastore as this would be by-default. use a different database/schema.
Is it possible for one hiveserver2 to connect to multiple metastore, so all user will use the same endpoint? or do I have to run another hiveserver2 for each metastore?
Ans: You can have one hiveserver2 to connect to multiple HMS by updating "hive.metastore.uris" field.
You need to create a separate config group for another HMS with different database and also you need a different hiveserver2 and update "hive.metastore.uris" accordingly
Please let me know,if you have any queries or accept it as solution,if your queries are answered.
Created on 09-25-2021 07:27 AM - edited 09-25-2021 07:28 AM
@asish When hiveserver2 connect to multiple metastore, how does it know which metastore to use?
Let's say, a user run "SHOW DATABASES;" using beeline and connected to hiveserver2. Does it return all DBs from all metastore?
Another case, both metastore A and metastore B have the same database name, dbC. When user run "SHOW TABLES FROM dbC", will it return all tables from dbC in metastore A, B, or both?
Created 09-25-2021 09:59 AM
Lets take an example
Hiveserver1 ==> HMS ==> Mysqldatabase with schema(hive) ==> Contains databases as students(student database is in hive not MySQL)
Hiveserver2 ==> HMS ==> Mysqldatabase with schema(test) ==> Contains databases as employee(employee database is in hive not MySQL)]
When connected to beeline ==> hiveserve1 ==> Show databases would yield students
When connected to beeline ==> hiveserve2 ==> Show databases would yield employees
This must be performed using config groups in ambari.
In CM it must be used using role instances.