Created 02-09-2017 04:19 PM
Is there any way to restrict a Hive database to a defined amount of space, or a property to define a space quota for a database in Hive? I want to restrict a database to use only 25% of my HDFS storage, not more than that.
I am also exploring another option: restricting the HDFS folder that serves as the database's location.
Thank you.
Created 02-09-2017 04:20 PM
Update: I checked Ranger as well but did not find any setting related to space limits.
Created 02-09-2017 09:36 PM
I don't understand the use case for limiting a database to X percent. Suppose you have a table that needs more than 25% of the disk space; you won't be able to load it unless you delete data and free up space in that database first.
Also, you cannot set a limit at the Hive level on how much space a database can use. It has to be set as an HDFS quota on the directory backing the database.
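For example, a minimal sketch (the database path and the 250 GB figure are hypothetical; note that an HDFS space quota counts raw bytes, so with replication factor 3 a 250 GB quota holds roughly 83 GB of actual data):

# Set a 250 GB raw-space quota on the directory backing the database (hypothetical path)
hdfs dfsadmin -setSpaceQuota 250g /apps/hive/warehouse/teamdb.db

# Remove the quota later if needed
hdfs dfsadmin -clrSpaceQuota /apps/hive/warehouse/teamdb.db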
-Mahesh
Created 02-09-2017 10:16 PM
Thank you @mbalakrishnan. A simple use case is limiting space for different teams so that no single team can consume the entire cluster, e.g. Team A 25%, Team B 25%, and Team C 50%. Something like YARN capacity queues, but a space quota per team for cost management.
I will look into the HDFS side. Can you point me to some links on HDFS quotas?
Created 02-09-2017 10:24 PM
Here is the link that describes HDFS quotas:
http://hadoop.apache.org/docs/r2.7.2/hadoop-project-dist/hadoop-hdfs/HdfsQuotaAdminGuide.html
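You can also verify a quota and its remaining headroom with the count command described in that guide (hypothetical path):

# Output columns: QUOTA, REM_QUOTA, SPACE_QUOTA, REM_SPACE_QUOTA, DIR_COUNT, FILE_COUNT, CONTENT_SIZE, PATHNAME
hdfs dfs -count -q /apps/hive/warehouse/teamdb.db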
-Mahesh
Created 02-10-2017 07:16 PM
Thank you @mbalakrishnan
Created 02-10-2017 01:28 AM
You can set an HDFS quota; @mbalakrishnan is right that this really doesn't work at the Hive level. Think about it: quotas are set at the directory level. Suppose you cap your Hive warehouse directory at 25% of HDFS storage. So what? I'll just create an external table. If you restrict that external directory, I can create yet another directory and point my external table at it. So here is how you can almost achieve it, with a combination of technology and policy:
1. Assign HDFS quotas to the directories where users are allowed to create tables (the data warehouse directory as well as the external-table directories); see the sketch after this list.
2. Make sure these quotas combined do not exceed 25% of total storage.
3. Establish an organizational policy that Hive tables must be created only under those directories. If people create tables elsewhere, warn them that the data will be deleted.
4. That's it. Enforce your policy.
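A minimal sketch of steps 1 and 2 (all paths and sizes are hypothetical; assume 4 PB of raw HDFS capacity, so 25% is 1 PB, split here as 600 TB for the warehouse and 400 TB for external tables):

# Warehouse directory where managed tables live (hypothetical path)
hdfs dfsadmin -setSpaceQuota 600t /apps/hive/warehouse

# Approved landing directory for external tables (hypothetical path)
hdfs dfsadmin -setSpaceQuota 400t /data/external

# Space quotas count raw bytes: with replication factor 3, the 600 TB quota
# holds roughly 200 TB of actual table data.

# Check current usage against both quotas
hdfs dfs -count -q /apps/hive/warehouse /data/external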
Created 02-10-2017 02:14 PM
@mqureshi Thank you. Makes sense.