Created 03-09-2017 12:55 PM
I would like to create a hive database. I thought every database is created under /apps/hive/warehouse/ but now i see that i can create a schema with the property "location". All tables under my database are 'External tables'. Is there a disadvantage if i create a database anyhwere else than /apps/hive/warehouse/?
Created 03-09-2017 09:21 PM
The directory for the database is mostly used to store table data, so there isn't a lot of difference if you are only going to use external tables. Is there any reason you want to use a different directory?
Created 03-16-2017 08:06 AM
Hi @sergey,
thanks for your answer. We have a usecase dependent directory structure where all data for a usecase is stored. The reason for that is the answer from @Ed Berezitsky. Every business unit has its own database with data stored in the usecase folder. So in my opinion it is better to have a external database because all data is stored external rather than having an internal database without any underlying tables.
What do you think about that?
Created 03-10-2017 02:27 PM
It's not about "better". Technically, there is no difference. But in a lot of companies, there are naming conventions for dirs, so you have to create DB in locations as per requirements. Also, it's a bit easier to manage security policies based on some structure of dirs instead of having flattened dirs in save location.
Created 03-10-2017 03:36 PM
Hi @Florian Rabl. There is a very good discussion on internal and external tables - and what happens when you do or don't specify a location. It's located here - give it a look.
https://community.hortonworks.com/questions/63991/hive-location.html
My personal opinion is that only Internal tables should be stored in /apps/hive/warehouse, and that external tables should always be stored in a different location. But there is no hard & fast rule - I only favor that because it marks a clear line where internal and external tables are stored, and makes it easy to know just by looking at the filesystem if a table is internal or external.
Also don't forget that internal and external table data is treated differently when you issue a 'DROP TABLE' statement: HDFS data corresponding to the internal table is deleted, while HDFS data corresponding to the External table is not deleted. So external table data (in /apps/hive/warehouse) would not be deleted, and ends up being a 'straggler' in the filesystem when you drop other Internal tables or databases.
Created 03-10-2017 11:42 PM
There are also some restrictions, for example the target for the CTAS command (CREATE TABLE t1 AS SELECT ...) cannot be an external table. And Ranger plugin for Hive works best on managed (non-external) tables. I usually use external tables for raw data which can land anywhere on HDFS, and then create managed ORC tables from external ones.