Updated information on differences between External and Internal Hive tables?

Rising Star

A user recently asked about locking Hive tables to make sure reads are consistent, and that led me to the Apache documentation on Hive transactions, where I saw the following:

External tables cannot be made ACID tables since the changes on external tables are beyond the control of the compactor (HIVE-13175).

This leads me to wonder whether updated, comprehensive documentation exists on the differences between internal and external tables in Hive. Traditionally, the explanation has been that Hive maintains both the data and the metadata for internal tables, so dropping an internal table drops both, while dropping an external table drops only the metadata; otherwise, the two are functionally equivalent. The note above regarding ACID/transactions suggests that internal and external table capabilities are diverging.
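To make the traditional distinction concrete, here is a minimal HiveQL sketch. The table names and the HDFS path are made up for illustration, and the comments describe the commonly documented behavior rather than anything verified on a particular cluster.

```sql
-- Managed (internal) table: Hive owns both the metadata and the files
-- under its warehouse directory.
CREATE TABLE managed_events (id INT, payload STRING)
STORED AS ORC;

-- External table: Hive records only the metadata; the files live at a
-- location that other applications may also read or write.
-- '/data/tenant_a/events' is a hypothetical path.
CREATE EXTERNAL TABLE external_events (id INT, payload STRING)
STORED AS ORC
LOCATION '/data/tenant_a/events';

-- Dropping the managed table removes the metadata and the data files.
DROP TABLE managed_events;

-- Dropping the external table removes only the metadata; the files
-- under '/data/tenant_a/events' are left in place.
DROP TABLE external_events;
```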

Thoughts?

Thanks in advance!

1 ACCEPTED SOLUTION

Explorer

Hive (internal) tables are meant to be fully managed by Hive, for both data and metadata (schema). That's not true for external tables. With an external table, Hive doesn't own the data per se; it only shares the data as one of possibly several applications. ACID needs complete control of the data (for example, it needs to manage the directory layout, perform compaction, and clean up old files), so we want to avoid potential interference issues.
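As a rough illustration of what this looks like in practice, the sketch below declares a managed table as transactional and then attempts the same on an external table. The table names and path are hypothetical, and the ORC and bucketing clauses assume the requirements of early ACID releases (Hive 1.x/2.x); later versions may differ.

```sql
-- Session settings commonly needed before using Hive transactions.
SET hive.support.concurrency=true;
SET hive.txn.manager=org.apache.hadoop.hive.ql.lockmgr.DbTxnManager;

-- Managed table declared transactional, so Hive controls its
-- directory layout (base and delta files).
CREATE TABLE acid_orders (id INT, status STRING)
CLUSTERED BY (id) INTO 4 BUCKETS
STORED AS ORC
TBLPROPERTIES ('transactional'='true');

-- Row-level changes are allowed on the transactional table.
UPDATE acid_orders SET status = 'shipped' WHERE id = 42;

-- The same property on an external table is expected to be rejected,
-- because Hive does not own the files at the given location.
-- '/data/tenant_a/orders' is a hypothetical path.
CREATE EXTERNAL TABLE ext_orders (id INT, status STRING)
CLUSTERED BY (id) INTO 4 BUCKETS
STORED AS ORC
LOCATION '/data/tenant_a/orders'
TBLPROPERTIES ('transactional'='true');
```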

I understand that in your case the data may be used by Hive only, but the fact that external tables can, by design, be used by anything else is reason enough to enforce disabling ACID on them.

As to your use case, you may want to solve the governance issues by creating different roles or, even better, use Ranger to make things easier: https://docs.hortonworks.com/HDPDocuments/HDP2/HDP-2.3.2/bk_Sys_Admin_Guides/content/ch_hive_auth.ht...
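A minimal sketch of the role-based option, assuming SQL standard based authorization is enabled in HiveServer2 and the statements are run by a user who holds the admin role. The role, user, and table names are invented for illustration.

```sql
-- Role management statements require the admin role to be active.
SET ROLE admin;

-- One role per tenant group (hypothetical name).
CREATE ROLE tenant_a_analyst;

-- Read-only access to a DBA-controlled managed table (hypothetical name).
GRANT SELECT ON TABLE acid_orders TO ROLE tenant_a_analyst;

-- Attach the role to an individual user (hypothetical name).
GRANT tenant_a_analyst TO USER alice;

-- Review what the role is allowed to do.
SHOW GRANT ROLE tenant_a_analyst ON TABLE acid_orders;
```

Ranger policies would express the same intent through its own policy interface rather than through GRANT statements.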

4 REPLIES

Hi @Vincent Romeo, I think I know what's going on here. The issue isn't that there is something technically different between external and internal tables; instead, there is a different design expectation for each. A user will use external tables because they expect the data not to change. In this way, you can have multiple schemas applied to the same data set without fear of any one user deleting or changing the data, and if they decide to drop a table, the data isn't removed.

The sole purpose of ACID is to insert, update, and delete data, so it goes against the basic premise of why you would use external tables. To adhere to this expectation, the developers essentially disabled ACID on external tables; in particular, compaction, which Hive ACID relies on to merge the delta files written by inserts, updates, and deletes, is not run on them.
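For reference, compaction normally runs in the background, but it can also be requested by hand. The sketch below assumes a transactional managed table named acid_orders already exists; the name is hypothetical.

```sql
-- Queue a major compaction, which rewrites the base and delta files
-- of the table into a new base.
ALTER TABLE acid_orders COMPACT 'major';

-- List queued, running, and recently finished compactions.
SHOW COMPACTIONS;
```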

Hope this helps!

Rising Star

Unfortunately, my group was using external tables as an easier way to deal with quotas in a "multi-tenant" cluster and to impose some governance on Hive (i.e., most users/groups can only create external tables, and the files need to be landed in their assigned folder in HDFS; DBAs control internal tables in Hive). Somewhere, we missed the "basic premise" that the data in external tables won't change....

@Vincent Romeo, your use case makes a lot of sense. I don't know for sure, but you might be able to override the setting. Adding Wei to the conversation.

+ @Wei Zheng
