Support Questions

Find answers, ask questions, and share your expertise

What's really happening in a Hive view that joins two tables?

avatar
Contributor

Say you have a two tables connected with a meaningless key, i.e., every row from one of the tables corresponds 1:1 to a row in the other table. Maybe the pairs of rows started life in a single table, but it was convenient to store them in two for some reason. To use the data in its complete form you create a view, V, that unites the data, i.e., the join of T1 and T2 where T1.key=T2.key. When you use this view in a query, the result is as if you actually did this join to create a temp table called V, then used V in the query. What you see seems to be different. Hive is free to rearrange the operations more efficiently so long as the result is the same. What does it actually do and how is the strategy determined? This has some implications for us because we'd like to split our data up along lines of requirements for security. What are the implications of this strategy--is the data in it's joined form ever left laying around somewhere exposed?

1 ACCEPTED SOLUTION

avatar
Super Guru

@Peter Coates

Views in Hive today are purely logical. That means there is no physical data laying around (at least not today) for a view. So when you create a view, all you are really doing is making it easier to write future queries on top of that view or in your case creating views to help with compliance and policies. Once a view is created, you can create access policies for that view on who should have access to the view. This is in addition to the policies you may have at table level. Of course if someone have access to T1 and T2 then restricting View permissions is quite meaning less.

In short, no data is laying around for a view after a query completes (almost). I have seen a scenario where temp files created by hive during a query were not being deleted due to query failure. Check this link.

Following link should answer your question in more details.

https://cwiki.apache.org/confluence/display/Hive/SQL+Standard+Based+Hive+Authorization

View solution in original post

1 REPLY 1

avatar
Super Guru

@Peter Coates

Views in Hive today are purely logical. That means there is no physical data laying around (at least not today) for a view. So when you create a view, all you are really doing is making it easier to write future queries on top of that view or in your case creating views to help with compliance and policies. Once a view is created, you can create access policies for that view on who should have access to the view. This is in addition to the policies you may have at table level. Of course if someone have access to T1 and T2 then restricting View permissions is quite meaning less.

In short, no data is laying around for a view after a query completes (almost). I have seen a scenario where temp files created by hive during a query were not being deleted due to query failure. Check this link.

Following link should answer your question in more details.

https://cwiki.apache.org/confluence/display/Hive/SQL+Standard+Based+Hive+Authorization