Support Questions

Find answers, ask questions, and share your expertise

Can I do a Hive CTAS on the results of a join?

avatar
Expert Contributor

I am trying to join two wide tables and save the result to a new table. They happen to be CSV on disk...

Now because the tables are very wide I don't really want to manually create all the columns myself - but never fear - they are just the union of the old columns. So I expect to be able to do a Create Table As Select command like this

CREATE TABLE new_table AS select * from source

but it is more like this

CREATE TABLE combined_table AS SELECT * FROM `default.left_table` A1,  `default.right_table` B1 
    where A1.the_key = B1.the_key

or

CREATE TABLE combined_table AS SELECT * FROM (SELECT * `default.left_table` A1, `default.right_table` B1 where A1.the_key = B1.the_key);

Neither work for me. Is this supposed to work in Hive? If I am saving the results of my query to a new table do I always have to know the columns in that result set and create the table with those columns first?

(I am using HDP 2.4.2, so a fairly recent Hive)

Thanks

1 ACCEPTED SOLUTION

avatar
Master Guru

I was about to say its not possible. I assume both tables have a column with the same name which is one of the reasons database schemas often prefix column names with a letter from the table. Like TPCH: lineitem.l_key and orders.o_key.

In this case you would have to suck it up and name all columns in the join using the AS statement.

However it looks like Hive has some cool/dirty tricks up its sleeve. Regular Expressions to specify column names.

Here is my setup:

hive> describe tb1;
id                  	int                 	                    
name                	string              	                    
hive> describe tb2;
id                  	int                 	                    
age                 	int

You then have to disable the use of quotes in identifiers because that interferes with the regex.

hive> set hive.support.quoted.identifiers=none;

And now you can use Java Regex to select all columns from the right table that is NOT the key. So essentially you get all columns but the duplicate. If you have non-join key columns that are duplicate you can exclude them and rename them with the AS statement after:

hive> create table tb3 as select tb1.*, tb2.`(id)?+.+`  from tb1, tb2 where tb1.id = tb2.id;

You can see that I select all columns from the left table and then use the `` quotes to specify a regular expression for the columns from the right side I want. The regex is a mean trick essentially asking for any string unless it starts with id ( it says id once or not at all and something following. This means if it matches the full string it will not be matched because the remainder of the regex needs to match something. You could also do (id|id2)?+.+ or (id*)?+.+.

This gives me a result table with all columns from the left table and all columns but the key column from the right table.

hive> describe tb3;
id                  	int                 	                    
name                	string              	                    
age                 	int    

Neat. You never stop learning something new. Hive is really cool.

Edit: Actually made a little article out of the question because these regex would have made my life much easier multiple times before. So giving a bit more attention to it seems to make sense:

https://community.hortonworks.com/articles/43510/excluding-duplicate-key-columns-from-hive-using-re....

View solution in original post

4 REPLIES 4

avatar
Master Guru

I was about to say its not possible. I assume both tables have a column with the same name which is one of the reasons database schemas often prefix column names with a letter from the table. Like TPCH: lineitem.l_key and orders.o_key.

In this case you would have to suck it up and name all columns in the join using the AS statement.

However it looks like Hive has some cool/dirty tricks up its sleeve. Regular Expressions to specify column names.

Here is my setup:

hive> describe tb1;
id                  	int                 	                    
name                	string              	                    
hive> describe tb2;
id                  	int                 	                    
age                 	int

You then have to disable the use of quotes in identifiers because that interferes with the regex.

hive> set hive.support.quoted.identifiers=none;

And now you can use Java Regex to select all columns from the right table that is NOT the key. So essentially you get all columns but the duplicate. If you have non-join key columns that are duplicate you can exclude them and rename them with the AS statement after:

hive> create table tb3 as select tb1.*, tb2.`(id)?+.+`  from tb1, tb2 where tb1.id = tb2.id;

You can see that I select all columns from the left table and then use the `` quotes to specify a regular expression for the columns from the right side I want. The regex is a mean trick essentially asking for any string unless it starts with id ( it says id once or not at all and something following. This means if it matches the full string it will not be matched because the remainder of the regex needs to match something. You could also do (id|id2)?+.+ or (id*)?+.+.

This gives me a result table with all columns from the left table and all columns but the key column from the right table.

hive> describe tb3;
id                  	int                 	                    
name                	string              	                    
age                 	int    

Neat. You never stop learning something new. Hive is really cool.

Edit: Actually made a little article out of the question because these regex would have made my life much easier multiple times before. So giving a bit more attention to it seems to make sense:

https://community.hortonworks.com/articles/43510/excluding-duplicate-key-columns-from-hive-using-re....

avatar
Expert Contributor

Wow, excellent. Thanks. I did not realise the problem was my key had the same name in both tables. Using java regexes as you described solved my problem. Thanks!

avatar

Hi Alex McLintock, It raise semantic exception which means failed during parsing your query. You cant use CTAS if you have duplicate columns in table. Another options suggested by Benjamin Leonhardi is good if you know your duplicate columns candidate list.

avatar
Expert Contributor

Thanks, yes there was a duplicate column (the key in both tables) and I did not realise that was preventing me from using CTAS.