Created 07-05-2016 03:34 PM
I am trying to join two wide tables and save the result to a new table. They happen to be CSV on disk...
Now because the tables are very wide I don't really want to manually create all the columns myself - but never fear - they are just the union of the old columns. So I expect to be able to do a Create Table As Select command like this
CREATE TABLE new_table AS select * from source
but it is more like this
CREATE TABLE combined_table AS SELECT * FROM `default.left_table` A1, `default.right_table` B1 where A1.the_key = B1.the_key
or
CREATE TABLE combined_table AS SELECT * FROM (SELECT * `default.left_table` A1, `default.right_table` B1 where A1.the_key = B1.the_key);
Neither work for me. Is this supposed to work in Hive? If I am saving the results of my query to a new table do I always have to know the columns in that result set and create the table with those columns first?
(I am using HDP 2.4.2, so a fairly recent Hive)
Thanks
Created 07-05-2016 04:14 PM
I was about to say its not possible. I assume both tables have a column with the same name which is one of the reasons database schemas often prefix column names with a letter from the table. Like TPCH: lineitem.l_key and orders.o_key.
In this case you would have to suck it up and name all columns in the join using the AS statement.
However it looks like Hive has some cool/dirty tricks up its sleeve. Regular Expressions to specify column names.
Here is my setup:
hive> describe tb1; id int name string hive> describe tb2; id int age int
You then have to disable the use of quotes in identifiers because that interferes with the regex.
hive> set hive.support.quoted.identifiers=none;
And now you can use Java Regex to select all columns from the right table that is NOT the key. So essentially you get all columns but the duplicate. If you have non-join key columns that are duplicate you can exclude them and rename them with the AS statement after:
hive> create table tb3 as select tb1.*, tb2.`(id)?+.+` from tb1, tb2 where tb1.id = tb2.id;
You can see that I select all columns from the left table and then use the `` quotes to specify a regular expression for the columns from the right side I want. The regex is a mean trick essentially asking for any string unless it starts with id ( it says id once or not at all and something following. This means if it matches the full string it will not be matched because the remainder of the regex needs to match something. You could also do (id|id2)?+.+ or (id*)?+.+.
This gives me a result table with all columns from the left table and all columns but the key column from the right table.
hive> describe tb3; id int name string age int
Neat. You never stop learning something new. Hive is really cool.
Edit: Actually made a little article out of the question because these regex would have made my life much easier multiple times before. So giving a bit more attention to it seems to make sense:
Created 07-05-2016 04:14 PM
I was about to say its not possible. I assume both tables have a column with the same name which is one of the reasons database schemas often prefix column names with a letter from the table. Like TPCH: lineitem.l_key and orders.o_key.
In this case you would have to suck it up and name all columns in the join using the AS statement.
However it looks like Hive has some cool/dirty tricks up its sleeve. Regular Expressions to specify column names.
Here is my setup:
hive> describe tb1; id int name string hive> describe tb2; id int age int
You then have to disable the use of quotes in identifiers because that interferes with the regex.
hive> set hive.support.quoted.identifiers=none;
And now you can use Java Regex to select all columns from the right table that is NOT the key. So essentially you get all columns but the duplicate. If you have non-join key columns that are duplicate you can exclude them and rename them with the AS statement after:
hive> create table tb3 as select tb1.*, tb2.`(id)?+.+` from tb1, tb2 where tb1.id = tb2.id;
You can see that I select all columns from the left table and then use the `` quotes to specify a regular expression for the columns from the right side I want. The regex is a mean trick essentially asking for any string unless it starts with id ( it says id once or not at all and something following. This means if it matches the full string it will not be matched because the remainder of the regex needs to match something. You could also do (id|id2)?+.+ or (id*)?+.+.
This gives me a result table with all columns from the left table and all columns but the key column from the right table.
hive> describe tb3; id int name string age int
Neat. You never stop learning something new. Hive is really cool.
Edit: Actually made a little article out of the question because these regex would have made my life much easier multiple times before. So giving a bit more attention to it seems to make sense:
Created 07-06-2016 11:54 AM
Wow, excellent. Thanks. I did not realise the problem was my key had the same name in both tables. Using java regexes as you described solved my problem. Thanks!
Created 07-06-2016 10:01 AM
Hi Alex McLintock, It raise semantic exception which means failed during parsing your query. You cant use CTAS if you have duplicate columns in table. Another options suggested by Benjamin Leonhardi is good if you know your duplicate columns candidate list.
Created 07-06-2016 11:55 AM
Thanks, yes there was a duplicate column (the key in both tables) and I did not realise that was preventing me from using CTAS.