Community Articles

bleonhardi · ‎07-05-2016

One common problem in SQL is that you join two tables and get duplicate column names from the two tables. When you now for example want to create a CTAS you will get "duplicate column name" errors. You also often want to exclude the join key from the result set since it is by definition duplicate.

Database schemas often prefix column names with a letter from the table to fix at least the first issue. Like TPCH: lineitem.l_key and orders.o_key.

A common approach to fix this is to explicitely specify all column names in your SELECT list.

However it looks like Hive has some cool/dirty tricks up its sleeve to make this easier. Regular Expressions to specify column names.

Testsetup:

describe tb1;
id                  	int
name                	string

describe tb2;
id                  	int
age			int

You then have to disable the use of quotes in identifiers because that interferes with the regex.

set hive.support.quoted.identifiers=none;

And now you can use Java Regex to select all columns from the right table that is NOT the key. So essentially you get all columns but the duplicate. If you have non-join key columns that are duplicate you can exclude them and rename them with the AS statement after:

create table tb3 asselect tb1.*, tb2.`(id)?+.+`from tb1, tb2 where tb1.id = tb2.id;

You can see that I select all columns from the left table and then use the `` quotes to specify a regular expression for the columns from the right side I want to use.The regex is essentially asking for any string unless it starts with id ( it essentially means "the string id once or not at all ( ?+) and any string following). This means if the whole string is id it will not be matched because the remainder of the regex needs to match something.

You could also specify multiple columns: (id|id2)?+.+ or (id*)?+.+.

This gives me a result table with all columns from the left table and all columns but the key column from the right table.

describe tb3;
id                  	int
name                	string
age                 	int

Hive is really cool.

mkumar13 · ‎07-06-2016

Thanks for a great tip, we can also specify multiple columns like this tb2.`(id1)?+(id2)?+.+`

bleonhardi · ‎07-06-2016

You mean to exclude two columns? That one would definitely work: (id1|id2)?+.+ Your version would say id1 once or not at all followed by id2 once or not at all followed by anything else. So should work too I think.

Cloudera Community

Community Articles

Excluding Duplicate Key Columns from Hive using Regular Expressions

Apache Hive

Re: Excluding Duplicate Key Columns from Hive using Regular Expressions

Re: Excluding Duplicate Key Columns from Hive using Regular Expressions