One common problem in SQL is that you join two tables and get duplicate column names from the two tables. When you now for example want to create a CTAS you will get "duplicate column name" errors. You also often want to exclude the join key from the result set since it is by definition duplicate.
Database schemas often prefix column names with a letter from the table to fix at least the first issue. Like TPCH: lineitem.l_key and orders.o_key.
A common approach to fix this is to explicitely specify all column names in your SELECT list.
However it looks like Hive has some cool/dirty tricks up its sleeve to make this easier. Regular Expressions to specify column names.
You then have to disable the use of quotes in identifiers because that interferes with the regex.
And now you can use Java Regex to select all columns from the right table that is NOT the key. So essentially you get all columns but the duplicate. If you have non-join key columns that are duplicate you can exclude them and rename them with the AS statement after:
You can see that I select all columns from the left table and then use the `` quotes to specify a regular expression for the columns from the right side I want to use.The regex is essentially asking for any string unless it starts with id ( it essentially means "the string id once or not at all ( ?+) and any string following). This means if the whole string is id it will not be matched because the remainder of the regex needs to match something.
You could also specify multiple columns: (id|id2)?+.+ or (id*)?+.+.
This gives me a result table with all columns from the left table and all columns but the key column from the right table.