Support Questions

Find answers, ask questions, and share your expertise
Announcements
Celebrating as our community reaches 100,000 members! Thank you!

Irregularities in Select query

avatar
Contributor

I am working with the HDP2.3 Rev6 VM for a self paced course.

I am getting the below errors for the same query, when using aliases.

16299-hive-select.jpg

select sum(ordertotal), year(order_date) from orders group by year(order_date)

this query works fine.

But if I use aliases, it fails.

Am I missing something?

Regards,

Anirban.

1 ACCEPTED SOLUTION

avatar

While some databases like MySql and Teradata allow using column alias in GROUP BY clause, Hive does not allow this to the best of my knowledge.

Conceptually, GROUP BY processing happens before SELECT list computation, so it would be circular to allow such references.

To understand this better, use the EXPLAIN feature in Hive for your query. That will give you a logical breakdown of your query and will reveal how it is being processed & in what order.

Here is the link from official wiki - https://cwiki.apache.org/confluence/display/Hive/LanguageManual+Explain

The above link demonstrates group by example and you will see GROUP BY processing happens before SELECT, thereby leading to the error you are facing.

Workaround

If you still want to write queries where you want to group by derived column, you can do it using this nifty technique.

Set the following property:

SET hive.groupby.orderby.position.alias=true;

This will allow you to group by columns based on the position in the select clause.

So instead of writing your query like this:

select sum(ordertotal), year(order_date) from orders group by year(order_date)

You can write the group by clause using position as shown below:

select sum(ordertotal), year(order_date) from orders group by 2

This will get you the desired result without having to repeat "year(order_date)" in group by clause.

As always, if this answers helps you, please consider accepting it.

View solution in original post

3 REPLIES 3

avatar

While some databases like MySql and Teradata allow using column alias in GROUP BY clause, Hive does not allow this to the best of my knowledge.

Conceptually, GROUP BY processing happens before SELECT list computation, so it would be circular to allow such references.

To understand this better, use the EXPLAIN feature in Hive for your query. That will give you a logical breakdown of your query and will reveal how it is being processed & in what order.

Here is the link from official wiki - https://cwiki.apache.org/confluence/display/Hive/LanguageManual+Explain

The above link demonstrates group by example and you will see GROUP BY processing happens before SELECT, thereby leading to the error you are facing.

Workaround

If you still want to write queries where you want to group by derived column, you can do it using this nifty technique.

Set the following property:

SET hive.groupby.orderby.position.alias=true;

This will allow you to group by columns based on the position in the select clause.

So instead of writing your query like this:

select sum(ordertotal), year(order_date) from orders group by year(order_date)

You can write the group by clause using position as shown below:

select sum(ordertotal), year(order_date) from orders group by 2

This will get you the desired result without having to repeat "year(order_date)" in group by clause.

As always, if this answers helps you, please consider accepting it.

avatar
Contributor

thank you @Dinesh Chitlangia

that explains it very well.

avatar
Super Collaborator

Couldnt Agree More with @Dinesh Chitlangia

Same is the case with WHERE clause.

If you have alias in WHERE clause we will facing similarerror.

Reason being here clause is evaluated before the select clause,

select col1 as c1, col2 as c2 from test_table where c1 = "ABC";

we make it work by re tweaking above query as

select * from ( select col1 as c1, col2 as c2 from test_table ) t1 where c1 = "ABC";