Created on 06-14-2017 02:49 AM - edited 08-17-2019 09:21 PM
I am working with the HDP2.3 Rev6 VM for a self paced course.
I am getting the below errors for the same query, when using aliases.
select sum(ordertotal), year(order_date) from orders group by year(order_date)
this query works fine.
But if I use aliases, it fails.
Am I missing something?
Regards,
Anirban.
Created 06-14-2017 07:33 PM
While some databases like MySql and Teradata allow using column alias in GROUP BY clause, Hive does not allow this to the best of my knowledge.
Conceptually, GROUP BY processing happens before SELECT list computation, so it would be circular to allow such references.
To understand this better, use the EXPLAIN feature in Hive for your query. That will give you a logical breakdown of your query and will reveal how it is being processed & in what order.
Here is the link from official wiki - https://cwiki.apache.org/confluence/display/Hive/LanguageManual+Explain
The above link demonstrates group by example and you will see GROUP BY processing happens before SELECT, thereby leading to the error you are facing.
Workaround
If you still want to write queries where you want to group by derived column, you can do it using this nifty technique.
Set the following property:
SET hive.groupby.orderby.position.alias=true;
This will allow you to group by columns based on the position in the select clause.
So instead of writing your query like this:
select sum(ordertotal), year(order_date) from orders group by year(order_date)
You can write the group by clause using position as shown below:
select sum(ordertotal), year(order_date) from orders group by 2
This will get you the desired result without having to repeat "year(order_date)" in group by clause.
As always, if this answers helps you, please consider accepting it.
Created 06-14-2017 07:33 PM
While some databases like MySql and Teradata allow using column alias in GROUP BY clause, Hive does not allow this to the best of my knowledge.
Conceptually, GROUP BY processing happens before SELECT list computation, so it would be circular to allow such references.
To understand this better, use the EXPLAIN feature in Hive for your query. That will give you a logical breakdown of your query and will reveal how it is being processed & in what order.
Here is the link from official wiki - https://cwiki.apache.org/confluence/display/Hive/LanguageManual+Explain
The above link demonstrates group by example and you will see GROUP BY processing happens before SELECT, thereby leading to the error you are facing.
Workaround
If you still want to write queries where you want to group by derived column, you can do it using this nifty technique.
Set the following property:
SET hive.groupby.orderby.position.alias=true;
This will allow you to group by columns based on the position in the select clause.
So instead of writing your query like this:
select sum(ordertotal), year(order_date) from orders group by year(order_date)
You can write the group by clause using position as shown below:
select sum(ordertotal), year(order_date) from orders group by 2
This will get you the desired result without having to repeat "year(order_date)" in group by clause.
As always, if this answers helps you, please consider accepting it.
Created 06-15-2017 07:04 AM
thank you @Dinesh Chitlangia
that explains it very well.
Created 06-14-2017 08:22 PM
Couldnt Agree More with @Dinesh Chitlangia
Same is the case with WHERE clause.
If you have alias in WHERE clause we will facing similarerror.
Reason being here clause is evaluated before the select clause,
select col1 as c1, col2 as c2 from test_table where c1 = "ABC";
we make it work by re tweaking above query as
select * from ( select col1 as c1, col2 as c2 from test_table ) t1 where c1 = "ABC";