Created 12-11-2017 06:24 PM
Hi,
I would to select a partitioned table (by YEAR, MONTH, DAY), but instead of writing "WHERE YEAR='2017' AND MONTH='12' AND DAY='11'", I would like make a join from this table to a table that contains each field YEAR, MONTH, DAY.
SELECT * FROM mypartitionedtable t1 INNER JOIN currentpartitiontable t2 ON t1.YEAR=t2.YEAR etc. etc.
But when I am doing an EXPLAIN EXTENDED, I see the analyzer will fetch every partition...
Is there something I missed ?
Thanks 🙂
Created 12-11-2017 11:01 PM
Hi @Sebastien F,
Couple of things needs to be considered to achieve this, as the optimizer will not be knowing till the execution of what are the values from "currentpartitiontables" (t2) is going to produced to generate the optimal explain plan.
Thus, we need to make sure that Cost base optimizer (CBO) is aware of the data profile to ensure optimal explain plan,
On the same context we know that t2 is very small, hence the optimizer should opt for map side join using the t2.
MAPJOINs are processed by loading the smaller table into an in-memory hash map and matching keys with the larger table as they are streamed through, we can provide the hints to hive so that it will be fully aware of the data demographics, two ways we can do this With Hints
select /*+ MAPJOIN(t2) */ count(*) from mypartitionedtable t1 INNER JOIN currentpartitiontable t2 on t2 ON t1.YEAR=t2.YEAR etc. etc.
With the Auto Join Conversion
set hive.auto.convert.join=true; //When auto join is enabled, there is no longer a need to provide the map-join hints in the query. The auto join option can be enabled with two configuration parameters: set hive.auto.convert.join.noconditionaltask = true; set hive.auto.convert.join.noconditionaltask.size = 10000000;
Ensure that statistics are updated to so that optimizer will across the data demographics.
the other alternative is to pass the values pragmatically(much easier in some cases to achieve),
If the values are fewer ( year,month and day ) the simplistic approach is to pragmatically retrive the values and pass as variables in user program ( might not be possible in all the cases ex: BI Utilities)
more on the join optimization can be found in hive Language manual
Hope this helps !!
Created 12-11-2017 11:01 PM
Hi @Sebastien F,
Couple of things needs to be considered to achieve this, as the optimizer will not be knowing till the execution of what are the values from "currentpartitiontables" (t2) is going to produced to generate the optimal explain plan.
Thus, we need to make sure that Cost base optimizer (CBO) is aware of the data profile to ensure optimal explain plan,
On the same context we know that t2 is very small, hence the optimizer should opt for map side join using the t2.
MAPJOINs are processed by loading the smaller table into an in-memory hash map and matching keys with the larger table as they are streamed through, we can provide the hints to hive so that it will be fully aware of the data demographics, two ways we can do this With Hints
select /*+ MAPJOIN(t2) */ count(*) from mypartitionedtable t1 INNER JOIN currentpartitiontable t2 on t2 ON t1.YEAR=t2.YEAR etc. etc.
With the Auto Join Conversion
set hive.auto.convert.join=true; //When auto join is enabled, there is no longer a need to provide the map-join hints in the query. The auto join option can be enabled with two configuration parameters: set hive.auto.convert.join.noconditionaltask = true; set hive.auto.convert.join.noconditionaltask.size = 10000000;
Ensure that statistics are updated to so that optimizer will across the data demographics.
the other alternative is to pass the values pragmatically(much easier in some cases to achieve),
If the values are fewer ( year,month and day ) the simplistic approach is to pragmatically retrive the values and pass as variables in user program ( might not be possible in all the cases ex: BI Utilities)
more on the join optimization can be found in hive Language manual
Hope this helps !!
Created 12-17-2017 04:43 PM
Thanks a lot @bkosaraju it really helps me 🙂