Support Questions

Find answers, ask questions, and share your expertise

[HIVE] select a partitioned table and specify partition through a join

avatar

Hi,

I would to select a partitioned table (by YEAR, MONTH, DAY), but instead of writing "WHERE YEAR='2017' AND MONTH='12' AND DAY='11'", I would like make a join from this table to a table that contains each field YEAR, MONTH, DAY.

SELECT * FROM mypartitionedtable t1 INNER JOIN currentpartitiontable t2 ON t1.YEAR=t2.YEAR etc. etc.

But when I am doing an EXPLAIN EXTENDED, I see the analyzer will fetch every partition...

Is there something I missed ?

Thanks 🙂

1 ACCEPTED SOLUTION

avatar
Super Collaborator

Hi @Sebastien F,

Couple of things needs to be considered to achieve this, as the optimizer will not be knowing till the execution of what are the values from "currentpartitiontables" (t2) is going to produced to generate the optimal explain plan.

Thus, we need to make sure that Cost base optimizer (CBO) is aware of the data profile to ensure optimal explain plan,

On the same context we know that t2 is very small, hence the optimizer should opt for map side join using the t2.

MAPJOINs are processed by loading the smaller table into an in-memory hash map and matching keys with the larger table as they are streamed through, we can provide the hints to hive so that it will be fully aware of the data demographics, two ways we can do this With Hints

    select /*+ MAPJOIN(t2) */ count(*) from
    mypartitionedtable t1 INNER JOIN currentpartitiontable t2 on t2 ON t1.YEAR=t2.YEAR etc. etc.
     

    With the Auto Join Conversion

    set hive.auto.convert.join=true;
    
    //When auto join is enabled, there is no longer a need to provide the map-join hints in the query. The auto join option can be enabled with two configuration parameters:
    
    set hive.auto.convert.join.noconditionaltask = true;
    set hive.auto.convert.join.noconditionaltask.size = 10000000;
    

    Ensure that statistics are updated to so that optimizer will across the data demographics.

    the other alternative is to pass the values pragmatically(much easier in some cases to achieve),

    If the values are fewer ( year,month and day ) the simplistic approach is to pragmatically retrive the values and pass as variables in user program ( might not be possible in all the cases ex: BI Utilities)

    more on the join optimization can be found in hive Language manual

    Hope this helps !!

    View solution in original post

    2 REPLIES 2

    avatar
    Super Collaborator

    Hi @Sebastien F,

    Couple of things needs to be considered to achieve this, as the optimizer will not be knowing till the execution of what are the values from "currentpartitiontables" (t2) is going to produced to generate the optimal explain plan.

    Thus, we need to make sure that Cost base optimizer (CBO) is aware of the data profile to ensure optimal explain plan,

    On the same context we know that t2 is very small, hence the optimizer should opt for map side join using the t2.

    MAPJOINs are processed by loading the smaller table into an in-memory hash map and matching keys with the larger table as they are streamed through, we can provide the hints to hive so that it will be fully aware of the data demographics, two ways we can do this With Hints

      select /*+ MAPJOIN(t2) */ count(*) from
      mypartitionedtable t1 INNER JOIN currentpartitiontable t2 on t2 ON t1.YEAR=t2.YEAR etc. etc.
       

      With the Auto Join Conversion

      set hive.auto.convert.join=true;
      
      //When auto join is enabled, there is no longer a need to provide the map-join hints in the query. The auto join option can be enabled with two configuration parameters:
      
      set hive.auto.convert.join.noconditionaltask = true;
      set hive.auto.convert.join.noconditionaltask.size = 10000000;
      

      Ensure that statistics are updated to so that optimizer will across the data demographics.

      the other alternative is to pass the values pragmatically(much easier in some cases to achieve),

      If the values are fewer ( year,month and day ) the simplistic approach is to pragmatically retrive the values and pass as variables in user program ( might not be possible in all the cases ex: BI Utilities)

      more on the join optimization can be found in hive Language manual

      Hope this helps !!

      avatar

      Thanks a lot @bkosaraju it really helps me 🙂