Support Questions

Find answers, ask questions, and share your expertise

Hive on Tez Pushdown Predicate doesn't work in view using window function on partitioned table

avatar

Using Hive on Tez running this query against this view causes a full table scan even though there is a Partition on regionid and id. This query in Cloudera Impala takes 0.6s to complete and using Hortonworks Data Platform and Hive on Tez it takes 800s. I've come to the conclusion that in Hive on Tez using a window function prevents the predicate to be pushed down to the inner select causing the full table scan.

    CREATE VIEW latestposition AS
    WITH t1 AS (
      SELECT *, ROW_NUMBER() OVER ( PARTITION BY regionid, id, deviceid order by ts desc) AS rownos FROM positions 
    )
    SELECT *
    FROM t1
    WHERE rownos = 1;
 
    SELECT * FROM latestposition WHERE  regionid='1d6a0be1-6366-4692-9597-ebd5cd0f01d1' and id=1422792010 and deviceid='6c5d1a30-2331-448b-a726-a380d6b3a432';

I've tried joining this table to itself using the MAX function to get the latest record, it works and finishes in a few seconds but it still is too slow for my use case. Also if I remove the window function the predicate gets pushed down and this will return in milliseconds. If anyone has any ideas it would be much appreciated.

1 ACCEPTED SOLUTION

avatar
Expert Contributor

@Ryan Tomczik: I can confirm this as a bug, even without views.

Filed https://issues.apache.org/jira/browse/HIVE-12808

Please have a look & see if that describes the issue clearly.

View solution in original post

9 REPLIES 9

avatar

@Ryan Tomczik Can you add the output to explain extended, i.e.

set explain_level=extended;
explain SELECT * FROM latestposition WHERE  regionid='1d6a0be1-6366-4692-9597-ebd5cd0f01d1' and id=1422792010 and deviceid='6c5d1a30-2331-448b-a726-a380d6b3a432';

avatar
hive> set explain_level=extended; 
hive> explain SELECT * FROM latestposition WHERE  regionid='1d6a0be1-6366-4692-9597-ebd5cd0f01d1' and id=1422792010 and deviceid='6c5d1a30-2331-448b-a726-a380d6b3a432'; 
OK
Plan not optimized by CBO.
Vertex dependency in root stage
Reducer 2 <- Map 1 (SIMPLE_EDGE)
Stage-0
   Fetch Operator
      limit:-1
      Stage-1
         Reducer 2
         File Output Operator [FS_10]
            compressed:false
            Statistics:Num rows: 3 Data size: 3264 Basic stats: COMPLETE Column stats: COMPLETE
            table:{"input format:":"org.apache.hadoop.mapred.TextInputFormat","output format:":"org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat","serde:":"org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe"}
            Select Operator [SEL_4]
               outputColumnNames:["_col0","_col1","_col2","_col3","_col4","_col5","_col6","_col7","_col8","_col9","_col10","_col11"]
               Statistics:Num rows: 3 Data size: 3264 Basic stats: COMPLETE Column stats: COMPLETE
               Filter Operator [FIL_11]
                  predicate:((((ROW_NUMBER_window_0 = 1) and (_col9 = '1d6a0be1-6366-4692-9597-ebd5cd0f01d1')) and (_col10 = 1422792010)) and (_col0 = '6c5d1a30-2331-448b-a726-a380d6b3a432')) (type: boolean)
                  Statistics:Num rows: 3 Data size: 3264 Basic stats: COMPLETE Column stats: COMPLETE
                  PTF Operator [PTF_3]
                     Function definitions:[{"Input definition":{"type:":"WINDOWING"}},{"order by:":"_col10(DESC)","name:":"windowingtablefunction","partition by:":"_col9, _col10, _col0"}]
                     Statistics:Num rows: 19535311 Data size: 21254418368 Basic stats: COMPLETE Column stats: COMPLETE
                     Select Operator [SEL_2]
                     |  outputColumnNames:["_col0","_col1","_col2","_col3","_col4","_col5","_col6","_col7","_col8","_col9","_col10"]
                     |  Statistics:Num rows: 19535311 Data size: 21254418368 Basic stats: COMPLETE Column stats: COMPLETE
                     |<-Map 1 [SIMPLE_EDGE] vectorized
                        Reduce Output Operator [RS_12]
                           key expressions:regionid (type: string), id (type: int), deviceid (type: string)
                           Map-reduce partition columns:regionid (type: string), id (type: int), deviceid (type: string)
                           sort order:+-+
                           Statistics:Num rows: 19535311 Data size: 17581916530 Basic stats: COMPLETE Column stats: COMPLETE
                           value expressions:datetime (type: string), radius (type: string), x (type: string), y (type: string), classifier (type: string), categoryid (type: string), accountid (type: string), mapid (type: string)
                           TableScan [TS_0]
                              alias:pos_regions_orc2
                              Statistics:Num rows: 19535311 Data size: 17581916530 Basic stats: COMPLETE Column stats: COMPLETE
Time taken: 0.774 seconds, Fetched: 37 row(s)

avatar
Expert Contributor

The ideal rewrite for such a query is

select * from latestposition where regionid='1d6a0be1-6366-4692-9597-ebd5cd0f01d1'and id=1422792010and deviceid='6c5d1a30-2331-448b-a726-a380d6b3a432' order by ts limit 1;

is the table partitioned on ts?

avatar
New Contributor

Hive currently doesn't push predicates in to row_number; but it does push it into rank.

May be you want to try rewriting view definition as

WITH t1 AS ( SELECT *, rank() OVER ( PARTITION BY regionid, id, deviceid order by ts desc) AS rownos FROM positions ) SELECT distinct * FROM t1 WHERE rownos < 2;

This will cause Hive to push limit in to Windowing Fns.

avatar

I've tried this and it still doesn't seem to push down into the Windowing Function. I'm still getting the full table scan at the beginning of my execution plan.

avatar
New Contributor

You need to set hive.limit.pushdown.memory.usage;

may be se to 0.3.

This would still result in Table scan but the data that needs to be munched by other operators in the pipeline would be low.

avatar

I tried "set hive.limit.pushdown.memory.usage=0.3;" and running the using Rank but it still take 800s.

Thanks for taking up this ticket so quickly! Once you finish your patch will it be available on HIVE-12808? If so, would it be possible to clone the Hive repo and apply your patch and compile and test it?

avatar
Expert Contributor

@Ryan Tomczik: I can confirm this as a bug, even without views.

Filed https://issues.apache.org/jira/browse/HIVE-12808

Please have a look & see if that describes the issue clearly.

avatar
New Contributor

Hi @Ryan Tomczik

We also faced a similar issue and this approach helped me to push down subquery even having "WITH" clause. This even works in Hive 1.2.1

https://stackoverflow.com/questions/13523049/hive-sql-find-the-latest-record

The limitation is that this is

a) only useful for MAX logic.

b) We can't have the flexibility of having windowing different ORDERS, .ie. col1 desc, col2 asc, col3 desc etc

Thanks,

Sarath