Created 01-05-2016 10:10 PM
Using Hive on Tez running this query against this view causes a full table scan even though there is a Partition on regionid and id. This query in Cloudera Impala takes 0.6s to complete and using Hortonworks Data Platform and Hive on Tez it takes 800s. I've come to the conclusion that in Hive on Tez using a window function prevents the predicate to be pushed down to the inner select causing the full table scan.
CREATE VIEW latestposition AS WITH t1 AS ( SELECT *, ROW_NUMBER() OVER ( PARTITION BY regionid, id, deviceid order by ts desc) AS rownos FROM positions ) SELECT * FROM t1 WHERE rownos = 1; SELECT * FROM latestposition WHERE regionid='1d6a0be1-6366-4692-9597-ebd5cd0f01d1' and id=1422792010 and deviceid='6c5d1a30-2331-448b-a726-a380d6b3a432';
I've tried joining this table to itself using the MAX function to get the latest record, it works and finishes in a few seconds but it still is too slow for my use case. Also if I remove the window function the predicate gets pushed down and this will return in milliseconds. If anyone has any ideas it would be much appreciated.
Created 01-07-2016 11:47 PM
@Ryan Tomczik: I can confirm this as a bug, even without views.
Filed https://issues.apache.org/jira/browse/HIVE-12808
Please have a look & see if that describes the issue clearly.
Created 01-06-2016 12:45 AM
@Ryan Tomczik Can you add the output to explain extended, i.e.
set explain_level=extended; explain SELECT * FROM latestposition WHERE regionid='1d6a0be1-6366-4692-9597-ebd5cd0f01d1' and id=1422792010 and deviceid='6c5d1a30-2331-448b-a726-a380d6b3a432';
Created 01-06-2016 08:14 AM
hive> set explain_level=extended;
hive> explain SELECT * FROM latestposition WHERE regionid='1d6a0be1-6366-4692-9597-ebd5cd0f01d1' and id=1422792010 and deviceid='6c5d1a30-2331-448b-a726-a380d6b3a432';
OK Plan not optimized by CBO. Vertex dependency in root stage Reducer 2 <- Map 1 (SIMPLE_EDGE) Stage-0 Fetch Operator limit:-1 Stage-1 Reducer 2 File Output Operator [FS_10] compressed:false Statistics:Num rows: 3 Data size: 3264 Basic stats: COMPLETE Column stats: COMPLETE table:{"input format:":"org.apache.hadoop.mapred.TextInputFormat","output format:":"org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat","serde:":"org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe"} Select Operator [SEL_4] outputColumnNames:["_col0","_col1","_col2","_col3","_col4","_col5","_col6","_col7","_col8","_col9","_col10","_col11"] Statistics:Num rows: 3 Data size: 3264 Basic stats: COMPLETE Column stats: COMPLETE Filter Operator [FIL_11] predicate:((((ROW_NUMBER_window_0 = 1) and (_col9 = '1d6a0be1-6366-4692-9597-ebd5cd0f01d1')) and (_col10 = 1422792010)) and (_col0 = '6c5d1a30-2331-448b-a726-a380d6b3a432')) (type: boolean) Statistics:Num rows: 3 Data size: 3264 Basic stats: COMPLETE Column stats: COMPLETE PTF Operator [PTF_3] Function definitions:[{"Input definition":{"type:":"WINDOWING"}},{"order by:":"_col10(DESC)","name:":"windowingtablefunction","partition by:":"_col9, _col10, _col0"}] Statistics:Num rows: 19535311 Data size: 21254418368 Basic stats: COMPLETE Column stats: COMPLETE Select Operator [SEL_2] | outputColumnNames:["_col0","_col1","_col2","_col3","_col4","_col5","_col6","_col7","_col8","_col9","_col10"] | Statistics:Num rows: 19535311 Data size: 21254418368 Basic stats: COMPLETE Column stats: COMPLETE |<-Map 1 [SIMPLE_EDGE] vectorized Reduce Output Operator [RS_12] key expressions:regionid (type: string), id (type: int), deviceid (type: string) Map-reduce partition columns:regionid (type: string), id (type: int), deviceid (type: string) sort order:+-+ Statistics:Num rows: 19535311 Data size: 17581916530 Basic stats: COMPLETE Column stats: COMPLETE value expressions:datetime (type: string), radius (type: string), x (type: string), y (type: string), classifier (type: string), categoryid (type: string), accountid (type: string), mapid (type: string) TableScan [TS_0] alias:pos_regions_orc2 Statistics:Num rows: 19535311 Data size: 17581916530 Basic stats: COMPLETE Column stats: COMPLETE Time taken: 0.774 seconds, Fetched: 37 row(s)
Created 01-06-2016 08:46 AM
The ideal rewrite for such a query is
select * from latestposition where regionid='1d6a0be1-6366-4692-9597-ebd5cd0f01d1'and id=1422792010and deviceid='6c5d1a30-2331-448b-a726-a380d6b3a432' order by ts limit 1;
is the table partitioned on ts?
Created 01-06-2016 08:48 PM
Hive currently doesn't push predicates in to row_number; but it does push it into rank.
May be you want to try rewriting view definition as
WITH t1 AS ( SELECT *, rank() OVER ( PARTITION BY regionid, id, deviceid order by ts desc) AS rownos FROM positions ) SELECT distinct * FROM t1 WHERE rownos < 2;
This will cause Hive to push limit in to Windowing Fns.
Created 01-07-2016 10:42 AM
I've tried this and it still doesn't seem to push down into the Windowing Function. I'm still getting the full table scan at the beginning of my execution plan.
Created 01-08-2016 12:58 AM
You need to set hive.limit.pushdown.memory.usage;
may be se to 0.3.
This would still result in Table scan but the data that needs to be munched by other operators in the pipeline would be low.
Created 01-08-2016 10:28 PM
I tried "set hive.limit.pushdown.memory.usage=0.3;" and running the using Rank but it still take 800s.
Thanks for taking up this ticket so quickly! Once you finish your patch will it be available on HIVE-12808? If so, would it be possible to clone the Hive repo and apply your patch and compile and test it?
Created 01-07-2016 11:47 PM
@Ryan Tomczik: I can confirm this as a bug, even without views.
Filed https://issues.apache.org/jira/browse/HIVE-12808
Please have a look & see if that describes the issue clearly.
Created 10-15-2017 07:25 AM
We also faced a similar issue and this approach helped me to push down subquery even having "WITH" clause. This even works in Hive 1.2.1
https://stackoverflow.com/questions/13523049/hive-sql-find-the-latest-record
The limitation is that this is
a) only useful for MAX logic.
b) We can't have the flexibility of having windowing different ORDERS, .ie. col1 desc, col2 asc, col3 desc etc
Thanks,
Sarath