Created 02-23-2017 07:57 PM
I have a pretty standard set up where i have a table which is partitioned by date (daily) and contains a snapshot of the data as of that day. As more and more data/days are added often i need to be able to access the latest snapshot/partition to perform some further processing. I am finding that all the queries i have written so far perform a full table scan in order to get the value of the max partition key and then this value is used to access the exact partition. No matter what i tried i could not get the optimizer to not perform this full table scan. I have resorted to creating a UDF which will access the HIVE catalog, list all of the partitions for this table, return a string with the value of the largest partition and use the return value of this partition as the filter in teh where clause. This UDF approach performs MUCH MUCH better. Is there another way of doing this which does not require the creation of a UDF? This seems like a reasonable optimization to build into HIVE or at least have the optimizer perform similar steps as i described for the UDF.
Created 02-23-2017 08:02 PM
@Marcos Da Silva a good discussion on the topic https://community.hortonworks.com/questions/29031/best-pratices-for-hive-partitioning-especially-by....
Created 02-23-2017 08:10 PM
@Artem ErvitsThanks for your prompt response. I had read that post before i sent mine and i believe the post you sent focuses on a slightly different issue(if and how to partition) as opposed to how to optimize accessing the "latest" [max(<partition key>) ]partition.
I'm sorry if i missed it, but do you mind pointing out where my question is addressed in the link you posted?
Created 02-23-2017 08:35 PM
@Marcos Da Silva no you're right, I misunderstood your question.
Created 03-10-2017 01:07 AM
could you please provide create table statement and select query that invokes full table scan?
also please run explain and let's see output here. that will definitely help to understand the issue.
Created 08-04-2017 06:19 PM
Created 10-03-2017 05:38 PM
This should solve the problem as it did for mine.
select column1,column2 from table where partition_column in (select max(distinct partition_column) from table)"
Created 04-16-2019 04:31 AM
@Marcos Da Silva , we are having the same problem and could you please elaborate on the UDF you are talking about and if possible give the code snippet.
Created 04-16-2019 04:31 AM
@Marcos Da Silva , we are having the same problem and could you please elaborate on the UDF you are talking about and if possible give the code snippet.
Created 04-28-2019 02:56 PM
We can use rank approach which is faster than max , max scans the table twice:
Here , partition column is load_date:
select
ld_dt.txnno ,
ld_dt.txndate ,
ld_dt.custno ,
ld_dt.amount ,
ld_dt.productno ,
ld_dt.spendby ,
ld_dt.load_date
from
(select *,dense_rank() over (order by load_date desc) dt_rnk from datastore_s2.transactions)ld_dt
where ld_dt.dt_rnk=1