Support Questions

Find answers, ask questions, and share your expertise

How to optimize HIVE access to the "latest" partition

avatar

I have a pretty standard set up where i have a table which is partitioned by date (daily) and contains a snapshot of the data as of that day. As more and more data/days are added often i need to be able to access the latest snapshot/partition to perform some further processing. I am finding that all the queries i have written so far perform a full table scan in order to get the value of the max partition key and then this value is used to access the exact partition. No matter what i tried i could not get the optimizer to not perform this full table scan. I have resorted to creating a UDF which will access the HIVE catalog, list all of the partitions for this table, return a string with the value of the largest partition and use the return value of this partition as the filter in teh where clause. This UDF approach performs MUCH MUCH better. Is there another way of doing this which does not require the creation of a UDF? This seems like a reasonable optimization to build into HIVE or at least have the optimizer perform similar steps as i described for the UDF.

9 REPLIES 9

avatar
Master Mentor

avatar

@Artem ErvitsThanks for your prompt response. I had read that post before i sent mine and i believe the post you sent focuses on a slightly different issue(if and how to partition) as opposed to how to optimize accessing the "latest" [max(<partition key>) ]partition.

I'm sorry if i missed it, but do you mind pointing out where my question is addressed in the link you posted?

avatar
Master Mentor

@Marcos Da Silva no you're right, I misunderstood your question.

avatar
Super Collaborator

@Marcos Da Silva

could you please provide create table statement and select query that invokes full table scan?

also please run explain and let's see output here. that will definitely help to understand the issue.

avatar
New Contributor

Hi @Marcos Da Silva,

I have the exact same problem. Could you please share your UDF?

Cheers,

Alexey

avatar
Expert Contributor

@Marcos Da Silva

This should solve the problem as it did for mine.

select column1,column2 from table where partition_column in 
(select max(distinct partition_column) from table)"

avatar
Contributor

@Marcos Da Silva , we are having the same problem and could you please elaborate on the UDF you are talking about and if possible give the code snippet.

avatar
Contributor

@Marcos Da Silva , we are having the same problem and could you please elaborate on the UDF you are talking about and if possible give the code snippet.

avatar
New Contributor

We can use rank approach which is faster than max , max scans the table twice:

Here , partition column is load_date:

select

ld_dt.txnno ,

ld_dt.txndate ,

ld_dt.custno ,

ld_dt.amount ,

ld_dt.productno ,

ld_dt.spendby ,

ld_dt.load_date

from

(select *,dense_rank() over (order by load_date desc) dt_rnk from datastore_s2.transactions)ld_dt

where ld_dt.dt_rnk=1