About marcos_dasilva

marcos_dasilva · ‎02-23-2017

@Artem ErvitsThanks for your prompt response. I had read that post before i sent mine and i believe the post you sent focuses on a slightly different issue(if and how to partition) as opposed to how to optimize accessing the "latest" [max(<partition key>) ]partition. I'm sorry if i missed it, but do you mind pointing out where my question is addressed in the link you posted?

marcos_dasilva · ‎02-23-2017

I have a pretty standard set up where i have a table which is partitioned by date (daily) and contains a snapshot of the data as of that day. As more and more data/days are added often i need to be able to access the latest snapshot/partition to perform some further processing. I am finding that all the queries i have written so far perform a full table scan in order to get the value of the max partition key and then this value is used to access the exact partition. No matter what i tried i could not get the optimizer to not perform this full table scan. I have resorted to creating a UDF which will access the HIVE catalog, list all of the partitions for this table, return a string with the value of the largest partition and use the return value of this partition as the filter in teh where clause. This UDF approach performs MUCH MUCH better. Is there another way of doing this which does not require the creation of a UDF? This seems like a reasonable optimization to build into HIVE or at least have the optimizer perform similar steps as i described for the UDF.

Online	Offline
Last Visited	‎08-02-2017 07:23 PM

Member Since	‎02-23-2017 03:27 PM
Last Visited	‎08-02-2017 07:23 PM
Posts	4
Kudos received	3

Cloudera Community

Re: How to optimize HIVE access to the "latest" pa...

How to optimize HIVE access to the "latest" partit...