Support Questions

Find answers, ask questions, and share your expertise
Announcements
Celebrating as our community reaches 100,000 members! Thank you!

Using hash functions in hive to figure out which records changed/added to a large table

avatar
New Contributor

I have a hive table to which new partitions get added (say daily). And I want to write a daily hive query that tells me which records changed or were added that day.

A unique record is a combination of multiple columns.

Would using hive's hash or sha (with 256) udf be the best and most performant route to writing such a query? And will using a 256 hash be good enough to prevent collisions?

5 REPLIES 5

avatar
Master Guru

One question? Why not simply add a "changed at" column?

For adding rows you could just use partitioning. And filter by the daily partiton?

For changing them you plan to use ACID? It is still pretty new and not great for high number of updates across a full table. I think a bit of more detail to what you actually plan to achieve might be good

avatar
New Contributor

Thanks Benjamin - assume the data is sqooped in from an EDW and we don't have the flexibility to add a timestamp/data column to the source table in the EDW.

Would comparing column hashes be the most performant way to figure out what records changed?

avatar
Master Guru

Yeah I am not really sure about the whole hash approach. Is there a primary key here?

Why not simply load in the batch of data and then recreate the base table using:

CREATE TABLE NEWTABLE AS

SELECT * FROM DAILYADD

UNION ALL

SELECT * FROM OLD TABLE WHERE PRIMARYKEY NOT IN ( SELECT PRIMARY KEY FROM DAILYADD );

If you don't have primary keys then hash will not help you either, you might have two rows with the same values so what would you do then? What would you DO with the information of which rows have changed?

avatar
Super Collaborator

@nfakhar, I had recently created historical build for ETL offload to Hive. We added Effective start date and effective end date to each row. I agree with you adding hash function increases performance for full row compare. The newer version of Hive has Hash functions, but we had used datafu for hashing using PIG.

avatar
Master Guru