I have a hive table to which new partitions get added (say daily). And I want to write a daily hive query that tells me which records changed or were added that day.
A unique record is a combination of multiple columns.
Would using hive's hash or sha (with 256) udf be the best and most performant route to writing such a query? And will using a 256 hash be good enough to prevent collisions?
One question? Why not simply add a "changed at" column?
For adding rows you could just use partitioning. And filter by the daily partiton?
For changing them you plan to use ACID? It is still pretty new and not great for high number of updates across a full table. I think a bit of more detail to what you actually plan to achieve might be good
Thanks Benjamin - assume the data is sqooped in from an EDW and we don't have the flexibility to add a timestamp/data column to the source table in the EDW.
Would comparing column hashes be the most performant way to figure out what records changed?
Yeah I am not really sure about the whole hash approach. Is there a primary key here?
Why not simply load in the batch of data and then recreate the base table using:
CREATE TABLE NEWTABLE AS
SELECT * FROM DAILYADD
SELECT * FROM OLD TABLE WHERE PRIMARYKEY NOT IN ( SELECT PRIMARY KEY FROM DAILYADD );
If you don't have primary keys then hash will not help you either, you might have two rows with the same values so what would you do then? What would you DO with the information of which rows have changed?
@nfakhar, I had recently created historical build for ETL offload to Hive. We added Effective start date and effective end date to each row. I agree with you adding hash function increases performance for full row compare. The newer version of Hive has Hash functions, but we had used datafu for hashing using PIG.