08-03-2017 09:35 AM
This will be probably a newbie question! :)
I want to update my dimension tables which will be enriched after the fact ingestion.
What techniques are available to "update" data in Avro and Parquet files? I am aware that they do not support updates but I have seen two techniques being used.
The first one is to delete the file and recreate it completely with the new data.
The second one is to create a new file with the new data and at query time use a join to get the new information. If I have two versions of the same data, say in file A I have version 1 and in file B I have version 2. I want to use version 2.
I could easyly use this version number in the join clause, but I would need to known in advance what version to use. I am interested about this second technique. How is it done in the field?
08-03-2017 11:29 AM
The third option would be, maintain a unique version number for each record (in the same table) based on primary key combination and increase whenever you insert latest version of data. Also get the record with max version number and it will be your latest record. (so that you don't need to create new table/file everytime a record updated)
There are few more methods available (you can follow CDC methods) and i've shared the details in this link