The processes involved in implementing SCD Type 1 are:
Identifying the new record and inserting it into the dimension table.
Identifying the changed record and updating the dimension table.
How do I implement SCD Type 1 in Hive?
I have found the pieces below, but I am not sure how to put together a complete solution:
a) We can generate a surrogate key using datafu.pig.hash.SHA() in Pig, or in Hive using row_number() as described at http://www.remay.com.br/blog/hdp-2-2-how-to-create-a-surrogate-key-on-hive/
b) For change capture, I can use a full outer join to identify new records and updated records.
c) To use the UPDATE statement in Hive, we have to enable the transaction properties and store the table in ORC format.
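For reference, point (c) typically looks like the sketch below. The table and column names are illustrative, and the exact set of required properties varies by Hive version:

```sql
-- Assumed settings for Hive ACID support (verify against your Hive version)
SET hive.support.concurrency=true;
SET hive.txn.manager=org.apache.hadoop.hive.ql.lockmgr.DbTxnManager;

-- ACID tables must be bucketed ORC tables marked transactional
CREATE TABLE dim_customer (
  customer_sk BIGINT,
  name        STRING,
  city        STRING
)
CLUSTERED BY (customer_sk) INTO 4 BUCKETS
STORED AS ORC
TBLPROPERTIES ('transactional'='true');

-- With that in place, UPDATE is allowed
UPDATE dim_customer SET city = 'NYC' WHERE customer_sk = 101;
```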
I want to do this in either Hive or Pig.
The source and target are as below:
This can be implemented in Hive if you maintain a history of data changes by adding a timestamp column and using a windowing function:

lead(loaded_timestamp) over (partition by name order by loaded_timestamp)
-- partition by the primary key that differentiates the data
More on windowing functions: https://cwiki.apache.org/confluence/display/Hive/LanguageManual+WindowingAndAnalytics
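Putting that together, a minimal sketch might look like this. The history table `dim_customer_hist`, the business key `name`, and the other columns are illustrative, not from the original question:

```sql
-- Keep every load in a history table with a loaded_timestamp, then rebuild
-- the Type-1 dimension by keeping only the latest row per business key.
-- A row is "latest" when lead() finds no later load for that key.
INSERT OVERWRITE TABLE dim_customer
SELECT name, city, loaded_timestamp
FROM (
  SELECT name, city, loaded_timestamp,
         lead(loaded_timestamp) OVER (PARTITION BY name
                                      ORDER BY loaded_timestamp) AS next_ts
  FROM dim_customer_hist
) t
WHERE next_ts IS NULL;
```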
There are multiple ways, as you have mentioned in the question. Let me explain a few in Hive.
1) Full outer join: there should be a primary key, present in both the source and the target, that lets us identify new and updated records. After the join, pick the columns from the left (source) side for new and updated records, pick the columns from the right (target) side for rows with no update on that day, and then overwrite the target. All of this is driven by a flag which you derive from the primary-key comparison.
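A sketch of the flag derivation, assuming a source table `src`, a target table `tgt`, and a shared primary key `id` (all names illustrative):

```sql
-- Classify each row from the full outer join, then pick source columns
-- for NEW/UPDATED rows and target columns for unchanged rows.
SELECT COALESCE(s.id, t.id) AS id,
       CASE WHEN s.id IS NOT NULL THEN s.name ELSE t.name END AS name,
       CASE WHEN s.id IS NOT NULL THEN s.city ELSE t.city END AS city,
       CASE WHEN t.id IS NULL THEN 'NEW'       -- only in source
            WHEN s.id IS NULL THEN 'NOCHANGE'  -- only in target
            ELSE 'UPDATED' END AS flag         -- in both: take source
FROM src s
FULL OUTER JOIN tgt t
  ON s.id = t.id;
```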
2) Your target does not necessarily have to be in ORC format. You can use any other native format as well, but you may need to split the above approach into two steps so the target can be truncated and overwritten safely: after performing the full outer join, load the result into a stage table, and then overwrite the target with the stage table's data. This way you can use any kind of file format; it does not have to be ORC.
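The two-step flow above can be sketched as follows, again with illustrative names (`src`, `tgt`, `dim_stage`, key column `id`). Note that COALESCE here prefers source values for rows present in both tables, which is the SCD1 overwrite behavior:

```sql
-- Step 1: land the merged result in a stage table
INSERT OVERWRITE TABLE dim_stage
SELECT COALESCE(s.id,   t.id)   AS id,
       COALESCE(s.name, t.name) AS name,
       COALESCE(s.city, t.city) AS city
FROM src s
FULL OUTER JOIN tgt t
  ON s.id = t.id;

-- Step 2: overwrite the target from the stage table; works for any
-- native format (TEXTFILE, PARQUET, ...), no ACID/ORC required
INSERT OVERWRITE TABLE tgt
SELECT id, name, city FROM dim_stage;
```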