Created 07-18-2017 12:00 AM
How to write and learn SCD 2 type in Hive and Pig?
Created 07-18-2017 01:33 AM
I think you probably realize what makes SCD type 2 difficult in Hadoop (hive/Pig) is that you cannot update records (With new Hive ACID you can but under the hood its doing the magic, that you can also do your self).
Rather than reprinting the process here, here is one link that describes implementing doing SCD Type 2 in Hadoop using Hive. Hope this helps.
https://www.softserveinc.com/en-us/tech/blogs/process-slowly-changing-dimensions-hive/
Created 07-18-2017 01:33 AM
I think you probably realize what makes SCD type 2 difficult in Hadoop (hive/Pig) is that you cannot update records (With new Hive ACID you can but under the hood its doing the magic, that you can also do your self).
Rather than reprinting the process here, here is one link that describes implementing doing SCD Type 2 in Hadoop using Hive. Hope this helps.
https://www.softserveinc.com/en-us/tech/blogs/process-slowly-changing-dimensions-hive/
Created 11-19-2018 10:21 PM
Here's the detailed implementation of slowly changing dimension type 2 in Hive using exclusive join approach.
Assuming that the source is sending a complete data file i.e. old, updated and new records.
Steps:
select * from HIST_TAB where exp_dt != '2099-12-31'
select hist.* from HIST_TAB hist
inner join STG_TAB stg
on hist.key = stg.key
where hist.column = stg.column
select stg.*, eff_dt (yyyy-MM-dd), exp_dt (2099-12-31)
from STG_TAB stg
left join
(select * from HIST_TAB where exp_dt = '2099-12-31') hist
on hist.key = stg.key
where hist.key is null
or hist.column != stg.column
select hist.*, exp_dt(yyyy-MM-dd) from
(select * from HIST_TAB where exp_dt = '2099-12-31') hist
left join STG_TAB stg
on hist.key= stg.key
where hist.key is null
or hist.column!= stg.column
unionall
queries from 2-5 and insert overwrite result to HIST tableMore detailed implementation of SCD type 2 can be found here-
https://github.com/sahilbhange/slowly-changing-dimension
Hope this helps!