Support Questions

Find answers, ask questions, and share your expertise
Announcements
Celebrating as our community reaches 100,000 members! Thank you!

Best and Easy way to implement and create SCD2 in Hive and in Pig?

avatar
New Contributor

How to write and learn SCD 2 type in Hive and Pig?

1 ACCEPTED SOLUTION

avatar
Super Guru
@Upendra N

I think you probably realize what makes SCD type 2 difficult in Hadoop (hive/Pig) is that you cannot update records (With new Hive ACID you can but under the hood its doing the magic, that you can also do your self).

Rather than reprinting the process here, here is one link that describes implementing doing SCD Type 2 in Hadoop using Hive. Hope this helps.

https://www.softserveinc.com/en-us/tech/blogs/process-slowly-changing-dimensions-hive/

View solution in original post

2 REPLIES 2

avatar
Super Guru
@Upendra N

I think you probably realize what makes SCD type 2 difficult in Hadoop (hive/Pig) is that you cannot update records (With new Hive ACID you can but under the hood its doing the magic, that you can also do your self).

Rather than reprinting the process here, here is one link that describes implementing doing SCD Type 2 in Hadoop using Hive. Hope this helps.

https://www.softserveinc.com/en-us/tech/blogs/process-slowly-changing-dimensions-hive/

avatar
New Contributor

Here's the detailed implementation of slowly changing dimension type 2 in Hive using exclusive join approach.

Assuming that the source is sending a complete data file i.e. old, updated and new records.

Steps:

  1. Load the recent file data to STG table
  2. Select all the expired records from HIST table

    select * from HIST_TAB where exp_dt != '2099-12-31'

  3. Select all the records which are not changed from STG and HIST using inner join and filter on HIST.column = STG.column as below

    select hist.* from HIST_TAB hist inner join STG_TAB stg on hist.key = stg.key where hist.column = stg.column

  4. Select all the new and updated records which are changed from STG_TAB using exclusive left join with HIST_TAB and set expiry and effective date as below

    select stg.*, eff_dt (yyyy-MM-dd), exp_dt (2099-12-31) from STG_TAB stg left join (select * from HIST_TAB where exp_dt = '2099-12-31') hist
    on hist.key = stg.key where hist.key is null or hist.column != stg.column

  5. Select all updated old records from the HIST table using exclusive left join with STG table and set their expiry date as shown below:

    select hist.*, exp_dt(yyyy-MM-dd) from (select * from HIST_TAB where exp_dt = '2099-12-31') hist left join STG_TAB stg
    on hist.key= stg.key where hist.key is null or hist.column!= stg.column

  6. unionall queries from 2-5 and insert overwrite result to HIST table

More detailed implementation of SCD type 2 can be found here-

https://github.com/sahilbhange/slowly-changing-dimension

Hope this helps!