
SCD1 Implementation


Contributor

The steps involved in implementing SCD Type 1 are:

Identifying the new record and inserting it into the dimension table.

Identifying the changed record and updating the dimension table.

How to implement SCD1 in Hive?

I have found the following, but I am not sure how to put together a complete solution:

a) We can generate a surrogate key using datafu.pig.hash.SHA() in Pig, or in Hive using row_number() (see http://www.remay.com.br/blog/hdp-2-2-how-to-create-a-surrogate-key-on-hive/). A minimal sketch of the Hive option follows below.

b) For change capture, I can use a full outer join to identify new and updated records.

c) To use the UPDATE statement in Hive, we have to enable the transactional table property and use the ORC format.

I want to do this in either Hive or Pig.
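To make (a) concrete, here is a minimal sketch of generating surrogate keys with row_number() in Hive; the table names stg_customer (incoming load) and dim_customer (existing dimension) and their columns are assumptions for illustration only:

-- Minimal sketch (assumed table/column names): assign new surrogate keys by
-- offsetting row_number() with the current maximum key in the dimension.
INSERT INTO TABLE dim_customer
SELECT
  mx.max_sk + ROW_NUMBER() OVER (ORDER BY s.name) AS customer_sk,
  s.name,
  s.city
FROM stg_customer s
CROSS JOIN (SELECT COALESCE(MAX(customer_sk), 0) AS max_sk FROM dim_customer) mx
LEFT JOIN dim_customer d ON s.name = d.name
WHERE d.name IS NULL;  -- only rows not already in the dimension get a new key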

The source and target are shown below:

(Attached image: 2017-02-09-18-06-11-book1-excel.png, showing the sample source and target tables.)
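For (c), a minimal sketch of what the target would need to look like if you go down the UPDATE route; on Hive versions of that era the table must be bucketed, stored as ORC, and marked transactional (table, column, and placeholder values below are assumptions):

-- Minimal sketch (assumed names): the ACID/UPDATE path needs a transactional ORC table.
CREATE TABLE dim_customer_acid (
  customer_sk BIGINT,
  name        STRING,
  city        STRING
)
CLUSTERED BY (customer_sk) INTO 4 BUCKETS
STORED AS ORC
TBLPROPERTIES ('transactional'='true');

-- SCD1 "update in place" for a changed row (placeholder values).
UPDATE dim_customer_acid
SET city = 'new_city'
WHERE name = 'existing_name';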
3 Replies

Re: SCD1 Implementation

Contributor

Hi experts,

Any input on my question above?

Re: SCD1 Implementation

New Contributor

Hello Vamsi,

This can be implemented in Hive if you maintain a history of data changes by adding a timestamp column and using a windowing function.

(Attached image: 12788-result.png, showing the expected result.)

lead(loaded_timestamp) over (partition by name order by loaded_timestamp)

-- partition by the <primary_key> column(s) that differentiate the data

More on windowing functions: https://cwiki.apache.org/confluence/display/Hive/LanguageManual+WindowingAndAnalytics
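A minimal sketch of that idea, assuming a hypothetical history table customer_history(name, city, loaded_timestamp) to which every load appends rows; the current (SCD1) version of each key is the row with no later load:

-- Minimal sketch (assumed names): a row is current when lead() finds no later
-- loaded_timestamp for the same key.
SELECT name, city, loaded_timestamp
FROM (
  SELECT
    name,
    city,
    loaded_timestamp,
    LEAD(loaded_timestamp) OVER (PARTITION BY name ORDER BY loaded_timestamp) AS next_load_ts
  FROM customer_history
) h
WHERE next_load_ts IS NULL;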

Re: SCD1 Implementation

Hi vamsi valiveti

There are multiple ways, as you have mentioned in the question. Let me explain a few in Hive.

1) Full outer join: There should be a primary key, present in both the source and the target, that lets us identify new and updated records. After identifying them, pick the columns from the left (source) table for new and updated records, pick the columns from the right (target) table for records with no update on that day, and then overwrite the target. All of this is driven by a flag you derive from the primary key comparison.
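A minimal sketch of this step, with assumed names (stg_customer as the daily source, dim_customer as the target, name as the key, city as the tracked attribute):

-- Minimal sketch (assumed names): flag each row after the full outer join and
-- take source columns for new/updated rows, target columns for unchanged ones.
DROP TABLE IF EXISTS dim_customer_stage;
CREATE TABLE dim_customer_stage AS
SELECT
  CASE
    WHEN d.name IS NULL   THEN 'NEW'
    WHEN s.name IS NULL   THEN 'UNCHANGED'
    WHEN s.city <> d.city THEN 'UPDATED'
    ELSE 'UNCHANGED'
  END AS change_flag,
  COALESCE(s.name, d.name) AS name,
  CASE WHEN s.name IS NOT NULL THEN s.city ELSE d.city END AS city
FROM stg_customer s
FULL OUTER JOIN dim_customer d
  ON s.name = d.name;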

2) Your target does not necessarily have to be in ORC format. You can use other native file formats as well, but you may need to truncate the target table before overwriting it, following the above approach in two steps: after performing the full outer join, load the data into a staging table, and then overwrite the target with the staging table's data. This way you can use any kind of file format; it does not need to be ORC.
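Continuing the sketch above, the second step simply overwrites the target from the staging table, so no ACID or ORC requirement applies on this path:

-- Minimal sketch (assumed names): overwrite the target from the staging table.
INSERT OVERWRITE TABLE dim_customer
SELECT name, city
FROM dim_customer_stage;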
