We have Datawarehouse residing in Data lake. At present we are using Hive to build and access facts and Dimensions. But these dimensions follows SCD type 2 implementations. It's similar to typical star schema which we have followed. Though fact tables are de-normalized and has important columns captured from other dimensions, there are times when we need to perform join between dimensions. In such case is it ok to build the dimension tables in HBase as it can easily handle data compaction.What will be the pros and Cons of using Hbase to build dimensions?
I am looking into something very similar but what I have found is using Hive LLAP/ACID (merge feature) is the right way to go. Here is what I know so far, from talking to a coworker who has done couple of successful POC for SCD Type 2 .
Before reading below, please see the following link (video starts at 20:20 which is where presenter talks about SCD Type 2 example):
Here is the approach on implementing SCD Type 2:
The loading of the data was done by exporting existing data to a landing zone and then doing CTAS to create the optimized ORC tables with table and column stats. Another option would be to use the new merge function to load incremental data (above youtube link). One approach is to use a CDC tool (attunity) + HDF to stream changes into a diff table and then at a regular interval use merge to update the SCD type 2 tables.
Hive ACID is different than what people are used to with RDBMS ACID. The transaction scope is only per table or partition. There are also no begin/…/commit statements. Everything is essentially auto-commit. What happens under the covers is that delta files are created for the table with the changes. Then hive manages minor and major compactions under the covers to merge the data. This is no doubt slower than a traditional RDBMS. But performance can be improved by increasing the number of compaction threads and running the updates across different tables or partitions. If you have the scenario where new data is being added (no deletes or updates, just inserts) then you can insert multiple rows per statement and this speeds up the ingest. Once the transaction commits the new data immediately available to the consumer. There is also the concept of batch transactions which can be used to increase performance when doing high velocity transactions. You would have to code for it though. The hive streaming storm bolt and nifi processors use it but your SQL gui tools won’t.
@mqureshi Thanks. We have implemented SCD type 2 in hive without using MERGE as the version which we used doesnt support MERGE. Does it mean that the insertion into HBASE and Hive will be optimal. Because in HBASE we can hold versions and it can be handled easily. Even in Hive with MERGE we can achieve it without much effort. But when it comes to Merge we have to perform a join between the source and target to identify the insert/Update/No update records. if the target is high and the delta records very less even then I believe that its uses high resource as we are performing join operations in hive. In such case is it better to go with HBASE or MERGE option in hive?