Created on 04-21-201712:18 PM - edited 08-17-201901:17 PM
With the release of HDP 2.6 comes the Hive ACID Merge feature. This is long sought after feature for people who want to batch update table content in ACID manner. This tutorial will walk you through step-by-step how to use this new feature with some really simple dataset. The example is done on HDP 18.104.22.168 installation.
First, ACID in Hive need to be turned on in Ambari UI
Then we will create 2 tables, one as the target of merge and one as the source of merge. Please note that the target table must be bucketed, set as transaction enabled and stored in orc format.
CREATE DATABASE merge_data;
CREATE TABLE merge_data.transactions(
PARTITIONED BY (tran_date string)
CLUSTERED BY (ID) into 5 buckets
STORED AS ORC TBLPROPERTIES ('transactional'='true');
CREATE TABLE merge_data.merge_source(
STORED AS ORC;
Then we will populate the target and source table with some data.
When we examine the 2 tables, we would expect after the merge, row 1 should be left untouched, row 4 should be deleted (we imply a business rule here: NULL value indicates deletion), row 7 will be update and row 11 will be insertion for new value.
The more interesting use case is with row 8, where it involves move the row from one partition to another. Merge currently does not support change of partition value on the fly. This need to happen as delete in the old partition and insert in the new partition. In real world use case, you need to construct the source table base on this critiria.
Then we will create the merge statement as the following. Please note that not all 3 WHEN of the merge statements need to exist, it is fine to have only 2 or even 1 of the WHEN statement. We labeled the data with different last_update_user. For more details on Hive Merge, please refer to Hive document
MERGE INTO merge_data.transactions AS T
USING merge_data.merge_source AS S
ON T.ID = S.ID and T.tran_date = S.tran_date
WHEN MATCHED AND (T.TranValue != S.TranValue AND S.TranValue IS NOT NULL) THEN UPDATE SET TranValue = S.TranValue, last_update_user = 'merge_update'
WHEN MATCHED AND S.TranValue IS NULL THEN DELETE
WHEN NOT MATCHED THEN INSERT VALUES (S.ID, S.TranValue, 'merge_insert', S.tran_date);
As part of the update clause, the set value statement should not contain the target table decorator "T.", otherwise you will get SQL compile error.
Once the merge finished, re-examining the data shows the data is merged just as expected
row 1 wasn't changed ; row 4 was deleted; row 7 was updated and row 11 was inserted. And row 8, as we see, got moved to a new partition.
This simple example provides instructions on how to use ACID merge with HDP 2.6 or later. Of course the real world use case would be much more complicated comparing to this oversimplified example, but they all follow the same principles. And the beauty behind this is, you don't need to create a ETL process to accomplish this any more.