Support Questions
Find answers, ask questions, and share your expertise

how to select the multiple hive partitioned based on where conditions

how to select the multiple hive partitioned based on where conditions

Hi,

On the staging database we are getting updated data from relational database. As HDFS/Hive supports append only. So we need to delete the partitions where data is modified and replace with updated data partitions.

Lets say we have a T_UpdatedData table which gets populated by Sqoop per day. If on any data we have updated data from multiple partitions, we need to backup, delete those partitions from the T_ProductionData. Partitions are created on Year, Month, Account columns. Backup should be saved on the T_HistoryData

How to can select the different partitions from the T_UpdatedData & copy those partitions from T_ProductionData to T_HistoryData ?

Inserting dynamic partitions can be done with following syntax.

INSERT OVERWRITE TABLE tablename1 PARTITION (partcol1=val1, partcol2=val2...)

Two ways to select the distinct partition data could be

1. SELECT * FROM T_ProductionData WHERE (year,month,account) IN (SELECT distinct year, month, account FROM T_UpdatedData) -- Supported by Oracle.

2. Using joins. Using joins could be more costly as it will try to perform join in all T_ProductionData.

What is the best practice to backup the partitions and insert the modified partitions ?

Thanks

1 REPLY 1

Re: how to select the multiple hive partitioned based on where conditions

So if I understand it correctly you want to get the partitions that contain data in the staging table from the master table BEFORE inserting the data from the staging table and overwriting them to do some backup.

I don't think there is a real best practice. Normally you add partitions but not update them ( by year/month at least ). Or you know which partitions you update beforehand.

But you can definitely do an IN or join. Both "should" be equivalent in performance since Hive has a good Optimizer now and will translate the IN clause into a join as well. Also the join should be efficient because he should have a very small lookup table ( list of partitions ) that is Mapside Joined into the main table.

Just make sure there is a MapSide join. ( EXPLAIN )