In our staging database we receive updated data from a relational database. Since HDFS/Hive is append-only, we need to delete the partitions whose data was modified and replace them with partitions built from the updated data.
Let's say we have a T_UpdatedData table that Sqoop populates daily. If on any day we receive updated data spanning multiple partitions, we need to back up those partitions from T_ProductionData and then delete them. Partitions are created on the Year, Month, and Account columns. The backup should be saved to T_HistoryData.
How can we select the distinct partitions from T_UpdatedData and copy those partitions from T_ProductionData to T_HistoryData?
Inserting dynamic partitions can be done with the following syntax.
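The syntax snippet itself is not shown in the post; a minimal sketch of a dynamic-partition insert in HiveQL, assuming the partition columns year, month, and account from the question (the data columns col1, col2 are illustrative placeholders):

```sql
-- Enable dynamic partitioning; nonstrict mode allows all
-- partition columns to be resolved dynamically from the SELECT.
SET hive.exec.dynamic.partition = true;
SET hive.exec.dynamic.partition.mode = nonstrict;

-- Overwrite only the partitions that appear in the SELECT result.
-- The partition columns must come last, in declaration order.
INSERT OVERWRITE TABLE T_ProductionData PARTITION (year, month, account)
SELECT col1, col2, year, month, account
FROM T_UpdatedData;
```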
So, if I understand correctly: before inserting the data from the staging table and overwriting those partitions, you want to find the partitions in the master table that also contain data in the staging table, so you can back them up.
I don't think there is an established best practice here. Normally you only add partitions rather than update them (at year/month granularity at least), or you know beforehand which partitions you are updating.
But you can definitely use an IN clause or a join. Both should be roughly equivalent in performance, since Hive now has a good optimizer and will translate the IN clause into a join anyway. The join should also be efficient, because the lookup side (the list of partitions) is very small and can be map-side joined into the main table.
Just make sure a map-side join is actually chosen (check the plan with EXPLAIN).
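Putting that together, one way to back up the affected partitions before overwriting them is a LEFT SEMI JOIN against the distinct partition keys of the staging table; the small derived table of keys is what Hive can map-side join into the main table. Table and column names are taken from the question; col1, col2 are illustrative:

```sql
-- 1. Back up the partitions that are about to be replaced.
--    The subquery yields only the distinct partition keys present
--    in the staging table, so it is a small map-side lookup table.
INSERT INTO TABLE T_HistoryData PARTITION (year, month, account)
SELECT p.col1, p.col2, p.year, p.month, p.account
FROM T_ProductionData p
LEFT SEMI JOIN (
  SELECT DISTINCT year, month, account
  FROM T_UpdatedData
) u ON p.year = u.year AND p.month = u.month AND p.account = u.account;

-- 2. Now overwrite exactly those partitions with the updated data.
INSERT OVERWRITE TABLE T_ProductionData PARTITION (year, month, account)
SELECT col1, col2, year, month, account
FROM T_UpdatedData;
```

Prefixing either statement with EXPLAIN and looking for a "Map Join Operator" in the plan confirms the small side is broadcast rather than shuffled.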