Created 09-08-2022 01:44 AM
I am trying to set partition retention times on existing Hive manged tables using the following:
ALTER TABLE <table name> SET TBLPROPERTIES ('discover.partitions'='true');
ALTER TABLE <table name> SET TBLPROPERTIES ('partition.retention.period'='1d');
as stated on this page below, however I am still able to search partitions older than a day so it appears to not be working? It does mention on the page that this is for 'external' tables, can anyone let me know if this an 'age off' retention period is possible on managed tables? am I missing any commands etc?
Thanks in advance
Created on 09-13-2022 10:50 AM - edited 09-13-2022 11:04 AM
Can you let me know your cluster version ?
Please check AutoPartitionDiscovery thread (PartitionManagementTask) is running in HMS logs.
If it is running, you should be seeing following log line in HMS. Make sure you don't have specific db/table pattern.
HMS
org.apache.hadoop.hive.metastore.PartitionManagementTask: [Metastore Scheduled Worker 3]: Looking for tables using catalog: hive dbPattern: * tablePattern: * found: 107
HS2/HMS
org.apache.hadoop.hive.metastore.HiveMetaStoreChecker: [HiveServer2-Background-Pool: Thread-123]: Number of partitionsNotInMs=[], partitionsNotOnFs=[], tablesNotInMs=[], tablesNotOnFs=[], expiredPartitions=[]
org.apache.hadoop.hive.metastore.Msck: [HiveServer2-Background-Pool: Thread-123]: hive.default.test_table - #partsNotInMs: 0 #partsNotInFs: 0 #expiredPartitions: 0 lockRequired: true (R: true A: true true)
In CDP cluster managed by CM, set metastore.msck.repair.enable.partition.retention=true at Hive -> Configuration -> Hive Metastore Server Advanced Configuration Snippet (Safety Valve) for hive-site.xml
In HDP-3.x managed by Ambari, set metastore.msck.repair.enable.partition.retention=true in Hive -> custom hms-site
Restart applicable services. expiredPartitions should be visible in logs & should be removed by HMS at scheduled interval once they become eligible.
Created 09-08-2022 08:03 PM
Partition Retention is disabled by default in HDP-3.1.4.x
set metastore.msck.repair.enable.partition.retention=true in ambari -> Hive -> custom hms-site & restart HMS. HMS should automatically take care.
Created 09-13-2022 12:34 AM
Thanks for the reply @nramanaiah. I seem to be unable to find an option 'metastore.msck.repair.enable.partition.retention' does it need to be added as a custom option and if so under which drop down? Thanks
Created on 09-13-2022 10:50 AM - edited 09-13-2022 11:04 AM
Can you let me know your cluster version ?
Please check AutoPartitionDiscovery thread (PartitionManagementTask) is running in HMS logs.
If it is running, you should be seeing following log line in HMS. Make sure you don't have specific db/table pattern.
HMS
org.apache.hadoop.hive.metastore.PartitionManagementTask: [Metastore Scheduled Worker 3]: Looking for tables using catalog: hive dbPattern: * tablePattern: * found: 107
HS2/HMS
org.apache.hadoop.hive.metastore.HiveMetaStoreChecker: [HiveServer2-Background-Pool: Thread-123]: Number of partitionsNotInMs=[], partitionsNotOnFs=[], tablesNotInMs=[], tablesNotOnFs=[], expiredPartitions=[]
org.apache.hadoop.hive.metastore.Msck: [HiveServer2-Background-Pool: Thread-123]: hive.default.test_table - #partsNotInMs: 0 #partsNotInFs: 0 #expiredPartitions: 0 lockRequired: true (R: true A: true true)
In CDP cluster managed by CM, set metastore.msck.repair.enable.partition.retention=true at Hive -> Configuration -> Hive Metastore Server Advanced Configuration Snippet (Safety Valve) for hive-site.xml
In HDP-3.x managed by Ambari, set metastore.msck.repair.enable.partition.retention=true in Hive -> custom hms-site
Restart applicable services. expiredPartitions should be visible in logs & should be removed by HMS at scheduled interval once they become eligible.
Created 09-14-2022 08:40 AM
@nramanaiah thanks very much for the help!
Created 09-20-2022 04:46 AM
@nramanaiah I'm still experiencing some issues with this, I have applied metastore.msck.repair.enable.partition.retention=true and restarted, all looks good as below. I have applied the ALTER table statements to set the retention of 1 day to a test table without error, but when I do a SELECT statement in Beeline I can still see data from last week?
Any idea what I'm missing?
Created on 09-22-2022 06:24 PM - edited 09-22-2022 06:36 PM
Partition discovery & retention is only for external table & data purge is disabled by default for external tables.
ie., If tblproperties has ('discover.partitions'='true' & 'partition.retention.period'='2m' ) without 'external.table.purge'='true'. Consecutive iterations will repeatedly add/remove partition till underlying HDFS partition folder is removed. I suspect you are seeing the same situation.
eg.,
==> Expired partition is removed from Metadata in Iteration 1 <==
2022-09-23T02:43:36,786 INFO [PartitionDiscoveryTask-1]: metastore.Msck (Msck.java:repair(135)) - hive.default.test - #partsNotInMs: 0 #partsNotInFs: 0 #expiredPartitions: 1 lockRequired: true (R: true A: true true)
2022-09-23T02:43:36,822 INFO [PartitionDiscoveryTask-1]: metastore.Msck (Msck.java:repair(287)) - Expired partitions: [test:dept=cse]
==> As HDFS partition folder exist, discover.partition added it back in Iteration 2 <==
2022-09-23T02:48:36,818 INFO [PartitionDiscoveryTask-1]: metastore.HiveMetaStoreChecker (HiveMetaStoreChecker.java:checkMetastore(144)) - Number of partitionsNotInMs=[test:dept=cse], partitionsNotOnFs=[], tablesNotInMs=[], tablesNotOnFs=[], expiredPartitions=[]
2022-09-23T02:48:36,818 INFO [PartitionDiscoveryTask-1]: metastore.Msck (Msck.java:repair(135)) - hive.default.test - #partsNotInMs: 1 #partsNotInFs: 0 #expiredPartitions: 0 lockRequired: true (R: true A: true true)
==> Expired partition is removed from Metadata in Iteration 3 <==
2022-09-23T02:53:36,805 INFO [PartitionDiscoveryTask-1]: metastore.Msck (Msck.java:repair(135)) - hive.default.test - #partsNotInMs: 0 #partsNotInFs: 0 #expiredPartitions: 1 lockRequired: true (R: true A: true true)
2022-09-23T02:53:36,837 INFO [PartitionDiscoveryTask-1]: metastore.Msck (Msck.java:repair(287)) - Expired partitions: [test:dept=cse]
Please use following 3 tblproperties accordingly to make it work as you expect.
1) 'discover.partitions'='true' -- To discover partitions from HDFS path or remove partition metadata from HMS when HDFS path is removed.
2) 'partition.retention.period'='<retention_period>' -- To remove partitions after retention period
3) 'external.table.purge'='true' -- To remove HDFS partition folder, so that partition discover won't add it back to HMS.
PS: To enable global external table purge, set hive.external.table.purge.default=true cluster wide in ambari -> hive -> custom hive-site
Created 09-26-2022 03:54 PM
@Griggsy Hope above suggestion helped. Let me know if otherwise.
Created 09-27-2022 01:53 AM
@nramanaiah I haven't had a chance to do further testing yet, I will let you know ASAP. Thanks again for the help.
Created 10-03-2022 07:16 AM
@nramanaiah have been able to run further testing and confirm that my partitions are purging as expected! thanks again for the assistance!