Hi @yassan ,
let's recap some important concepts, then I'll add my comments to your points.
Generally speaking:
- If you issue a SQL statement in Hive/Impala, the engine is fully aware of what has been requested and will do everything needed. E.g. if you drop a partition, the engine knows that metadata has to be updated and whether data has to be purged or not (e.g. if it's an external table, data on the filesystem won't be deleted by default).
  NOTE: if you want drop statements to delete data as well, you would need a managed (non-external) table. You might also try to alter your table and set TBLPROPERTIES ("external.table.purge"="true"). Honestly I'm not sure if this is available in your version of Hive; it is for sure in more up-to-date versions (e.g. Cloudera Data Platform).
- If you delete data directly on the filesystem (e.g. via a Spark job or via the hdfs CLI), there is no way for the Hive/Impala engine to know that it happened unless you explicitly tell it. You can do that by launching an "MSCK REPAIR TABLE [...]" on Hive, or by launching an "ALTER TABLE tablename DROP PARTITION [...]" on either Hive or Impala. Well actually, if you are using Spark you could rely on Spark SQL and issue a drop partition statement (see the summary at the end of this post).
- Impala relies on the Hive metastore but caches metadata. If you make some changes to metadata via Hive, you'll then have to launch an "INVALIDATE METADATA" on Impala in order to refresh the cache.
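To make the points above concrete, here is a minimal sketch of the statements involved. The table name `my_db.events` and the partition column `year` are made up for illustration; substitute your own:

```sql
-- Hive: re-sync the metastore with partitions that exist on the filesystem
MSCK REPAIR TABLE my_db.events;

-- Hive or Impala: explicitly tell the engine a partition is gone
ALTER TABLE my_db.events DROP IF EXISTS PARTITION (year = 2019);

-- Impala: refresh the cached metadata after changes made via Hive
INVALIDATE METADATA my_db.events;
```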
These are key points to be taken into account.
Commenting your last post:
If you have a lot of partitions, you have a couple of ways to reduce the effort of launching many drop partition statements:
- of course, you can script it (or you could develop a Spark job, or you could maybe come up with some other automation strategy)
- if the partition field allows for it, you could drop a range of partitions with a single statement. E.g. it could be something like "ALTER TABLE my_table_partitioned_by_year DROP PARTITION (year < 2020)". You can do this from Impala if you prefer, so that you won't have to refresh the Impala cache.
This will drop partitions but won't ever drop the table.
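Combining this with the purge property mentioned earlier, a hedged sketch could look like the following (again, `my_db.events` and `year` are hypothetical names, and the purge property may not be supported on older Hive versions):

```sql
-- Make drop statements delete the underlying data as well, if supported:
ALTER TABLE my_db.events SET TBLPROPERTIES ("external.table.purge"="true");

-- Drop a whole range of partitions with a single statement:
ALTER TABLE my_db.events DROP PARTITION (year < 2020);
```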
Now, summarizing everything we've shared so far, you have three possible ways to go:
- Do it via Impala by using the drop partition SQL statement.
- Delete data directly on the filesystem and afterwards tell Impala to drop the partition (drop partition statements in Impala, or MSCK REPAIR on Hive + INVALIDATE METADATA on Impala).
- Use a Spark job and issue a drop partition statement via Spark SQL + INVALIDATE METADATA on Impala (since the Spark job would directly act on the Hive metastore, out of Impala's line of sight).
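For the third option, a minimal sketch of the two steps (table and partition names are again made up for illustration):

```sql
-- Step 1, inside the Spark job (e.g. via spark.sql("...")):
-- this updates the Hive metastore directly, bypassing Impala's cache
ALTER TABLE my_db.events DROP IF EXISTS PARTITION (year = 2019);

-- Step 2, from impala-shell, so Impala picks up the metastore change:
INVALIDATE METADATA my_db.events;
```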
Hope this helps