How do I drop empty Impala partitions?

Explorer

Hi, all!

 

I am having the same problem as below.

Drop empty Impala partitions - Stack Overflow

Impala external table partitions still show up in the stats with a row count of 0 after deleting the data in HDFS, even after recovering partitions (ALTER TABLE table RECOVER PARTITIONS), refreshing (REFRESH table), and invalidating the metadata.

Dropping the partitions one by one works, but there are tens of partitions that should be removed, and it would be quite tedious.

Dropping and recreating the table would also be an option, but that way all the statistics would be dropped together with the table.

Are there any other options in Impala to get this done?

Or is the only workaround to use Hive, as described below?


4 REPLIES

Cloudera Employee

Hi Yassan,

 

My first recommendation: when you need to drop a partition, it is better to do it via a SQL statement, either on Impala/Hive or with Spark SQL.
For example, assuming that "year" is my partitioning field:

alter table my_partitioned_table drop partition (year = 2020);

 

If you instead drop a partition at the filesystem level, there are two things you should do in order to have everything aligned in Impala:

  1. first, run "MSCK REPAIR TABLE my_partitioned_table" on Hive, in order to refresh the metastore with the correct partition information
  2. once point 1 is done, run "INVALIDATE METADATA" on Impala to refresh the Impala cache (see the sketch after this list)
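
Put together, the sequence could look like the sketch below (a minimal sketch; the SYNC PARTITIONS option mentioned later in this thread is an assumption about your Hive version, since plain MSCK REPAIR only adds missing partitions, while SYNC PARTITIONS also drops partitions whose directories are gone):

-- On Hive (e.g. via beeline): sync the metastore with what is actually on HDFS
MSCK REPAIR TABLE my_partitioned_table SYNC PARTITIONS;

-- Then on Impala (e.g. via impala-shell): refresh Impala's cached metadata
INVALIDATE METADATA my_partitioned_table;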

Let me know if this helps.
Regards

Explorer

Hi @amallegni ,

Thank you for your response.

My important points are as follows:

  1. These are Impala external table partitions.
  2. Dropping the partitions one by one works, but there are tens of partitions that should be removed, and it would be quite tedious.
  3. I don't want to drop and recreate the tables.
  4. Is the only way to use Hive's "MSCK REPAIR TABLE tablename SYNC PARTITIONS"?
     (i.e., can't this be completed from Impala alone?)

Please let me know if you have any solutions for the above.

Cloudera Employee

Hi @yassan ,

 

Let's recap some important concepts, then I will add my comments on your points.
Generally speaking:

  • If you issue a SQL statement on Hive/Impala, the engine is fully aware of what has been requested and will do everything needed. E.g. if you drop a partition, the engine knows that the metadata has to be updated and knows whether the data has to be purged or not (e.g. for an external table, data on the filesystem won't be deleted by default).
    NOTE: if you want drop statements to delete the data as well, you would need a managed (non-external) table. You might also try to alter your table and set TBLPROPERTIES ("external.table.purge"="true"); see the sketch after this list. Honestly, I'm not sure this is available in your version of Hive; it certainly is in more up-to-date versions (e.g. Cloudera Data Platform).
  • If you delete data directly on the filesystem (e.g. via a Spark job or via the hdfs CLI), there is no way for the Hive/Impala engine to know that it happened unless you explicitly tell it. You can do that by launching "MSCK REPAIR TABLE [...]" on Hive or "ALTER TABLE tablename DROP PARTITION [...]" on either Hive or Impala.
    Well actually, if you are using Spark, you could rely on Spark SQL and issue a drop partition statement (see the summary at the end of this post).
  • Impala relies on the Hive metastore but caches metadata. If you make changes to the metadata via Hive, you'll have to launch an "INVALIDATE METADATA" on Impala in order to refresh the cache.
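
To make the purge behavior concrete, a minimal sketch (the table name is the example one used later in this post, and whether the property is honored depends on your Hive version, as noted above):

-- Mark the external table so that drops also delete the underlying files
ALTER TABLE my_table_partitioned_by_year SET TBLPROPERTIES ('external.table.purge'='true');
-- From now on, dropping a partition should remove its data from the filesystem too
ALTER TABLE my_table_partitioned_by_year DROP IF EXISTS PARTITION (year = 2019);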

These are key points to be taken into account.

 

Commenting on your last post:

If you have a lot of partitions, there are a couple of ways to reduce the effort of launching many drop partition statements:

  1. of course, you can script it (or you could develop a Spark job, or you could maybe come up with some other automation strategy)
  2. if the partition field allows for it, you could drop a range of partitions with a single statement, e.g. something like "ALTER TABLE my_table_partitioned_by_year DROP PARTITION (year < 2020)". You can do this from Impala if you prefer, so that you won't have to refresh the Impala cache (see the sketch below).

This will drop partitions but won't ever drop the table.
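
For example, the range drop from point 2 could be run from impala-shell like this (a minimal sketch; SHOW PARTITIONS is only there to verify the result):

-- One statement drops every partition matching the range predicate
ALTER TABLE my_table_partitioned_by_year DROP IF EXISTS PARTITION (year < 2020);
-- No INVALIDATE METADATA needed afterwards: Impala itself applied the change
SHOW PARTITIONS my_table_partitioned_by_year;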

Now, summarizing everything we've shared so far, you have three possible ways to go:

  • Do it via Impala by using the drop partition SQL statement.
  • Delete the data directly on the filesystem and afterwards tell Impala to drop the partitions (drop partition statements in Impala, or MSCK REPAIR on Hive + INVALIDATE METADATA on Impala).
  • Use a Spark job and issue a drop partition statement via Spark SQL + INVALIDATE METADATA on Impala, since the Spark job would act directly on the Hive metastore, out of Impala's line of sight (see the sketch below).
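
A minimal sketch of that third option (the spark.sql("...") call is simply how a Spark job would submit the statement; using an exact partition spec in the Spark SQL drop is an assumption, since range predicates there depend on your Spark version):

-- Issued from the Spark job, e.g. spark.sql("..."):
ALTER TABLE my_table_partitioned_by_year DROP IF EXISTS PARTITION (year = 2019);

-- Then from impala-shell, since the change happened behind Impala's back:
INVALIDATE METADATA my_table_partitioned_by_year;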

 

Hope this helps

Explorer

Hi @amallegni ,

 

Sorry for the delay in answering.
And thank you for your response!