Created 10-18-2023 04:50 PM
I ran into an interesting situation using the Impala external table. In short, I used "create external table" statement but ended up with a table like a managed one. Here are details.
Step 1: creating an external table
created external table testdb1.table1 ( fld1 STRING, fld2 STRING ) PARTITIONED BY ( loaddate INT ) STORED AS PARQUET tblproperties('parquet.compress'='SNAPPY','transactional'='false');
Step 2: adding partitions and loading data files.
alter table testdb1.table1 add if not exists partition (loaddate=20231018);
load data inpath '/mytestdata/dir1' into table testdb1.table1 partition (loaddate=20231018);
Step 2 shows that this table1 behaves exactly like a managed table. Files at /mytestdata/dir1 are moved to hdfs warehouse path warehouse/tablespace/external/hive/testdb1.db/table1/loaddate=20231018 path. If I drop this partition 20231018, the directory at warehouse/tablespace/external/hive/testdb1.db/table1/loaddate=20231018 is removed.
So what exactly is the difference between a managed vs external partitioned table, except for the different storage location /warehouse/tablespace/managed vs /warehouse/tablespace/external? From what I read, the key difference is that a managed table's storage is managed by hive/impala, but an external table is not. In my case, even this table1 is created as an external table, its storage is still managed by impala/hive.
As I understand, if I add a partition (to an external table) and then add files using "load data inpath", then the storage is managed by hive. If I add a partition with the location specified, like
Created on 10-20-2023 11:53 AM - edited 10-20-2023 11:55 AM
I understand you created an EXTERNAL table without defining LOCATION, so it is an external table stored in default 'warehouse/tablespace/external/hive' location. The problem is that, when you drop a partition for the external table, you are seeing the partition also leave HDFS location.
SOLUTION
This should not be happening. As per documentation, if you have an external table, then partitions and data in HDFS should not be deleted if you drop the partition or drop the table. The partition and table will be dropped from Impala, but the underlying HDFS files will remain. For a MANAGED table, both Impala and HDFS data will be deleted. This is expected behavior.
Below documentation should answer all your questions about differences between EXTERNAL and MANAGED tables:
- Refer to documentation [1]:
- Also refer to this documentation:
You should double check the "show create table" output of your table to ensure it is an EXTERNAL table. Also keep in mind these points:
1) create table by default creates MANAGED table.
2) adding EXTERNAL to a create table creates an EXTERNAL table.
3) also adding transaction=false automatically creates an EXTERNAL table.
REPRODUCTION
Because of the discrepancy in your description, I reproduced this situation internally. I don't see your behavior, but instead see the exact same behavior as described in documentation.
1) I create an external table:
a) create external table table1_external ( fld1 STRING, fld2 STRING ) PARTITIONED BY ( loaddate INT ) STORED AS PARQUET tblproperties('parquet.compress'='SNAPPY','transactional'='false');
2) I create partition in this external table:
a) alter table table1_external add if not exists partition (loaddate=20231018);
3) I load data into this new partition:
a) load data inpath '/tmp/delta_1_1/' into table table1_external partition (loaddate=20231018);
4) I check that I have partition and a new file loaded in it, within Impala and HDFS:
a) show partitions table1_external;
loaddate #Rows #Files Size Bytes Cached Cache Replication Format Incremental stats Location 20231018 -1 1 649B NOT CACHED NOT CACHED PARQUET false .../warehouse/tablespace/external/hive/table1_external/loaddate=20231018 Total -1 1 649B 0B
b) # hdfs dfs -ls /warehouse/tablespace/external/hive/table1_external/loaddate=20231018 Found 1 items -rwxrwxrwx 3 hdfs supergroup 649 2023-10-20 17:33 /warehouse/tablespace/external/hive/table1_external/loaddate=20231018/8e469210ba3cea3c-162bfe3c00000000_1156574353_data.0.parq
5) I drop the new partition:
a) alter table table1_external drop partition (loaddate=20231018);
6) I check that I have partition and a new file loaded in it, within Impala and HDFS. It DOES NOT exist in Impala, but it DOES still exist in HDFS:
a) show partitions table1_external; loaddate #Rows #Files Size Bytes Cached Cache Replication Format Incremental stats Location Total -1 0 0B 0B
b) # hdfs dfs -ls /warehouse/tablespace/external/hive/table1_external/loaddate=20231018 Found 1 items -rwxrwxrwx 3 hdfs supergroup 649 2023-10-20 17:33 /warehouse/tablespace/external/hive/table1_external/loaddate=20231018/8e469210ba3cea3c-162bfe3c00000000_1156574353_data.0.parq
7) I even drop the entire table in impala, and hdfs files still exist:
a) drop table table1_external;
b) show partitions table1_external; Table does not exist: default.table1_external
c) # hdfs dfs -ls /warehouse/tablespace/external/hive/table1_external/loaddate=20231018 Found 1 items -rwxrwxrwx 3 hdfs supergroup 649 2023-10-20 17:33 /warehouse/tablespace/external/hive/table1_external/loaddate=20231018/8e469210ba3cea3c-162bfe3c00000000_1156574353_data.0.parq
😎 I can even bring the external table data back by creating table and adding partition again:
a) create external table table1_external ( fld1 STRING, fld2 STRING ) PARTITIONED BY ( loaddate INT ) STORED AS PARQUET tblproperties('parquet.compress'='SNAPPY','transactional'='false');
b) select * from table1_external; Done. 0 results.
c) alter table table1_external add if not exists partition (loaddate=20231018);
d) select * from table1_external; fld1 fld2 loaddate testfld1 testfld2 20231018
9) Finally, testing this on an INTERNAL/MANAGED table instead shows when I drop the table, the HDFS data is also gone. I am unable to drop a single partition because my INTERNAL table was transactional=true.
REFERENCES
_____________________
Internal and external tables (EXTERNAL and LOCATION clauses): By default, Impala creates an internal table, where Impala manages the underlying data files for the table, and physically deletes the data files when you drop the table. If you specify the EXTERNAL clause, Impala treats the table as an external table, where the data files are typically produced outside Impala and queried from their original locations in HDFS, and Impala leaves the data files in place when you drop the table. Typically, for an external table you include a LOCATION clause to specify the path to the HDFS directory where Impala reads and writes files for the table. For example, if your data pipeline produces Parquet files in the HDFS directory /user/etl/destination, you might create an external table as follows: CREATE EXTERNAL TABLE external_parquet (c1 INT, c2 STRING, c3 TIMESTAMP) STORED AS PARQUET LOCATION '/user/etl/destination'; Although the EXTERNAL and LOCATION clauses are often specified together, LOCATION is optional for external tables, and you can also specify LOCATION for internal tables. The difference is all about whether Impala takes control of the underlying data files and moves them when you rename the table, or deletes them when you drop the table. For more about internal and external tables and how they interact with the LOCATION attribute, see Overview of Impala tables.
_____________________
Created on 10-20-2023 11:53 AM - edited 10-20-2023 11:55 AM
I understand you created an EXTERNAL table without defining LOCATION, so it is an external table stored in default 'warehouse/tablespace/external/hive' location. The problem is that, when you drop a partition for the external table, you are seeing the partition also leave HDFS location.
SOLUTION
This should not be happening. As per documentation, if you have an external table, then partitions and data in HDFS should not be deleted if you drop the partition or drop the table. The partition and table will be dropped from Impala, but the underlying HDFS files will remain. For a MANAGED table, both Impala and HDFS data will be deleted. This is expected behavior.
Below documentation should answer all your questions about differences between EXTERNAL and MANAGED tables:
- Refer to documentation [1]:
- Also refer to this documentation:
You should double check the "show create table" output of your table to ensure it is an EXTERNAL table. Also keep in mind these points:
1) create table by default creates MANAGED table.
2) adding EXTERNAL to a create table creates an EXTERNAL table.
3) also adding transaction=false automatically creates an EXTERNAL table.
REPRODUCTION
Because of the discrepancy in your description, I reproduced this situation internally. I don't see your behavior, but instead see the exact same behavior as described in documentation.
1) I create an external table:
a) create external table table1_external ( fld1 STRING, fld2 STRING ) PARTITIONED BY ( loaddate INT ) STORED AS PARQUET tblproperties('parquet.compress'='SNAPPY','transactional'='false');
2) I create partition in this external table:
a) alter table table1_external add if not exists partition (loaddate=20231018);
3) I load data into this new partition:
a) load data inpath '/tmp/delta_1_1/' into table table1_external partition (loaddate=20231018);
4) I check that I have partition and a new file loaded in it, within Impala and HDFS:
a) show partitions table1_external;
loaddate #Rows #Files Size Bytes Cached Cache Replication Format Incremental stats Location 20231018 -1 1 649B NOT CACHED NOT CACHED PARQUET false .../warehouse/tablespace/external/hive/table1_external/loaddate=20231018 Total -1 1 649B 0B
b) # hdfs dfs -ls /warehouse/tablespace/external/hive/table1_external/loaddate=20231018 Found 1 items -rwxrwxrwx 3 hdfs supergroup 649 2023-10-20 17:33 /warehouse/tablespace/external/hive/table1_external/loaddate=20231018/8e469210ba3cea3c-162bfe3c00000000_1156574353_data.0.parq
5) I drop the new partition:
a) alter table table1_external drop partition (loaddate=20231018);
6) I check that I have partition and a new file loaded in it, within Impala and HDFS. It DOES NOT exist in Impala, but it DOES still exist in HDFS:
a) show partitions table1_external; loaddate #Rows #Files Size Bytes Cached Cache Replication Format Incremental stats Location Total -1 0 0B 0B
b) # hdfs dfs -ls /warehouse/tablespace/external/hive/table1_external/loaddate=20231018 Found 1 items -rwxrwxrwx 3 hdfs supergroup 649 2023-10-20 17:33 /warehouse/tablespace/external/hive/table1_external/loaddate=20231018/8e469210ba3cea3c-162bfe3c00000000_1156574353_data.0.parq
7) I even drop the entire table in impala, and hdfs files still exist:
a) drop table table1_external;
b) show partitions table1_external; Table does not exist: default.table1_external
c) # hdfs dfs -ls /warehouse/tablespace/external/hive/table1_external/loaddate=20231018 Found 1 items -rwxrwxrwx 3 hdfs supergroup 649 2023-10-20 17:33 /warehouse/tablespace/external/hive/table1_external/loaddate=20231018/8e469210ba3cea3c-162bfe3c00000000_1156574353_data.0.parq
😎 I can even bring the external table data back by creating table and adding partition again:
a) create external table table1_external ( fld1 STRING, fld2 STRING ) PARTITIONED BY ( loaddate INT ) STORED AS PARQUET tblproperties('parquet.compress'='SNAPPY','transactional'='false');
b) select * from table1_external; Done. 0 results.
c) alter table table1_external add if not exists partition (loaddate=20231018);
d) select * from table1_external; fld1 fld2 loaddate testfld1 testfld2 20231018
9) Finally, testing this on an INTERNAL/MANAGED table instead shows when I drop the table, the HDFS data is also gone. I am unable to drop a single partition because my INTERNAL table was transactional=true.
REFERENCES
_____________________
Internal and external tables (EXTERNAL and LOCATION clauses): By default, Impala creates an internal table, where Impala manages the underlying data files for the table, and physically deletes the data files when you drop the table. If you specify the EXTERNAL clause, Impala treats the table as an external table, where the data files are typically produced outside Impala and queried from their original locations in HDFS, and Impala leaves the data files in place when you drop the table. Typically, for an external table you include a LOCATION clause to specify the path to the HDFS directory where Impala reads and writes files for the table. For example, if your data pipeline produces Parquet files in the HDFS directory /user/etl/destination, you might create an external table as follows: CREATE EXTERNAL TABLE external_parquet (c1 INT, c2 STRING, c3 TIMESTAMP) STORED AS PARQUET LOCATION '/user/etl/destination'; Although the EXTERNAL and LOCATION clauses are often specified together, LOCATION is optional for external tables, and you can also specify LOCATION for internal tables. The difference is all about whether Impala takes control of the underlying data files and moves them when you rename the table, or deletes them when you drop the table. For more about internal and external tables and how they interact with the LOCATION attribute, see Overview of Impala tables.
_____________________
Created 10-23-2023 03:43 PM
Ezerihun,
Thanks for your reply. I repeated my test, which showed that you are correct. I was not sure what happened to my test case previously. When I dropped an external table, the warehouse path for that table "warehouse/tablespace/external/hive/testdb1.db/table1" remains. Actually, I can even re-create that external table again without any error, and files loaded to "warehouse/tablespace/external/hive/testdb1.db/table1" can be read through the re-created table. In other words, although Impala created this path "warehouse/tablespace/external/hive/testdb1.db/table1", Impala does not manage it at all.
Thank you.
Created 10-24-2023 07:29 AM
Great! Yes, your new explanation matches what is expected in documentation and my personal reproduction. I am glad it is all sorted out now.