Created on 05-20-2020 10:33 AM - edited 05-20-2020 10:41 AM
Hello guys!
I hope your day is going well.
Something really weird (at least for me) is happening when I drop a table in Hive.
1.- A table called "Test" is created by a sqoop import job using "--create-hcatalog-table". See the full job below.
Created 05-22-2020 12:09 PM
Hi Heri,
After you execute
drop table test purge;
can you check that the data is actually deleted? Do a query on the table first, but also check with hdfs dfs to see if the underlying files have been deleted from Hadoop (they should be). Let me know what you see.
You may be right that for an EXTERNAL table only the metadata gets deleted, not the data. That's why I'm asking you to check for data with hdfs dfs. To drop an EXTERNAL table completely (both metadata and data), you'd need to follow the steps here: https://docs.cloudera.com/HDPDocuments/HDP3/HDP-3.1.4/using-hiveql/content/hive_drop_external_table_...
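To confirm on the filesystem side, a check along these lines can help (the path below is a placeholder; use the actual Location shown by "describe formatted <table>"):

```shell
# Placeholder path -- substitute the Location from "describe formatted"
hdfs dfs -ls /data/<database>/test
# After a successful DROP TABLE ... PURGE on a managed (or purge-enabled)
# table, this should report that the directory no longer exists
```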
Hope that helps.
Created 05-26-2020 06:32 AM
Thank you for responding!
We've verified and the underlying files are not being deleted from Hadoop.
Created 05-26-2020 11:59 AM
Ok, so to be able to purge the table created by sqoop, since it is external, you'll need to add this to your stanza:
--hcatalog-storage-stanza 'stored as parquet TBLPROPERTIES("external.table.purge"="true")'
Then, when you load the data for the first time, purging will be enabled on that table. Running the drop table ... purge command you already have will then remove both the metadata and the data of the external table.
Let me know if that works and if the solution is acceptable.
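For reference, here's roughly how that stanza fits into a full sqoop invocation (all connection details and source names below are placeholders, not taken from this thread):

```shell
sqoop import \
  --connect 'jdbc:oracle:thin:@//dbhost:1521/ORCL' \
  --username myuser \
  --password-file /user/myuser/.sqoop.pw \
  --table SOURCE_TABLE \
  --hcatalog-database dataengsandbox \
  --hcatalog-table test \
  --create-hcatalog-table \
  --hcatalog-storage-stanza 'stored as parquet TBLPROPERTIES("external.table.purge"="true")'
```

The stanza text is passed verbatim into the CREATE TABLE statement that sqoop generates, which is why table properties can be set this way at import time.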
Created 05-26-2020 01:35 PM
Thanks for your answer! This is a good learning experience for me.
I didn't know we could set a table's properties when creating the table.
Unfortunately, the table data still persists after the drop table command.
table name: test
When we create the table test, it is stored in: /data/<database>/test
When we use a different database where the issue is not present, the table is stored in /data/<database>/hive/test
Could it be a database configuration issue?
Created 05-26-2020 02:16 PM
The issue is that the DROP TABLE statement doesn't seem to remove the data from HDFS. This is usually caused by the table being an external table that doesn't allow Hive to perform all operations on it.
Another thing you can try is what's suggested in this thread (i.e., before you drop the table, change its property to EXTERNAL=FALSE). Does that work for you?
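Spelled out, that sequence would look like this (table name taken from this thread; the property value is conventionally written in upper case):

```sql
-- Convert the external table to managed so Hive owns the data
ALTER TABLE test SET TBLPROPERTIES('EXTERNAL'='FALSE');
-- Dropping it should then remove both the metadata and the HDFS files
DROP TABLE test PURGE;
```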
Created 05-27-2020 07:01 AM
Thanks,
I gave it a try, but the data from before the drop still persists.
select count(*) from dataengsandbox.test;
ALTER TABLE test SET TBLPROPERTIES('EXTERNAL'='False');
drop table test purge;
select count(*) from dataengsandbox.test; -> returned twice the number of records.
describe formatted test;
| Location:               | hdfs://nameservice1/data/DataEngSandbox/test |
| Table Type:             | MANAGED_TABLE |
| Table Parameters:       | |
|   COLUMN_STATS_ACCURATE | false |
|   external.table.purge  | true |
|   numFiles              | 1 |
|   numRows               | -1 |
|   rawDataSize           | -1 |
|   totalSize             | 19098747 |
|   transient_lastDdlTime | 1590587689 |
| # Storage Information   | |
| SerDe Library:          | org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe |
| InputFormat:            | org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat |
| OutputFormat:           | org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat |
| Compressed:             | No |
| Num Buckets:            | -1 |
| Bucket Columns:         | [] |
| Sort Columns:           | [] |
| Storage Desc Params:    | |
Created 05-27-2020 09:34 AM
Hmmm... I'm stumped. Perhaps @Shu_ashu can help?
Created 05-28-2020 08:46 PM
Hi @Heri,
I'd just like to add a few points here. You can use the PURGE option to delete the data files along with the partition metadata, but it only works on internal/managed tables:
ALTER TABLE table_name DROP [IF EXISTS] PARTITION partition_spec PURGE;
External tables, however, require a two-step process: drop the partition in Hive, then remove the files yourself:
ALTER TABLE table_name DROP [IF EXISTS] PARTITION partition_spec;
hdfs dfs -rm -r <partition file path>
I hope this gives some insight. cc @aakulov
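As a concrete (purely hypothetical) illustration, for a table partitioned by a dt column it would look like this in Hive:

```sql
-- Drop the partition metadata (hypothetical partition spec)
ALTER TABLE test DROP IF EXISTS PARTITION (dt='2020-05-01');
```

followed by removing the files from the shell (path is a placeholder; check the table's actual Location first):

```shell
hdfs dfs -rm -r /data/<database>/test/dt=2020-05-01
```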
Thanks for your answer!
That solution might work, but we've also run a drop table in a different database, same job, and it didn't require deleting the files from HDFS manually. At this point we suspect that something is wrong in the database configuration.
We've requested our DBA team to re-create the database, because they didn't follow the same procedure as for the database that works properly.
Thank you very much for the help. It's really appreciated.