
DELETE rows in a table: how is the HDFS file size impacted?

Explorer

Hello.

Before deleting rows in a specific table (463,462 rows in the table), the HDFS file size is:

$ hadoop fs -du -s -h /apps/hive/warehouse/prd_thmil.db/th_mil_fb_code_value_brut
54.2 M 162.6 M /apps/hive/warehouse/prd_thmil.db/th_mil_fb_code_value_brut

54.2 MB is the size of the data itself, and each block is stored 3 times (replication factor 3, i.e. the original plus 2 replicas), so 162.6 MB is the total disk space used, which is OK.

But after deleting more than 450,000 rows in the table (12,890 rows remaining after the DELETE), the file size didn't change at all.

Is this normal? When new rows are added to the table, will the file size stay the same, with HDFS 'overwriting' the older data with the new rows?

Regards

3 REPLIES

Expert Contributor

A DELETE query does not rewrite the existing files. Instead, the ROW__ID of each deleted row is written to a new delete_delta folder, and read queries apply those ROW__IDs to the existing files to exclude the deleted rows.

Triggering a major compaction on the table will rewrite the data into new files, merging the delta and delete_delta folders.
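
For illustration, a major compaction can be requested from Hive roughly as follows. This is only a sketch: the database and table names are assumptions inferred from the warehouse path in the question, and it assumes the table is a transactional (ACID) table that is not partitioned. For a partitioned table, the statement would also need a PARTITION clause.

```sql
-- Assumption: names inferred from the path
-- /apps/hive/warehouse/prd_thmil.db/th_mil_fb_code_value_brut
USE prd_thmil;

-- Queue a major compaction; the request is processed asynchronously
ALTER TABLE th_mil_fb_code_value_brut COMPACT 'major';

-- Check the state of the compaction request
SHOW COMPACTIONS;
```

The space is not reclaimed immediately. Once the compaction has finished and the obsolete files have been cleaned up, rerunning the hadoop fs -du -s -h command from the question should report a smaller size.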


Explorer

Thank you, guys, for your answers.