How to "defragment" HDFS data?

Suppose a scenario with a Hive table that is partitioned by day ("day=2017-12-12").

Suppose some process pushes data to the file store behind this table (new data under "day=2017-12-12" and "day=2017-12-13", etc).

The "msck repair table" command updates the metastore to recognize all the new "chunks", and the data correctly shows up in queries.

But suppose these chunks are mostly very small - is there a simple command to consolidate these?

So instead of 100 small files under a partition, I get 2 well-sized ones, etc.

I recognize that I can create a copy of the table and accomplish this, but that seems pretty clumsy.

Is there some kind of HDFS command to "defrag" the data?

FWIW, I'm using EMR with data in S3.

Thanks in advance.

1 ACCEPTED SOLUTION

@Zack Riesland

You can create a separate staging table that holds only the current day's partition, and merge the small files into it as you write that partition.

Then run ALTER TABLE ... EXCHANGE PARTITION to move the consolidated partition into the main table.

That way you only rewrite a single partition instead of touching all the data in the main table, which keeps the process clean.
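As a rough sketch of the sequence described above, the following Python helper only builds the HiveQL statements for one day; it does not run them. The table names `events` and `events_compact` and the function name are hypothetical, and table names are treated as trusted input. The sketch assumes the table is partitioned by a single `day` column and is a managed table. Note that EXCHANGE PARTITION requires that the destination table not already contain the partition, so the sketch drops it from the main table first; with an external table, or with data in S3 as in the question, dropping a partition and moving directories may behave differently, so treat this as untested there and try it on a copy first.

```python
from datetime import date


def build_compaction_statements(main_table, staging_table, day):
    """Return the HiveQL statements that compact one day partition."""
    # Parsing rejects anything that is not a valid date before the value
    # is interpolated into SQL text.
    day = date.fromisoformat(day).isoformat()
    spec = f"day='{day}'"
    return [
        # Ask Hive to merge small output files at the end of the job.
        "SET hive.merge.mapfiles=true",
        "SET hive.merge.mapredfiles=true",
        "SET hive.merge.tezfiles=true",
        # Let the partition value come from the last column of the SELECT.
        "SET hive.exec.dynamic.partition=true",
        "SET hive.exec.dynamic.partition.mode=nonstrict",
        # Staging table with the same schema and partitioning as the main table.
        f"CREATE TABLE IF NOT EXISTS {staging_table} LIKE {main_table}",
        # Rewrite only this day's rows; SELECT * returns the partition column last.
        f"INSERT OVERWRITE TABLE {staging_table} PARTITION (day) "
        f"SELECT * FROM {main_table} WHERE {spec}",
        # The destination must not already hold the partition being exchanged.
        f"ALTER TABLE {main_table} DROP IF EXISTS PARTITION ({spec})",
        # Move the consolidated partition from the staging table into the main table.
        f"ALTER TABLE {main_table} EXCHANGE PARTITION ({spec}) "
        f"WITH TABLE {staging_table}",
    ]


if __name__ == "__main__":
    for stmt in build_compaction_statements("events", "events_compact", "2017-12-12"):
        print(stmt + ";")
```

The printed statements could be saved to a file and passed to `hive -f`. Between the DROP and the EXCHANGE the partition is briefly missing from the main table, so queries running at that moment would not see that day's data.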

Put these commands in a shell script and add boundary checks, such as waiting until the end of the day before compacting a partition and handling reprocessing of a day that was already consolidated, and you will have a robust solution.
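One possible shape for the boundary checks mentioned above, written in Python rather than shell for illustration: pick only days that have already ended, and skip days recorded as compacted. The function name, the `already_compacted` set, and the lookback window are assumptions made for this sketch; how you record which days are done (a marker file, a small table, etc.) is up to you.

```python
from datetime import date, timedelta


def days_to_compact(today, already_compacted, lookback_days=7):
    """Pick closed day partitions that still need compaction."""
    candidates = []
    # Offsets start at 1, so the current day is never selected:
    # it may still be receiving new files.
    for offset in range(lookback_days, 0, -1):
        day = (today - timedelta(days=offset)).isoformat()
        # Skip days that were already consolidated.
        if day not in already_compacted:
            candidates.append(day)
    return candidates


if __name__ == "__main__":
    done = {"2017-12-11"}
    # Prints ['2017-12-12', '2017-12-13']
    print(days_to_compact(date(2017, 12, 14), done, lookback_days=3))
```

To reprocess a day that received late data, you would remove it from the compacted set so that the next run picks it up again.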

Thanks,

Raj

Thanks for the feedback