
How to "defragment" hdfs data?


Super Collaborator

Suppose a scenario with a Hive table that is partitioned by day (e.g., "day=2017-12-12").

Suppose some process pushes data to the file store behind this table (new data under "day=2017-12-12", "day=2017-12-13", etc.).

The MSCK REPAIR TABLE command updates the metastore to recognize all the new "chunks", and the data correctly shows up in queries.
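For reference, the metastore sync in question looks like this (table name is hypothetical):

```sql
-- Scan the table's storage location and register any partitions
-- that exist on disk/S3 but are missing from the metastore.
MSCK REPAIR TABLE events;
```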

But suppose these chunks are mostly very small - is there a simple command to consolidate these?

So instead of 100 small files under a partition, I get 2 well-sized ones, etc.

I recognize that I can create a copy of the table and accomplish this, but that seems pretty clumsy.

Is there some kind of hdfs command to "defrag" the data?

FWIW, I'm using EMR with data in S3.

Thanks in advance.

1 ACCEPTED SOLUTION

Accepted Solutions

Re: How to "defragment" hdfs data?

@Zack Riesland

You can create a separate staging table holding only the current day's partition, and merge/consolidate the small files into it.

Then run an exchange partition to swap the consolidated partition into the main table.

That way, you never have to touch the rest of the data in the main table, which gives you a clean way to do the consolidation.

Wrap the commands in a shell script with boundary checks (end-of-day cutoff, reprocessing, etc.) and you will have an airtight solution.
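A minimal sketch of the approach, assuming a main table `events` with hypothetical columns `id` and `payload` (exact merge settings and thresholds depend on your Hive version and file format):

```sql
-- Staging table with the same schema and layout as the main table.
CREATE TABLE events_staging LIKE events;

-- Ask Hive to merge small output files when rewriting the partition.
SET hive.merge.mapfiles=true;
SET hive.merge.mapredfiles=true;

-- Rewrite the day's many small files into a few well-sized ones.
INSERT OVERWRITE TABLE events_staging PARTITION (day='2017-12-12')
SELECT id, payload FROM events WHERE day='2017-12-12';

-- Swap the consolidated partition into the main table.
-- Note: the target partition must not already exist in events,
-- so drop the old (fragmented) partition first.
ALTER TABLE events DROP PARTITION (day='2017-12-12');
ALTER TABLE events EXCHANGE PARTITION (day='2017-12-12')
  WITH TABLE events_staging;
```

Because EXCHANGE PARTITION is a metadata-level move of the partition's files, the swap itself is fast; only the INSERT OVERWRITE rewrites data.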

Thanks,

Raj


Re: How to "defragment" hdfs data?

Super Collaborator

Thanks for the feedback