<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>question Re: How to &quot;defragment&quot; hdfs data? in Archives of Support Questions (Read Only)</title>
    <link>https://community.cloudera.com/t5/Archives-of-Support-Questions/How-to-quot-defragment-quot-hdfs-data/m-p/175187#M73068</link>
    <description>&lt;P&gt;&lt;A rel="user" href="https://community.cloudera.com/users/2593/zackriesland.html" nodeid="2593"&gt;@Zack Riesland&lt;/A&gt;&lt;/P&gt;&lt;P&gt;You can create a separate staging table holding only the current day's partition, and merge/consolidate the small files into it.&lt;/P&gt;&lt;P&gt;Then run an exchange partition into the main table.&lt;/P&gt;&lt;P&gt;That way you do not have to touch the rest of the data in the main table, and it gives you a clean path to the result.&lt;/P&gt;&lt;P&gt;Wrap the commands in a shell script with boundary checks (end of day, reprocessing, etc.) and you will have an airtight solution.&lt;/P&gt;&lt;P&gt;Thanks,&lt;/P&gt;&lt;P&gt;Raj&lt;/P&gt;</description>
    <pubDate>Thu, 04 Jan 2018 21:37:51 GMT</pubDate>
    <dc:creator>rbiswas1</dc:creator>
    <dc:date>2018-01-04T21:37:51Z</dc:date>
    <item>
      <title>How to "defragment" hdfs data?</title>
      <link>https://community.cloudera.com/t5/Archives-of-Support-Questions/How-to-quot-defragment-quot-hdfs-data/m-p/175186#M73067</link>
      <description>&lt;P&gt;Suppose a scenario with a Hive table that is partitioned by day ("day=2017-12-12").&lt;/P&gt;&lt;P&gt;Suppose some process pushes data to the file store behind this table (new data under "day=2017-12-12" and "day=2017-12-13", etc).&lt;/P&gt;&lt;P&gt;The "msck repair table" command updates the metastore to recognize all the new "chunks", and the data correctly shows up in queries.&lt;/P&gt;&lt;P&gt;But suppose these chunks are mostly very small - is there a simple command to consolidate these?&lt;/P&gt;&lt;P&gt;So instead of 100 small files under a partition, I get 2 well-sized ones, etc.&lt;/P&gt;&lt;P&gt;I recognize that I can create a copy of the table and accomplish this, but that seems pretty clumsy.&lt;/P&gt;&lt;P&gt;Is there some kind of hdfs command to "defrag" the data?&lt;/P&gt;&lt;P&gt;FWIW, I'm using EMR with data in S3.&lt;/P&gt;&lt;P&gt;Thanks in advance.&lt;/P&gt;</description>
      <pubDate>Thu, 04 Jan 2018 03:59:06 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Archives-of-Support-Questions/How-to-quot-defragment-quot-hdfs-data/m-p/175186#M73067</guid>
      <dc:creator>zack_riesland</dc:creator>
      <dc:date>2018-01-04T03:59:06Z</dc:date>
    </item>
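The workflow described in the question, for a day-partitioned external table backed by S3, can be sketched in HiveQL. The table name `events` and its layout are illustrative assumptions, not details from the thread:

```sql
-- An upstream process writes new files under partition-style paths, e.g.
--   s3://bucket/events/day=2017-12-13/part-00000 ...
-- then MSCK REPAIR TABLE tells the metastore to discover the new
-- partition directories so they become queryable:
MSCK REPAIR TABLE events;

-- The newly landed "chunks" now show up in queries:
SELECT COUNT(*) FROM events WHERE day = '2017-12-13';
```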
    <item>
      <title>Re: How to "defragment" hdfs data?</title>
      <link>https://community.cloudera.com/t5/Archives-of-Support-Questions/How-to-quot-defragment-quot-hdfs-data/m-p/175187#M73068</link>
      <description>&lt;P&gt;&lt;A rel="user" href="https://community.cloudera.com/users/2593/zackriesland.html" nodeid="2593"&gt;@Zack Riesland&lt;/A&gt;&lt;/P&gt;&lt;P&gt;You can create a separate staging table holding only the current day's partition, and merge/consolidate the small files into it.&lt;/P&gt;&lt;P&gt;Then run an exchange partition into the main table.&lt;/P&gt;&lt;P&gt;That way you do not have to touch the rest of the data in the main table, and it gives you a clean path to the result.&lt;/P&gt;&lt;P&gt;Wrap the commands in a shell script with boundary checks (end of day, reprocessing, etc.) and you will have an airtight solution.&lt;/P&gt;&lt;P&gt;Thanks,&lt;/P&gt;&lt;P&gt;Raj&lt;/P&gt;</description>
      <pubDate>Thu, 04 Jan 2018 21:37:51 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Archives-of-Support-Questions/How-to-quot-defragment-quot-hdfs-data/m-p/175187#M73068</guid>
      <dc:creator>rbiswas1</dc:creator>
      <dc:date>2018-01-04T21:37:51Z</dc:date>
    </item>
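The staging-table-plus-exchange approach in the answer above can be sketched in HiveQL. All names (`events`, `events_stg`, the partition column `day`, the column list) and the merge thresholds are illustrative assumptions, not from the thread:

```sql
-- Staging table with the same schema and partitioning as the main table.
CREATE TABLE events_stg LIKE events;

-- Ask Hive to merge small output files into well-sized ones on write.
SET hive.merge.mapfiles=true;
SET hive.merge.mapredfiles=true;
SET hive.merge.smallfiles.avgsize=256000000;  -- ~256 MB target

-- Rewrite the fragmented day into the staging table; this consolidates
-- the many small files into a few large ones.
INSERT OVERWRITE TABLE events_stg PARTITION (day = '2017-12-12')
SELECT col1, col2          -- enumerate the non-partition columns
FROM events
WHERE day = '2017-12-12';

-- Swap the consolidated partition into the main table. EXCHANGE PARTITION
-- is a metadata-level move; the destination partition must not already
-- exist, so drop the fragmented one first.
ALTER TABLE events DROP PARTITION (day = '2017-12-12');
ALTER TABLE events EXCHANGE PARTITION (day = '2017-12-12') WITH TABLE events_stg;
```

The shell-script wrapper the answer mentions would run these statements (e.g. via `hive -f`) once the day has closed, with checks for end-of-day boundaries and reprocessing.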
    <item>
      <title>Re: How to "defragment" hdfs data?</title>
      <link>https://community.cloudera.com/t5/Archives-of-Support-Questions/How-to-quot-defragment-quot-hdfs-data/m-p/175188#M73069</link>
      <description>&lt;P&gt;Thanks for the feedback&lt;/P&gt;</description>
      <pubDate>Mon, 08 Jan 2018 22:51:47 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Archives-of-Support-Questions/How-to-quot-defragment-quot-hdfs-data/m-p/175188#M73069</guid>
      <dc:creator>zack_riesland</dc:creator>
      <dc:date>2018-01-08T22:51:47Z</dc:date>
    </item>
  </channel>
</rss>