Support Questions

Find answers, ask questions, and share your expertise
Announcements
Celebrating as our community reaches 100,000 members! Thank you!

How can one change block size for large existing HDFS data set?

avatar
Rising Star

I have hundreds of thousands of small data blocks (< 64MB) that I'd like to turn into a more manageable number of larger blocks, say, 128MB or 256MB. This is CSV data. How can I do this with a distributed job, and can it be done "in place", i.e., without temporarily doubling the space requirement?

1 ACCEPTED SOLUTION

avatar

There is no effective way to change block size "in place". The concept of block size is tightly tied to the on-disk layout of block files at DataNodes, so it's non-trivial to change this.

As far as running a distributed job to do this, it's possible to use distcp with an override of the block size on the command line. (See example below.) This does however cause a temporary doubling of the storage consumed.

> hadoop distcp -D dfs.blocksize=268435456 /input /output



> hdfs dfs -stat 'name=%n blocksize=%o' /input/hello

name=hello blocksize=134217728



> hdfs dfs -stat 'name=%n blocksize=%o' /output/hello

name=hello blocksize=268435456

View solution in original post

2 REPLIES 2

avatar

There is no effective way to change block size "in place". The concept of block size is tightly tied to the on-disk layout of block files at DataNodes, so it's non-trivial to change this.

As far as running a distributed job to do this, it's possible to use distcp with an override of the block size on the command line. (See example below.) This does however cause a temporary doubling of the storage consumed.

> hadoop distcp -D dfs.blocksize=268435456 /input /output



> hdfs dfs -stat 'name=%n blocksize=%o' /input/hello

name=hello blocksize=134217728



> hdfs dfs -stat 'name=%n blocksize=%o' /output/hello

name=hello blocksize=268435456

avatar
Rising Star

So truly in-place is impossible---but it sounds like if the data were partitioned one could execute the distcp on one partition at a time, deleting each original partition after it is copied. Thanks man.