<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>question Re: How can one change block size for large existing HDFS data set? in Support Questions</title>
    <link>https://community.cloudera.com/t5/Support-Questions/How-can-one-change-block-size-for-large-existing-HDFS-data/m-p/97339#M60557</link>
    <description>&lt;P&gt;So a truly in-place conversion is impossible, but it sounds like, if the data were partitioned, one could run distcp on one partition at a time, deleting each original partition after it is copied (sketched below). Thanks man.&lt;/P&gt;</description>
    <pubDate>Fri, 20 Nov 2015 09:07:00 GMT</pubDate>
    <dc:creator>pcoates</dc:creator>
    <dc:date>2015-11-20T09:07:00Z</dc:date>
    <item>
      <title>How can one change block size for large existing HDFS data set?</title>
      <link>https://community.cloudera.com/t5/Support-Questions/How-can-one-change-block-size-for-large-existing-HDFS-data/m-p/97337#M60555</link>
      <description>&lt;P&gt;I have hundreds of thousands of small data blocks (&amp;lt; 64 MB) that I'd like to turn into a more manageable number of larger blocks, say, 128 MB or 256 MB. This is CSV data. How can I do this with a distributed job, and can it be done "in place", i.e., without temporarily doubling the space requirement?&lt;/P&gt;</description>
      <pubDate>Fri, 20 Nov 2015 05:40:37 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Support-Questions/How-can-one-change-block-size-for-large-existing-HDFS-data/m-p/97337#M60555</guid>
      <dc:creator>pcoates</dc:creator>
      <dc:date>2015-11-20T05:40:37Z</dc:date>
    </item>
    <item>
      <title>Re: How can one change block size for large existing HDFS data set?</title>
      <link>https://community.cloudera.com/t5/Support-Questions/How-can-one-change-block-size-for-large-existing-HDFS-data/m-p/97338#M60556</link>
      <description>&lt;P&gt;There is no effective way to change the block size "in place". The block size is tightly tied to the on-disk layout of block files at the DataNodes, so changing it for existing data is non-trivial.&lt;/P&gt;&lt;P&gt;As for running a distributed job to do this, you can use distcp with an override of the block size on the command line; the example below sets dfs.blocksize to 268435456 (256 MB). This does, however, cause a temporary doubling of the storage consumed.&lt;/P&gt;&lt;PRE&gt;&amp;gt; hadoop distcp -D dfs.blocksize=268435456 /input /output
&amp;gt; hdfs dfs -stat 'name=%n blocksize=%o' /input/hello
name=hello blocksize=134217728
&amp;gt; hdfs dfs -stat 'name=%n blocksize=%o' /output/hello
name=hello blocksize=268435456&lt;/PRE&gt;
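&lt;P&gt;Since the copy temporarily doubles the space used by the data set, it is worth confirming the cluster has the headroom before starting. One way to sanity-check (the /input path is just an example):&lt;/P&gt;&lt;PRE&gt;&amp;gt; hdfs dfs -du -s -h /input   # space the extra copy will need
&amp;gt; hdfs dfsadmin -report       # shows DFS remaining across the cluster&lt;/PRE&gt;</description>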
      <pubDate>Fri, 20 Nov 2015 06:19:22 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Support-Questions/How-can-one-change-block-size-for-large-existing-HDFS-data/m-p/97338#M60556</guid>
      <dc:creator>cnauroth</dc:creator>
      <dc:date>2015-11-20T06:19:22Z</dc:date>
    </item>
    <item>
      <title>Re: How can one change block size for large existing HDFS data set?</title>
      <link>https://community.cloudera.com/t5/Support-Questions/How-can-one-change-block-size-for-large-existing-HDFS-data/m-p/97339#M60557</link>
      <description>&lt;P&gt;So a truly in-place conversion is impossible, but it sounds like, if the data were partitioned, one could run distcp on one partition at a time, deleting each original partition after it is copied (sketched below). Thanks man.&lt;/P&gt;
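&lt;P&gt;Rough sketch of that loop, assuming date-named partition directories under /data; the paths, partition names, and the 256 MB target are placeholders, not anything from this cluster:&lt;/P&gt;&lt;PRE&gt;# One partition at a time: recopy with the larger block size, then delete
# the original only after the copy succeeds, so at most one partition is
# ever stored twice.
SRC=/data
DST=/data_256m
for name in 2015-11-18 2015-11-19 2015-11-20; do
  if hadoop distcp -D dfs.blocksize=268435456 "$SRC/$name" "$DST/$name"; then
    hdfs dfs -rm -r "$SRC/$name"
  fi
done&lt;/PRE&gt;</description>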
      <pubDate>Fri, 20 Nov 2015 09:07:00 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Support-Questions/How-can-one-change-block-size-for-large-existing-HDFS-data/m-p/97339#M60557</guid>
      <dc:creator>pcoates</dc:creator>
      <dc:date>2015-11-20T09:07:00Z</dc:date>
    </item>
  </channel>
</rss>

