Created 11-19-2015 09:40 PM
I have hundreds of thousands of small data blocks (< 64MB) that I'd like to turn into a more manageable number of larger blocks, say, 128MB or 256MB. This is CSV data. How can I do this with a distributed job, and can it be done "in place", i.e., without temporarily doubling the space requirement?
Created 11-19-2015 10:19 PM
There is no effective way to change block size "in place". The concept of block size is tightly tied to the on-disk layout of block files at DataNodes, so it's non-trivial to change this.
As far as running a distributed job to do this, it's possible to use distcp with an override of the block size on the command line (see the example below). This does, however, temporarily double the storage consumed.
> hadoop distcp -D dfs.blocksize=268435456 /input /output

> hdfs dfs -stat 'name=%n blocksize=%o' /input/hello
name=hello blocksize=134217728

> hdfs dfs -stat 'name=%n blocksize=%o' /output/hello
name=hello blocksize=268435456
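To spot-check the result at scale rather than one file at a time, the same -stat format string can be looped over a sample of files under the destination. This is a minimal sketch assuming the /output path from the example above; it only checks the first 20 files it finds, since running one stat per file over hundreds of thousands of files would be slow.

> hdfs dfs -ls -R /output | awk '$1 ~ /^-/ {print $NF}' | head -20 | \
    while read f; do hdfs dfs -stat 'name=%n blocksize=%o' "$f"; done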
Created 11-20-2015 01:07 AM
So truly in-place is impossible, but it sounds like if the data were partitioned, one could run distcp on one partition at a time, deleting each original partition after it is copied. Thanks man.
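A minimal sketch of that partition-at-a-time approach, assuming the source is laid out as one subdirectory per partition under /input and the rewritten copy goes to /output on the same cluster (both paths are placeholders, not from this thread). Each partition is copied with the larger block size, sanity-checked by file count, and only then removed, so the peak extra space is roughly one partition rather than the whole dataset.

#!/bin/bash
# Rewrite one partition at a time with a 256 MB block size, then delete the source.
# Paths are placeholders; adjust for your layout. Abort on any failure.
set -e
for part in $(hdfs dfs -ls /input | awk '$1 ~ /^d/ {print $NF}'); do
    name=$(basename "$part")
    # Copy this partition's files, overriding the block size at write time.
    hadoop distcp -D dfs.blocksize=268435456 "$part" "/output/$name"
    # Sanity check: same number of files on both sides before deleting anything.
    src_count=$(hdfs dfs -count "$part" | awk '{print $2}')
    dst_count=$(hdfs dfs -count "/output/$name" | awk '{print $2}')
    [ "$src_count" -eq "$dst_count" ]
    # Remove the original partition; -skipTrash avoids holding a trash copy.
    hdfs dfs -rm -r -skipTrash "$part"
done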