Created 11-19-2015 09:40 PM
I have hundreds of thousands of small data blocks (< 64MB) that I'd like to turn into a more manageable number of larger blocks, say, 128MB or 256MB. This is CSV data. How can I do this with a distributed job, and can it be done "in place", i.e., without temporarily doubling the space requirement?
Created 11-19-2015 10:19 PM
There is no effective way to change block size "in place". The concept of block size is tightly tied to the on-disk layout of block files at DataNodes, so it's non-trivial to change this.
As far as running a distributed job to do this, it's possible to use distcp with an override of the block size on the command line (see the example below). This does, however, temporarily double the storage consumed.
> hadoop distcp -D dfs.blocksize=268435456 /input /output

> hdfs dfs -stat 'name=%n blocksize=%o' /input/hello
name=hello blocksize=134217728

> hdfs dfs -stat 'name=%n blocksize=%o' /output/hello
name=hello blocksize=268435456
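To spot-check the result at scale rather than one file at a time, the same -stat format string can be looped over a sample of files under the destination. This is a minimal sketch assuming the /output path from the example above; it only checks the first 20 files it finds, since running one stat per file over hundreds of thousands of files would be slow.

> hdfs dfs -ls -R /output | awk '$1 ~ /^-/ {print $NF}' | head -20 | \
    while read f; do hdfs dfs -stat 'name=%n blocksize=%o' "$f"; done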
Created 11-20-2015 01:07 AM
So truly in-place is impossible, but it sounds like if the data were partitioned, one could run distcp on one partition at a time, deleting each original partition after it is copied. Thanks man.
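A minimal sketch of that partition-at-a-time approach, assuming the source is laid out as one subdirectory per partition under /input and the rewritten copy goes to /output on the same cluster (both paths are placeholders, not from this thread). Each partition is copied with the larger block size, sanity-checked by file count, and only then removed, so the peak extra space is roughly one partition rather than the whole dataset.

#!/bin/bash
# Rewrite one partition at a time with a 256 MB block size, then delete the source.
# Paths are placeholders; adjust for your layout. Abort on any failure.
set -e
for part in $(hdfs dfs -ls /input | awk '$1 ~ /^d/ {print $NF}'); do
    name=$(basename "$part")
    # Copy this partition's files, overriding the block size at write time.
    hadoop distcp -D dfs.blocksize=268435456 "$part" "/output/$name"
    # Sanity check: same number of files on both sides before deleting anything.
    src_count=$(hdfs dfs -count "$part" | awk '{print $2}')
    dst_count=$(hdfs dfs -count "/output/$name" | awk '{print $2}')
    [ "$src_count" -eq "$dst_count" ]
    # Remove the original partition; -skipTrash avoids holding a trash copy.
    hdfs dfs -rm -r -skipTrash "$part"
done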