
What factors warrant going to a higher HDFS block size (dfs.blocksize) than the default 128 MB?

Contributor

Also, have we recommended a higher block size to any customers? If so, what observations led to those recommendations?

1 ACCEPTED SOLUTION


@snukavarapu It depends on how big the files you are loading into HDFS are. If the files are very big, a bigger block size will give you higher throughput. Bigger blocks also mean fewer blocks in HDFS, which reduces the load on the namenode.

You can also specify the block size for particular files at write time: hadoop fs -D dfs.blocksize=134217728 -put local_name remote_location
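If it's useful, here's a rough sketch of the same idea with placeholder names (local_file and /remote_dir are just examples, not anything from this thread). Recent Hadoop releases also accept size suffixes for dfs.blocksize, and you can confirm the block size a file was actually written with using -stat:

hadoop fs -D dfs.blocksize=256m -put local_file /remote_dir/
hadoop fs -stat "block size: %o bytes" /remote_dir/local_file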


3 REPLIES


Contributor

@Andrew Watson - Thank you for the quick response.

Unless a customer is dealing with a special type of data, is it safe to assume that 1) 128 MB is the optimal value for dfs.blocksize, and 2) we won't see this value changed very often?


Smaller blocks take up more space in the namenode tables, so in a large cluster, small blocks come at a price.
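To make that concrete with a rough back-of-the-envelope (the 100 TB figure is invented, and ~150 bytes of namenode heap per block object is only the commonly quoted ballpark): 100 TB stored as 128 MB blocks is roughly 800,000 block objects (around 120 MB of heap at that ballpark), while 256 MB blocks cut that to roughly 400,000 (around 60 MB), halving what the namenode has to keep in memory for that data. You can see how many blocks a path currently uses with fsck:

hdfs fsck /path/to/data -blocks | grep -i 'total blocks'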

What small block sizes can do is allow more workers to get at the data (half the block size == twice the bandwidth), but it also means that code working with more than 128 MB of data isn't going to get all of it local to one machine, so more network traffic may occur. And for apps that spin up fast, you may find that 128 MB blocks are streamed through quickly enough that the overhead of scheduling containers and starting up the JVMs outweighs the extra bandwidth opportunities.

So the notion of "optimal size" isn't really so clear cut. If you've got a big cluster and you are running out of NN heap space, you're going to want to have a bigger block size whether or not your code likes it. Otherwise, it may depend on your data and the uses made of it.
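If you want to see what your cluster-wide default currently is before deciding whether to raise it, a quick generic check (nothing here is specific to any one install) is getconf, which prints the effective dfs.blocksize in bytes, 134217728 being the stock 128 MB default:

hdfs getconf -confKey dfs.blocksize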

As an experiment, try saving copies of the same data with different block sizes, then see which is faster to query.
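Something along these lines, as a hedged sketch (the paths, file name, and sizes are placeholders, and a single-client cat is only a crude read check, so for the real comparison you'd point your actual query engine at each copy):

hadoop fs -mkdir -p /bench/blk128 /bench/blk512
hadoop fs -D dfs.blocksize=128m -put data.csv /bench/blk128/
hadoop fs -D dfs.blocksize=512m -put data.csv /bench/blk512/
hdfs fsck /bench/blk128 -blocks | grep -i 'total blocks'
hdfs fsck /bench/blk512 -blocks | grep -i 'total blocks'
time hadoop fs -cat /bench/blk128/data.csv > /dev/null
time hadoop fs -cat /bench/blk512/data.csv > /dev/null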