
What factors warrant going to a higher hdfs block size (dfs.blocksize) than the default 128MB?

Solved


Cloudera Employee

Also, have we recommended a higher block size to any customers? If so, what observations led to those recommendations?

1 ACCEPTED SOLUTION


Re: What factors warrant going to a higher hdfs block size (dfs.blocksize) than the default 128MB?

@snukavarapu It depends on how big the files you are loading into HDFS are. If the files are very large, a bigger block size provides higher throughput. Bigger blocks also mean fewer blocks in HDFS, which reduces the load on the NameNode.

You can also specify the block size for particular files: hadoop fs -Ddfs.blocksize=268435456 -put local_name remote_location (268435456 bytes = 256 MB). Note that dfs.blocksize is the property that controls the HDFS block size; fs.local.block.size applies to the local filesystem.
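To make the NameNode-load point concrete, here is a back-of-the-envelope sketch (plain Python, not Hadoop-specific). The ~150 bytes of NameNode heap per block object is a commonly cited rule of thumb, not an exact figure:

```python
import math

MB = 1024 ** 2
GB = 1024 ** 3

def block_count(file_size_bytes, block_size_bytes):
    """Number of HDFS blocks a single file occupies (at least one)."""
    return max(1, math.ceil(file_size_bytes / block_size_bytes))

# 10,000 files of 1 GB each, stored with two different block sizes.
files = [1 * GB] * 10_000

for bs in (128 * MB, 256 * MB):
    blocks = sum(block_count(size, bs) for size in files)
    # Rule-of-thumb assumption: ~150 bytes of NameNode heap per block object.
    heap_mb = blocks * 150 / MB
    print(f"block size {bs // MB} MB -> {blocks:,} blocks, ~{heap_mb:.1f} MB NameNode heap")
```

Doubling the block size halves the number of block objects the NameNode has to track for the same data.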


3 REPLIES


Re: What factors warrant going to a higher hdfs block size (dfs.blocksize) than the default 128MB?

Cloudera Employee

@Andrew Watson - Thank you for the quick response.

Unless a customer is dealing with a special type of data, is it safe to assume that 1) 128 MB is the optimal value for dfs.blocksize, and 2) we don't see this value being changed often?


Re: What factors warrant going to a higher hdfs block size (dfs.blocksize) than the default 128MB?

Smaller blocks take up more space in the NameNode's tables, so in a large cluster, small blocks come at a price.

What smaller block sizes can do is allow more workers to get at the data (half the block size == twice the potential parallelism), but it also means that code that works with more than 128 MB of data isn't going to find all that data local to one machine, so more network traffic may occur. And for apps that spin up fast, you may find that 128 MB blocks are streamed through quickly enough that the overhead of scheduling containers and starting up the JVMs outweighs the extra bandwidth opportunities.
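A rough way to see that startup-overhead tradeoff is to compare the time a task spends reading one block against the fixed cost of scheduling a container and starting a JVM. The numbers below (100 MB/s scan rate, 5 s startup cost) are illustrative assumptions, not measurements:

```python
MB = 1024 ** 2

SCAN_RATE_BPS = 100 * MB      # assumed bytes/second one task can read
STARTUP_OVERHEAD_S = 5.0      # assumed container-scheduling + JVM startup cost

def task_efficiency(block_size_bytes):
    """Fraction of a task's lifetime spent doing useful reading."""
    read_time = block_size_bytes / SCAN_RATE_BPS
    return read_time / (read_time + STARTUP_OVERHEAD_S)

for bs_mb in (64, 128, 256, 512):
    print(f"{bs_mb:>4} MB blocks: {task_efficiency(bs_mb * MB):.0%} of task time is useful work")
```

Under these assumptions, larger blocks amortize the per-task startup cost over more data, at the price of fewer tasks running in parallel.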

So the notion of "optimal size" isn't really so clear cut. If you've got a big cluster and you are running out of NN heap space, you're going to want to have a bigger block size whether or not your code likes it. Otherwise, it may depend on your data and the uses made of it.

As an experiment, try saving copies of the same data with different block sizes, then see which is faster to query.
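One way to run that experiment (a sketch only; the paths, the jar name, and the 256 MB value are placeholders, and this assumes a working hadoop client against a real cluster):

```shell
# Write the same dataset twice with different block sizes (paths are placeholders).
hadoop fs -Ddfs.blocksize=134217728 -put data.csv /bench/bs128/data.csv   # 128 MB
hadoop fs -Ddfs.blocksize=268435456 -put data.csv /bench/bs256/data.csv   # 256 MB

# Confirm how many blocks each copy actually occupies.
hdfs fsck /bench/bs128/data.csv -files -blocks
hdfs fsck /bench/bs256/data.csv -files -blocks

# Time the same query (a placeholder scan job here) against each copy.
time hadoop jar my-query.jar /bench/bs128/data.csv
time hadoop jar my-query.jar /bench/bs256/data.csv
```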
