
Change the block size in an existing cluster

Explorer

Hi,

I've set up a Hadoop cluster and it already contains data. The default block size is 128 MB, and I now want to change it to 256 MB. My question is: how do I get the existing dataset rewritten from 128 MB blocks to 256 MB blocks?

1 ACCEPTED SOLUTION

Master Mentor

@Saravana V

First, it helps to understand the benefit of a larger block size. A 128 MB HDFS block is written to disk sequentially, so there is a fair chance the data ends up in contiguous space on disk, i.e. written next to each other in a continuous fashion. When data is laid out contiguously, the number of disk seeks during a read is reduced, which makes reads more efficient. That is why the HDFS block size is large compared to other file systems.

There is no effective way to change block size "in place". The concept of block size is tightly tied to the on-disk layout of block files at DataNodes, so it's non-trivial to change this.

When you change the block size from one value to another, only files ingested or created in HDFS after the change are written with the new block size. The old files remain at their previous block size and are not changed. If you need to change them as well, manual intervention (rewriting the files) is required.
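
If you want to verify this, hdfs dfs -stat with the %o format prints a file's block size in bytes; a quick sketch with hypothetical paths (the values in the comments are what you would expect to see, not captured output):

$ hdfs dfs -stat "%o %n" /path/to/file-written-before-the-change   # 134217728 (128 MB)
$ hdfs dfs -stat "%o %n" /path/to/file-written-after-the-change    # 268435456 (256 MB)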

Hope that helps


4 REPLIES


Explorer

@Geoffrey Shelton Okot Thanks for your comment, it helps. Could you please share the manual steps for changing the block size of a dataset already stored in an existing cluster, if you have them handy? It would be really helpful for practice.

Master Mentor

@Saravana V

To change the block size, set the parameter dfs.blocksize (dfs.block.size in older releases) to the required value, e.g. 256 MB, in the hdfs-site.xml file; the default in Hadoop 2.0 is 128 MB. Once this is changed through the Ambari UI (the ONLY recommended way on an Ambari-managed cluster), a restart is required for the change to take effect, and it will apply only to new files.
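
For reference, the underlying property in hdfs-site.xml would look something like the snippet below (268435456 bytes = 256 MB); this is only a sketch, and after the restart you can confirm the effective default with hdfs getconf:

<property>
  <name>dfs.blocksize</name>
  <value>268435456</value>
</property>

$ hdfs getconf -confKey dfs.blocksize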

Change this setting and restart all services with stale configurations (see the attached 256.JPG).

Created a directory for the test, if it does not already exist.
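
The exact command is not shown here; it would be something like the following, assuming the /user/sheltong/test path used in the next step:

$ hdfs dfs -mkdir -p /user/sheltong/test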

Copied a file, specifying a 256 MB block size (see the attached new_file256.JPG):

$ hdfs dfs -D dfs.blocksize=268435456 -put /tmp/ambari.properties.4 /user/sheltong/test 

Copied the new file into the same directory that already holds files written with 128 MB blocks (see the attached new2_file256.JPG):

$ hdfs dfs -D dfs.blocksize=268435456 -put /tmp/ambari.properties.4 /user/sheltong 
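
Since hdfs dfs -stat "%o" prints each file's block size, a quick way to confirm the mix of block sizes in that directory is a sketch like this (the values in the comment are what you would expect, not captured output):

$ hdfs dfs -stat "%o %n" /user/sheltong/*   # 268435456 for the newly written file, 134217728 for the older ones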

DistCp (distributed copy) is a tool used for large inter-/intra-cluster copying. It uses MapReduce for its distribution, error handling, recovery, and reporting.

To replace the old 128 MB files, copy the data to a new location with the new block size (DistCp's -overwrite option can be added if the destination already contains files). You then have to manually delete the old files, which keep the older block size. Command:

$ hadoop distcp -Ddfs.block.size=268435456 /path/to/data /path/to/data-with-largeblocks

Here /path/to/data is the source and /path/to/data-with-largeblocks is the destination.
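
Once the new copy has been verified, one possible way to finish the swap is to remove the old data and move the new copy into the original location; a sketch with the same placeholder paths (the delete is irreversible, so check the copy first):

$ hdfs dfs -rm -r -skipTrash /path/to/data
$ hdfs dfs -mv /path/to/data-with-largeblocks /path/to/data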

Now the question becomes: should I use 128 MB or 256 MB blocks for my dataset, or even more? It all depends on your cluster capacity and the size of your datasets. Let's say you have a dataset which is 2 petabytes in size. A 64 MB block size for this dataset results in more than 31 million blocks (2 PB / 64 MB ≈ 31 million), which puts stress on the NameNode, since it has to manage all of those blocks. Having a lot of blocks also results in a lot of mappers during MapReduce execution. So, in this case, you may decide to increase the block size just for that dataset.
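
If you want to see how many blocks a dataset currently occupies before making that call, hdfs fsck prints a block summary at the end of its report; a sketch with a placeholder path:

$ hdfs fsck /path/to/dataset   # the summary includes a 'Total blocks (validated)' count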

Hope that helps



Explorer

@Geoffrey Shelton Okot

What you have mentioned works for a specific path.

Do you have a procedure to change the block size across the whole existing cluster, including all of the old files?

Please post it here if you are aware of one!