How to change the block size of existing files in HDFS?

Contributor

Hi,

I tried searching the community but could not find a proper answer to this question.

How can I change the block size for the existing files in HDFS? I want to increase the block size.

The solution I see is distcp: use distcp to copy the files, folders, and subfolders to a temporary location with the new block size, then remove the originals whose block size has to be increased, and copy the files from the temporary location back to the original location.
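
For reference, a sketch of what that distcp-based approach might look like (the paths and the 256 MB value are placeholders; the -p flags deliberately leave out b, since b would preserve the old block size, and the staged copy is moved back rather than copied a second time):

$ hadoop distcp -D dfs.blocksize=268435456 -pugp /data/mydir /tmp/mydir_256m
$ hdfs dfs -rm -r /data/mydir
$ hdfs dfs -mv /tmp/mydir_256m /data/mydir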

The above methodology has side effects, such as the overhead on HDFS of temporarily holding duplicate copies of the files, and possible changes in permissions while copying the files back from the temporary location.

Is there a more efficient way to replace the existing files in place, with the same names and the same permissions, but with an increased block size?

Thanks to all for your time on this question.

1 ACCEPTED SOLUTION

Master Mentor

@Sriram Hadoop

Once you have changed the block size at the cluster level, any files you subsequently put or copy to HDFS will have the new default block size of 256 MB.
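
As a quick sketch, one way to confirm the effective default block size on a client is the getconf command (the value shown here is just the 256 MB example):

$ hdfs getconf -confKey dfs.blocksize
268435456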

Unfortunately, apart from distcp, you only have the usual -put and -get HDFS commands.

My default block size is 128 MB; see the attached screenshot 128MB.JPG.

Created a file test_128MB.txt

$ vi test_128MB.txt 

Uploaded the 128 MB file to HDFS:

$ hdfs dfs -put test_128MB.txt /user/sheltong

See the attached screenshot 128MB.JPG and notice the block size.
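
If the screenshot is not handy, a quick command-line check of the block size would be something like this (a sketch using the -stat format options):

$ hdfs dfs -stat "block size: %o, length: %b" /user/sheltong/test_128MB.txt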

I then copied the same file back to the local filesystem:

$ hdfs dfs -get /user/sheltong/test_128MB.txt /tmp/test_128MB_2.txt

Then, using the -D option to define a new block size of 256 MB:

$ hdfs dfs -D dfs.blocksize=268435456 -put /tmp/test_128MB_2.txt /user/sheltong

See the screenshot 256MB.JPG. Technically it is possible if you only have a few files, but you should remember that test_128MB.txt and test_128MB_2.txt are the same 128 MB file, so changing the block size of an existing file will try to fit a 128 MB block into a 256 MB block, leading to wastage of the other 128 MB of space. Hence the reason the new block size will ONLY apply to new files.
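
Without the screenshots, a sketch of how to inspect the resulting block layout (fsck needs to be run by a user with access to the path):

$ hdfs fsck /user/sheltong/test_128MB_2.txt -files -blocks

This should report the block count and size for the file written with the 256 MB block size.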

Hope that gives you a better understanding


Attachments: 128mb.jpg, 256mb.jpg


14 REPLIES

New Contributor

Agreed, but is there a way to avoid this wastage, apart from migrating data to the local filesystem and then back to HDFS?

Example: We have a 500 MB file with a block size of 128 MB, i.e. 4 blocks on HDFS. Now that we have changed the block size to 256 MB, how would we make the file on HDFS have 2 blocks of 256 MB instead of 4?
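
One approach we are wondering about (a sketch only; the path is a placeholder, the -p flag for -cp depends on the Hadoop version, and the result is worth verifying afterwards with hdfs fsck -files -blocks) is to rewrite the file within HDFS with the larger block size and then swap it in:

$ hdfs dfs -D dfs.blocksize=268435456 -cp -p /data/file_500mb /data/file_500mb.tmp
$ hdfs dfs -rm /data/file_500mb
$ hdfs dfs -mv /data/file_500mb.tmp /data/file_500mb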

Please suggest. 

Contributor

@Geoffrey,

Thanks a lot for your time.

Yes, you are correct, and I am looking for a tool other than distcp.

Thanks a lot for your time on this again.

Master Mentor

@Sriram Hadoop

Nice to know it has answered your question.

Could you accept the answer I gave by clicking the Accept button below? That would be a great help to community users looking to find the solution quickly for these kinds of issues.

Contributor
@Felix Albani

This is in regard to changing the block size of an existing file from 64 MB to 128 MB.

We are facing some issues when we delete the files.

So, is there a way to change the block size of an existing file without removing it?

Explorer

@Geoffrey Shelton Okot

Thanks for the details.

I have a cluster with 220 million files, of which 110 million are less than 1 MB in size.

Default block size is set to 128 MB.

What should the block size be for files less than 1 MB? And how can we set it on a live cluster?

Total Files + Directories: 227008030

Disk Remaining: 700 TB / 3.5 PB (20%)
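
For reference, a way to compare a small file's configured block size with its actual length and disk usage (a sketch; the path is a placeholder):

$ hdfs dfs -stat "block size: %o, length: %b" /data/small_file
$ hdfs dfs -du -h /data/small_file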