Created 09-14-2014 10:57 PM
The HDFS block size in my system is set to 128m. Does that mean that if I put 8 files, each less than 128m, into HDFS, they would occupy 3G of disk space (replication factor = 3)?
When I use "hadoop fs -count", it only shows the size of the files. How can I find out the actual space a file occupies in HDFS?
And what if I use HAR to archive these 8 files? Can that save some space?
Created on 09-15-2014 12:12 AM - edited 09-15-2014 12:12 AM
> The HDFS block size in my system is set to 128m. Does that mean that
> if I put 8 files, each less than 128m, into HDFS, they would occupy
> 3G of disk space (replication factor = 3)?
Yes, this is right. HDFS blocks are not shared among files.
> How can I find out the actual space a file occupies in HDFS?
The -ls command tells you this. In the example below, the jar file is
3922 bytes long.
# sudo -u hdfs hadoop fs -ls /user/oozie/share/lib/sqoop/hive-builtins-0.10.0-cdh4.7.0.jar
-rw-r--r-- 3 oozie oozie 3922 2014-09-14 06:17 /user/oozie/share/lib/sqoop/hive-builtins-0.10.0-cdh4.7.0.jar
> And what if I use HAR to archive these 8 files? Can that save some
> space?
Using HAR is a good idea. More ideas about dealing with the small files
problem can be found in this post:
http://blog.cloudera.com/blog/2009/02/the-small-files-problem/
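To create an archive, the hadoop archive tool works roughly like this (just a sketch; the paths and archive name below are placeholders):
hadoop archive -archiveName files.har -p /user/foo/smallfiles /user/foo/archived
hadoop fs -ls har:///user/foo/archived/files.har
The first command packs everything under /user/foo/smallfiles into /user/foo/archived/files.har, and the second lists the archived files through the har:// filesystem.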
Created 09-15-2014 01:25 AM
Thanks for your reply.
The -ls command tells me the size of the file, but what I want to know is the occupied disk space. The jar file is 3922 bytes long, but according to your first answer it actually occupies one HDFS block (128M). Is that right?
Is there any way I can check the actual occupied space?
Created 09-15-2014 01:44 AM
A block on the file system isn't a fixed-size file with padding; rather, it is just a unit of storage. A block can be at most 128MB (or as configured), so if a file is smaller, it will just occupy the minimum needed space.
In my previous response, I said 8 small files would take up 3GB of space. This is incorrect. The space taken up on the cluster is still just the file size times 3 (the replication factor). Regardless of file size, you can divide the size by the block size (default 128M) and round up to the next whole number; this gives you the number of blocks. So in this case, the 3922-byte file uses one block to store its contents, but that block consumes only 3922 bytes (times 3 for replication) on disk.
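To check the actual occupied space of a file, fsck reports its block usage (shown here on the jar from the earlier example):
# sudo -u hdfs hadoop fsck /user/oozie/share/lib/sqoop/hive-builtins-0.10.0-cdh4.7.0.jar -files -blocks
The output includes the file's total size and block count (here, 3922 B in 1 block); multiply the size by the replication factor to get the raw disk usage.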
Created 09-15-2014 02:34 AM
Thanks so much for resolving my long-standing confusion!
I know that HAR leads to smaller metadata; however, I still do not understand how HAR can save disk space.
Eight 1m files would occupy eight HDFS blocks holding 1m each, so the disk space used is 24m with replication. HAR combines these files into an 8m har file occupying one 8m block, but the disk space used is still 24m. Or is some kind of compression used in HAR?
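One way to check this (reusing the placeholder archive paths from the sketch above) is to list the har directory and run fsck on its part file, which holds the concatenated file data:
hadoop fs -ls /user/foo/archived/files.har
hadoop fsck /user/foo/archived/files.har/part-0 -files -blocks
A har is a directory containing _index, _masterindex, and part-* files; the part file's size should be roughly the sum of the original files' sizes.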
Created 09-15-2014 02:41 AM
Created 09-15-2014 02:43 AM
Ok, thanks for your patient help
Created 09-15-2014 01:32 AM
If I use HAR to archive these 8 files, would they be placed into one HDFS block (assuming they are each less than 1m)?
If that is true, I could save 7/8 of the disk space in this case.
Created 09-15-2014 01:48 AM