<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>Re: small files problem in Archives of Support Questions (Read Only)</title>
    <link>https://community.cloudera.com/t5/Archives-of-Support-Questions/small-files-problem/m-p/18746#M2926</link>
    <description>Cloudera Community thread on the HDFS small files problem: how small files consume blocks and disk space, and whether HAR archives help.</description>
    <pubDate>Mon, 15 Sep 2014 07:12:39 GMT</pubDate>
    <dc:creator>GautamG</dc:creator>
    <dc:date>2014-09-15T07:12:39Z</dc:date>
    <item>
      <title>small files problem</title>
      <link>https://community.cloudera.com/t5/Archives-of-Support-Questions/small-files-problem/m-p/18744#M2925</link>
      <description>&lt;P&gt;The HDFS block size on my system is set to 128 MB. Does that mean that if I put 8 files, each smaller than 128 MB, into HDFS, they would occupy 3 GB of disk space (replication factor = 3)?&lt;/P&gt;&lt;P&gt;When I use "hadoop fs -count", it only shows the size of the files. How can I find the actual disk space occupied by an HDFS file?&lt;/P&gt;&lt;P&gt;And what if I use HAR to archive these 8 files? Can that save some space?&lt;/P&gt;</description>
      <pubDate>Mon, 15 Sep 2014 05:57:10 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Archives-of-Support-Questions/small-files-problem/m-p/18744#M2925</guid>
      <dc:creator>sky88088</dc:creator>
      <dc:date>2014-09-15T05:57:10Z</dc:date>
    </item>
    <item>
      <title>Re: small files problem</title>
      <link>https://community.cloudera.com/t5/Archives-of-Support-Questions/small-files-problem/m-p/18746#M2926</link>
      <description>&lt;P&gt;&amp;gt; The HDFS block size in my system is set to be 128m. Does it mean that&lt;BR /&gt;&amp;gt; if I put 8 files less than 128m to HDFS, they would occupy 3G disk&lt;BR /&gt;&amp;gt; space (replication factor = 3) ?&lt;BR /&gt;&lt;BR /&gt;Yes, this is right. HDFS blocks are not shared among files.&lt;BR /&gt;&lt;BR /&gt;&amp;gt; How could I know the actual occupied space of HDFS file ?&lt;BR /&gt;&lt;BR /&gt;The -ls command tells you this. In the example below, the jar file is&lt;BR /&gt;3922 bytes long.&lt;BR /&gt;&lt;BR /&gt;# sudo -u hdfs hadoop fs -ls /user/oozie/share/lib/sqoop/hive-builtins-0.10.0-cdh4.7.0.jar&lt;BR /&gt;-rw-r--r--&amp;nbsp;&amp;nbsp; 3 oozie oozie&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; 3922 2014-09-14 06:17 /user/oozie/share/lib/sqoop/hive-builtins-0.10.0-cdh4.7.0.jar&lt;BR /&gt;&lt;BR /&gt;&amp;gt; And how about I use HAR to archive these 8 files ? Can it save some&lt;BR /&gt;&amp;gt; space ?&lt;BR /&gt;&lt;BR /&gt;Using HAR is a good idea. More ideas about dealing with the small files&lt;BR /&gt;problem are in this link:&lt;BR /&gt;&lt;A href="http://blog.cloudera.com/blog/2009/02/the-small-files-problem/" target="_blank"&gt;http://blog.cloudera.com/blog/2009/02/the-small-files-problem/&lt;/A&gt;&lt;/P&gt;</description>
      <pubDate>Mon, 15 Sep 2014 07:12:39 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Archives-of-Support-Questions/small-files-problem/m-p/18746#M2926</guid>
      <dc:creator>GautamG</dc:creator>
      <dc:date>2014-09-15T07:12:39Z</dc:date>
    </item>
    <item>
      <title>Re: small files problem</title>
      <link>https://community.cloudera.com/t5/Archives-of-Support-Questions/small-files-problem/m-p/18748#M2927</link>
      <description>&lt;P&gt;Thanks for your reply.&lt;/P&gt;&lt;P&gt;The -ls command tells me the size of the file, but what I want to know is the occupied disk space. The jar file is 3922 bytes long, but it actually occupies one HDFS block (128 MB) according to your first answer. Is that right?&lt;/P&gt;&lt;P&gt;Is there any way I can check the actual occupied space?&lt;/P&gt;</description>
      <pubDate>Mon, 15 Sep 2014 08:25:18 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Archives-of-Support-Questions/small-files-problem/m-p/18748#M2927</guid>
      <dc:creator>sky88088</dc:creator>
      <dc:date>2014-09-15T08:25:18Z</dc:date>
    </item>
    <item>
      <title>Re: small files problem</title>
      <link>https://community.cloudera.com/t5/Archives-of-Support-Questions/small-files-problem/m-p/18750#M2928</link>
      <description>&lt;P&gt;If I use HAR to archive these 8 files, would they be placed into one HDFS block (assuming they are each less than 1 MB)?&lt;/P&gt;&lt;P&gt;If so, I could save 7/8 of the disk space in this case.&lt;/P&gt;</description>
      <pubDate>Mon, 15 Sep 2014 08:32:17 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Archives-of-Support-Questions/small-files-problem/m-p/18750#M2928</guid>
      <dc:creator>sky88088</dc:creator>
      <dc:date>2014-09-15T08:32:17Z</dc:date>
    </item>
    <item>
      <title>Re: small files problem</title>
      <link>https://community.cloudera.com/t5/Archives-of-Support-Questions/small-files-problem/m-p/18752#M2929</link>
      <description>&lt;P&gt;A block on the file system isn't a fixed-size file with padding; rather, it is just a unit of storage. A block can hold at most 128 MB (or as configured), so a smaller file occupies only the space it actually needs.&lt;/P&gt;&lt;P&gt;In my previous response, I said 8 small files would take up 3 GB of space. This is incorrect. The space taken up on the cluster is still just the file size, times 3 for replication. Regardless of file size, you can divide the size by the block size (default 128 MB) and round up to the next whole number; this gives you the number of blocks. So in this case, the 3922-byte file uses one block to store its contents.&lt;/P&gt;</description>
      <pubDate>Mon, 15 Sep 2014 08:44:36 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Archives-of-Support-Questions/small-files-problem/m-p/18752#M2929</guid>
      <dc:creator>GautamG</dc:creator>
      <dc:date>2014-09-15T08:44:36Z</dc:date>
    </item>
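    <item>
      <title>Editor's note: block arithmetic sketch</title>
      <description>The block arithmetic described above can be sketched in Python. This is an illustrative aside, not part of the original thread; the 128 MB block size and replication factor of 3 are the values discussed here.

```python
import math

def hdfs_usage(file_size_bytes, block_size=128 * 1024 * 1024, replication=3):
    """Return (block_count, raw_disk_bytes) for one HDFS file.

    A block is a unit of storage, not a fixed-size padded file: a small
    file occupies only the space it needs, replicated 'replication' times.
    """
    blocks = math.ceil(file_size_bytes / block_size)
    raw_disk_bytes = file_size_bytes * replication
    return blocks, raw_disk_bytes

# The 3922-byte jar from the -ls example in the thread:
# one block, 3 x 3922 bytes on disk.
print(hdfs_usage(3922))   # (1, 11766)
```

So a 3922-byte file costs one block entry in metadata but only about 12 KB of replicated disk space.</description>
    </item>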
    <item>
      <title>Re: small files problem</title>
      <link>https://community.cloudera.com/t5/Archives-of-Support-Questions/small-files-problem/m-p/18754#M2930</link>
      <description>If you use HAR to combine 8 smaller files (each less than 1 MB), the archive would&lt;BR /&gt;occupy just one block. Beyond any disk space saved, you save on metadata&lt;BR /&gt;storage (on the namenode and datanodes), and this is far more significant in&lt;BR /&gt;the long term for performance.&lt;BR /&gt;&lt;BR /&gt;</description>
      <pubDate>Mon, 15 Sep 2014 08:48:01 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Archives-of-Support-Questions/small-files-problem/m-p/18754#M2930</guid>
      <dc:creator>GautamG</dc:creator>
      <dc:date>2014-09-15T08:48:01Z</dc:date>
    </item>
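    <item>
      <title>Editor's note: HAR block-count sketch</title>
      <description>The saving can be made concrete with a small sketch. The numbers below are hypothetical, matching the eight-files-under-1 MB example in this thread; this is an illustration, not output from an actual cluster.

```python
import math

MB = 1024 * 1024
BLOCK = 128 * MB
files = [1 * MB] * 8          # eight small files, 1 MB each

# Stored individually: each file pins its own (mostly empty) block entry.
individual_blocks = sum(math.ceil(size / BLOCK) for size in files)

# Archived into one HAR part file: a single block holds all 8 MB.
har_blocks = math.ceil(sum(files) / BLOCK)

# The raw data bytes (and hence replicated disk usage) are unchanged --
# HAR does not compress; the win is 8x fewer block objects to track.
print(individual_blocks, har_blocks)   # 8 1
```

As the later replies explain, the benefit is fewer namenode objects, not fewer data bytes on disk.</description>
    </item>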
    <item>
      <title>Re: small files problem</title>
      <link>https://community.cloudera.com/t5/Archives-of-Support-Questions/small-files-problem/m-p/18756#M2931</link>
      <description>&lt;P&gt;Thanks so much for resolving my long-standing confusion!&lt;/P&gt;&lt;P&gt;I know that HAR can lead to smaller metadata; however, I still do not understand why HAR can save disk space.&lt;/P&gt;&lt;P&gt;Eight 1 MB files would occupy eight 1 MB HDFS blocks, and the disk space used is 24 MB. HAR combines these files into an 8 MB har file occupying one 8 MB block, but the disk space used is still 24 MB. Or does HAR use some kind of compression?&lt;/P&gt;</description>
      <pubDate>Mon, 15 Sep 2014 09:34:08 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Archives-of-Support-Questions/small-files-problem/m-p/18756#M2931</guid>
      <dc:creator>sky88088</dc:creator>
      <dc:date>2014-09-15T09:34:08Z</dc:date>
    </item>
    <item>
      <title>Re: small files problem</title>
      <link>https://community.cloudera.com/t5/Archives-of-Support-Questions/small-files-problem/m-p/18758#M2932</link>
      <description>The advantage of using HAR files is not in saving disk space but in&lt;BR /&gt;needing less metadata. Please read the blog link I pasted earlier.&lt;BR /&gt;&lt;BR /&gt;quote:&lt;BR /&gt;&lt;BR /&gt;===&lt;BR /&gt;&lt;BR /&gt;A small file is one which is significantly smaller than the HDFS block size&lt;BR /&gt;(default 64MB). If you’re storing small files, then you probably have lots&lt;BR /&gt;of them (otherwise you wouldn’t turn to Hadoop), and the problem is that&lt;BR /&gt;HDFS can’t handle lots of files.&lt;BR /&gt;&lt;BR /&gt;Every file, directory and block in HDFS is represented as an object in the&lt;BR /&gt;namenode’s memory, each of which occupies 150 bytes, as a rule of thumb. So&lt;BR /&gt;10 million files, each using a block, would use about 3 gigabytes of&lt;BR /&gt;memory. Scaling up much beyond this level is a problem with current&lt;BR /&gt;hardware. Certainly a billion files is not feasible.&lt;BR /&gt;&lt;BR /&gt;Furthermore, HDFS is not geared up to efficiently accessing small files: it&lt;BR /&gt;is primarily designed for streaming access of large files. Reading through&lt;BR /&gt;small files normally causes lots of seeks and lots of hopping from datanode&lt;BR /&gt;to datanode to retrieve each small file, all of which is an inefficient&lt;BR /&gt;data access pattern.&lt;BR /&gt;===&lt;BR /&gt;&lt;BR /&gt;</description>
      <pubDate>Mon, 15 Sep 2014 09:41:01 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Archives-of-Support-Questions/small-files-problem/m-p/18758#M2932</guid>
      <dc:creator>GautamG</dc:creator>
      <dc:date>2014-09-15T09:41:01Z</dc:date>
    </item>
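    <item>
      <title>Editor's note: namenode memory rule of thumb</title>
      <description>The rule of thumb in the quoted passage works out as follows. This sketch uses only the figures quoted from the blog post: roughly 150 bytes of namenode memory per file, directory, or block object.

```python
files = 10_000_000
objects = files * 2              # one file object plus one block object each
BYTES_PER_OBJECT = 150           # rule-of-thumb figure from the blog post

total_bytes = objects * BYTES_PER_OBJECT
print(total_bytes / 10**9)       # 3.0 -- about 3 GB of namenode heap
```

That is why combining many small files into a HAR pays off: it shrinks the object count, which is the namenode's real bottleneck.</description>
    </item>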
    <item>
      <title>Re: small files problem</title>
      <link>https://community.cloudera.com/t5/Archives-of-Support-Questions/small-files-problem/m-p/18760#M2933</link>
      <description>&lt;P&gt;OK, thanks for your patient help.&lt;/P&gt;</description>
      <pubDate>Mon, 15 Sep 2014 09:43:04 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Archives-of-Support-Questions/small-files-problem/m-p/18760#M2933</guid>
      <dc:creator>sky88088</dc:creator>
      <dc:date>2014-09-15T09:43:04Z</dc:date>
    </item>
  </channel>
</rss>

