<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>Re: small files problem in Archives of Support Questions (Read Only)</title>
    <link>https://community.cloudera.com/t5/Archives-of-Support-Questions/small-files-problem/m-p/18746#M2926</link>
    <description>Cloudera Community thread on the HDFS small files problem: how small files consume blocks and disk space, and whether HAR archives help.</description>
    <pubDate>Mon, 15 Sep 2014 07:12:39 GMT</pubDate>
    <dc:creator>GautamG</dc:creator>
    <dc:date>2014-09-15T07:12:39Z</dc:date>
    <item>
      <title>small files problem</title>
      <link>https://community.cloudera.com/t5/Archives-of-Support-Questions/small-files-problem/m-p/18744#M2925</link>
      <description>&lt;P&gt;The HDFS block size on my system is set to 128 MB. Does that mean that if I put 8 files, each smaller than 128 MB, into HDFS, they would occupy 3 GB of disk space (replication factor = 3)?&lt;/P&gt;&lt;P&gt;When I use "hadoop fs -count", it only shows the size of the files. How can I find the actual disk space occupied by an HDFS file?&lt;/P&gt;&lt;P&gt;And what if I use HAR to archive these 8 files? Can that save some space?&lt;/P&gt;</description>
      <pubDate>Mon, 15 Sep 2014 05:57:10 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Archives-of-Support-Questions/small-files-problem/m-p/18744#M2925</guid>
      <dc:creator>sky88088</dc:creator>
      <dc:date>2014-09-15T05:57:10Z</dc:date>
    </item>
    <item>
      <title>Re: small files problem</title>
      <link>https://community.cloudera.com/t5/Archives-of-Support-Questions/small-files-problem/m-p/18746#M2926</link>
      <description>&lt;P&gt;&amp;gt; The HDFS block size in my system is set to be 128m. Does it mean that&lt;BR /&gt;&amp;gt; if I put 8 files less than 128m to HDFS, they would occupy 3G disk&lt;BR /&gt;&amp;gt; space (replication factor = 3) ?&lt;BR /&gt;&lt;BR /&gt;Yes, this is right. HDFS blocks are not shared among files.&lt;BR /&gt;&lt;BR /&gt;&amp;gt; How could I know the actual occupied space of HDFS file ?&lt;BR /&gt;&lt;BR /&gt;The -ls command tells you this. In the example below, the jar file is&lt;BR /&gt;3922 bytes long.&lt;BR /&gt;&lt;BR /&gt;# sudo -u hdfs hadoop fs -ls /user/oozie/share/lib/sqoop/hive-builtins-0.10.0-cdh4.7.0.jar&lt;BR /&gt;-rw-r--r--&amp;nbsp;&amp;nbsp; 3 oozie oozie&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; 3922 2014-09-14 06:17 /user/oozie/share/lib/sqoop/hive-builtins-0.10.0-cdh4.7.0.jar&lt;BR /&gt;&lt;BR /&gt;&amp;gt; And how about I use HAR to archive these 8 files ? Can it save some&lt;BR /&gt;&amp;gt; space ?&lt;BR /&gt;&lt;BR /&gt;Using HAR is a good idea. More ideas about dealing with the small files&lt;BR /&gt;problem are in this link:&lt;BR /&gt;&lt;A href="http://blog.cloudera.com/blog/2009/02/the-small-files-problem/" target="_blank"&gt;http://blog.cloudera.com/blog/2009/02/the-small-files-problem/&lt;/A&gt;&lt;/P&gt;</description>
      <pubDate>Mon, 15 Sep 2014 07:12:39 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Archives-of-Support-Questions/small-files-problem/m-p/18746#M2926</guid>
      <dc:creator>GautamG</dc:creator>
      <dc:date>2014-09-15T07:12:39Z</dc:date>
    </item>
    <item>
      <title>Re: small files problem</title>
      <link>https://community.cloudera.com/t5/Archives-of-Support-Questions/small-files-problem/m-p/18748#M2927</link>
      <description>&lt;P&gt;Thanks for your reply.&lt;/P&gt;&lt;P&gt;The -ls command tells me the size of the file, but what I want to know is the occupied disk space. The jar file is 3922 bytes long, but it actually occupies one HDFS block (128 MB) according to your first answer. Is that right?&lt;/P&gt;&lt;P&gt;Is there any way I can check the actual occupied space?&lt;/P&gt;</description>
      <pubDate>Mon, 15 Sep 2014 08:25:18 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Archives-of-Support-Questions/small-files-problem/m-p/18748#M2927</guid>
      <dc:creator>sky88088</dc:creator>
      <dc:date>2014-09-15T08:25:18Z</dc:date>
    </item>
    <item>
      <title>Re: small files problem</title>
      <link>https://community.cloudera.com/t5/Archives-of-Support-Questions/small-files-problem/m-p/18750#M2928</link>
      <description>&lt;P&gt;If I use HAR to archive these 8 files, would they be placed into one HDFS block (assuming they are each less than 1 MB)?&lt;/P&gt;&lt;P&gt;If so, I could save 7/8 of the disk space in this case.&lt;/P&gt;</description>
      <pubDate>Mon, 15 Sep 2014 08:32:17 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Archives-of-Support-Questions/small-files-problem/m-p/18750#M2928</guid>
      <dc:creator>sky88088</dc:creator>
      <dc:date>2014-09-15T08:32:17Z</dc:date>
    </item>
    <item>
      <title>Re: small files problem</title>
      <link>https://community.cloudera.com/t5/Archives-of-Support-Questions/small-files-problem/m-p/18752#M2929</link>
      <description>&lt;P&gt;A block on the file system isn't a fixed-size file with padding; rather, it is just a unit of storage. A block can hold at most 128 MB (or as configured), so a smaller file occupies only the space it actually needs.&lt;/P&gt;&lt;P&gt;In my previous response, I said 8 small files would take up 3 GB of space. This is incorrect. The space taken up on the cluster is still just the file size, times 3 for replication. Regardless of file size, you can divide the size by the block size (default 128 MB) and round up to the next whole number; this gives you the number of blocks. So in this case, the 3922-byte file uses one block to store its contents.&lt;/P&gt;</description>
      <pubDate>Mon, 15 Sep 2014 08:44:36 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Archives-of-Support-Questions/small-files-problem/m-p/18752#M2929</guid>
      <dc:creator>GautamG</dc:creator>
      <dc:date>2014-09-15T08:44:36Z</dc:date>
    </item>
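    <item>
      <title>Editor's note: block arithmetic sketch</title>
      <description>The block arithmetic described above can be sketched in Python. This is an illustrative aside, not part of the original thread; the 128 MB block size and replication factor of 3 are the values discussed here.

```python
import math

def hdfs_usage(file_size_bytes, block_size=128 * 1024 * 1024, replication=3):
    """Return (block_count, raw_disk_bytes) for one HDFS file.

    A block is a unit of storage, not a fixed-size padded file: a small
    file occupies only the space it needs, replicated 'replication' times.
    """
    blocks = math.ceil(file_size_bytes / block_size)
    raw_disk_bytes = file_size_bytes * replication
    return blocks, raw_disk_bytes

# The 3922-byte jar from the -ls example in the thread:
# one block, 3 x 3922 bytes on disk.
print(hdfs_usage(3922))   # (1, 11766)
```

So a 3922-byte file costs one block entry in metadata but only about 12 KB of replicated disk space.</description>
    </item>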
    <item>
      <title>Re: small files problem</title>
      <link>https://community.cloudera.com/t5/Archives-of-Support-Questions/small-files-problem/m-p/18754#M2930</link>
      <description>If you use HAR to combine 8 smaller files (each less than 1 MB), the archive would&lt;BR /&gt;occupy just one block. Beyond any disk space saved, you save on metadata&lt;BR /&gt;storage (on the namenode and datanodes), and this is far more significant in&lt;BR /&gt;the long term for performance.&lt;BR /&gt;&lt;BR /&gt;</description>
      <pubDate>Mon, 15 Sep 2014 08:48:01 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Archives-of-Support-Questions/small-files-problem/m-p/18754#M2930</guid>
      <dc:creator>GautamG</dc:creator>
      <dc:date>2014-09-15T08:48:01Z</dc:date>
    </item>
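    <item>
      <title>Editor's note: HAR block-count sketch</title>
      <description>The saving can be made concrete with a small sketch. The numbers below are hypothetical, matching the eight-files-under-1 MB example in this thread; this is an illustration, not output from an actual cluster.

```python
import math

MB = 1024 * 1024
BLOCK = 128 * MB
files = [1 * MB] * 8          # eight small files, 1 MB each

# Stored individually: each file pins its own (mostly empty) block entry.
individual_blocks = sum(math.ceil(size / BLOCK) for size in files)

# Archived into one HAR part file: a single block holds all 8 MB.
har_blocks = math.ceil(sum(files) / BLOCK)

# The raw data bytes (and hence replicated disk usage) are unchanged --
# HAR does not compress; the win is 8x fewer block objects to track.
print(individual_blocks, har_blocks)   # 8 1
```

As the later replies explain, the benefit is fewer namenode objects, not fewer data bytes on disk.</description>
    </item>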
    <item>
      <title>Re: small files problem</title>
      <link>https://community.cloudera.com/t5/Archives-of-Support-Questions/small-files-problem/m-p/18756#M2931</link>
      <description>&lt;P&gt;Thanks so much for resolving my long-standing confusion!&lt;/P&gt;&lt;P&gt;I know that HAR can lead to smaller metadata; however, I still do not understand why HAR can save disk space.&lt;/P&gt;&lt;P&gt;Eight 1 MB files would occupy eight 1 MB HDFS blocks, and the disk space used is 24 MB. HAR combines these files into an 8 MB har file occupying one 8 MB block, but the disk space used is still 24 MB. Or does HAR use some kind of compression?&lt;/P&gt;</description>
      <pubDate>Mon, 15 Sep 2014 09:34:08 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Archives-of-Support-Questions/small-files-problem/m-p/18756#M2931</guid>
      <dc:creator>sky88088</dc:creator>
      <dc:date>2014-09-15T09:34:08Z</dc:date>
    </item>
    <item>
      <title>Re: small files problem</title>
      <link>https://community.cloudera.com/t5/Archives-of-Support-Questions/small-files-problem/m-p/18758#M2932</link>
      <description>The advantage of using HAR files is not in saving disk space but in&lt;BR /&gt;needing less metadata. Please read the blog link I pasted earlier.&lt;BR /&gt;&lt;BR /&gt;quote:&lt;BR /&gt;&lt;BR /&gt;===&lt;BR /&gt;&lt;BR /&gt;A small file is one which is significantly smaller than the HDFS block size&lt;BR /&gt;(default 64MB). If you’re storing small files, then you probably have lots&lt;BR /&gt;of them (otherwise you wouldn’t turn to Hadoop), and the problem is that&lt;BR /&gt;HDFS can’t handle lots of files.&lt;BR /&gt;&lt;BR /&gt;Every file, directory and block in HDFS is represented as an object in the&lt;BR /&gt;namenode’s memory, each of which occupies 150 bytes, as a rule of thumb. So&lt;BR /&gt;10 million files, each using a block, would use about 3 gigabytes of&lt;BR /&gt;memory. Scaling up much beyond this level is a problem with current&lt;BR /&gt;hardware. Certainly a billion files is not feasible.&lt;BR /&gt;&lt;BR /&gt;Furthermore, HDFS is not geared up to efficiently accessing small files: it&lt;BR /&gt;is primarily designed for streaming access of large files. Reading through&lt;BR /&gt;small files normally causes lots of seeks and lots of hopping from datanode&lt;BR /&gt;to datanode to retrieve each small file, all of which is an inefficient&lt;BR /&gt;data access pattern.&lt;BR /&gt;===&lt;BR /&gt;&lt;BR /&gt;</description>
      <pubDate>Mon, 15 Sep 2014 09:41:01 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Archives-of-Support-Questions/small-files-problem/m-p/18758#M2932</guid>
      <dc:creator>GautamG</dc:creator>
      <dc:date>2014-09-15T09:41:01Z</dc:date>
    </item>
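    <item>
      <title>Editor's note: namenode memory rule of thumb</title>
      <description>The rule of thumb in the quoted passage works out as follows. This sketch uses only the figures quoted from the blog post: roughly 150 bytes of namenode memory per file, directory, or block object.

```python
files = 10_000_000
objects = files * 2              # one file object plus one block object each
BYTES_PER_OBJECT = 150           # rule-of-thumb figure from the blog post

total_bytes = objects * BYTES_PER_OBJECT
print(total_bytes / 10**9)       # 3.0 -- about 3 GB of namenode heap
```

That is why combining many small files into a HAR pays off: it shrinks the object count, which is the namenode's real bottleneck.</description>
    </item>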
    <item>
      <title>Re: small files problem</title>
      <link>https://community.cloudera.com/t5/Archives-of-Support-Questions/small-files-problem/m-p/18760#M2933</link>
      <description>&lt;P&gt;OK, thanks for your patient help.&lt;/P&gt;</description>
      <pubDate>Mon, 15 Sep 2014 09:43:04 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Archives-of-Support-Questions/small-files-problem/m-p/18760#M2933</guid>
      <dc:creator>sky88088</dc:creator>
      <dc:date>2014-09-15T09:43:04Z</dc:date>
    </item>
  </channel>
</rss>

