Created 02-05-2018 08:35 AM
In HDFS, what is the small file problem?
Created 02-05-2018 06:41 PM
HDFS keeps all file names and block locations in memory. This is done at the Namenode level, and it is what makes HDFS metadata operations so fast: whether you modify the file system or look up a file's block locations, all of it can be served with no disk I/O.
This design choice of keeping all metadata in memory at all times has certain trade-offs. One of them is that we spend a few hundred bytes of Namenode heap per file and per block.
This leads to an issue in HDFS: when you have a file system with 500 to 700 million files, the amount of RAM that the Namenode needs becomes large, typically 256 GB or more. At that size the JVM is also working hard, since it has to do things like garbage collection. There is another dimension to this: with 700 million files, it is quite possible that your cluster is serving 30-40K requests per second or more, which also creates a lot of memory churn.
So a large number of files, combined with a high rate of file system requests, makes the Namenode a bottleneck; in other words, the metadata that we need to keep in memory becomes the limiting factor in HDFS.
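To see why those numbers add up, here is a rough back-of-envelope sketch. It assumes about 150 bytes of Namenode heap per file object and per block object, which is a commonly quoted rule of thumb rather than an exact figure; the constants and the helper function are just for illustration.

```python
# Rough back-of-envelope estimate of the Namenode heap needed for metadata.
# ~150 bytes per file object and per block object is a commonly quoted rule
# of thumb, not an exact figure; real sizes depend on the Hadoop version,
# path lengths, and replication factor.

BYTES_PER_FILE_OBJECT = 150   # assumed average in-memory size of an inode
BYTES_PER_BLOCK_OBJECT = 150  # assumed average in-memory size of a block entry

def estimate_namenode_heap_gb(num_files, avg_blocks_per_file=1.0):
    """Return an order-of-magnitude Namenode heap estimate in GB."""
    num_blocks = num_files * avg_blocks_per_file
    total_bytes = (num_files * BYTES_PER_FILE_OBJECT
                   + num_blocks * BYTES_PER_BLOCK_OBJECT)
    return total_bytes / (1024 ** 3)

# 700 million small files (roughly one block each) already need ~200 GB of
# heap for metadata alone, before any operational headroom -- in line with
# the 256 GB figure above.
print(round(estimate_namenode_heap_gb(700_000_000), 1), "GB")
```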
There are several solutions and ongoing efforts to address this problem --
Please let me know if you have any other questions.
Thanks
Anu
Created 02-05-2018 09:33 AM
HDFS is a distributed file system, and Hadoop is mainly designed for batch processing of large volumes of data. The default block size in HDFS is 128 MB. When file sizes are significantly smaller than the block size, efficiency degrades.
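To put that in concrete terms, here is a small illustrative sketch of how quickly the Namenode's object count grows when the same amount of data is stored as small files instead of large ones. Only the 128 MB default block size comes from above; the file sizes and the helper function are assumptions for illustration.

```python
# Why many small files hurt: the Namenode tracks one object per file plus one
# per block, no matter how small the file is. The numbers below are purely
# illustrative; only the 128 MB default block size comes from the text above.
import math

BLOCK_SIZE_MB = 128

def metadata_objects(total_data_mb, file_size_mb):
    """Count file + block objects needed to store total_data_mb of data."""
    num_files = total_data_mb // file_size_mb
    blocks_per_file = math.ceil(file_size_mb / BLOCK_SIZE_MB)
    return num_files * (1 + blocks_per_file)

ONE_TB_IN_MB = 1024 * 1024
print(metadata_objects(ONE_TB_IN_MB, 1024))  # 1 TB as 1 GB files -> 9,216 objects
print(metadata_objects(ONE_TB_IN_MB, 1))     # 1 TB as 1 MB files -> 2,097,152 objects
```

The same terabyte of data goes from a few thousand metadata objects to a couple of million simply by shrinking the file size, which is exactly the Namenode pressure described in the other answer.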
There are mainly two reasons why small files are produced:
To understand HDFS block size in more detail, I'd recommend reviewing a few good Stack Overflow questions: http://stackoverflow.com/questions/13012924/large-block-size-in-hdfs-how-is-the-unused-space-account...
http://stackoverflow.com/questions/19473772/data-block-size-in-hdfs-why-64mb
For your disk/filesystem recommendations, take a look here:
Hope that all helps!
Duplicate topic.