Created on 01-25-2017 07:50 AM - edited 01-25-2017 07:51 AM
1. A threshold of 500,000 or 1M seems like a "magic" number. Shouldn't it be a function of the node's memory (Java Heap Size of DataNode in Bytes)?
Other interesting related questions:
2. What does a high block count indicate?
a. too many small files?
b. running out of capacity?
Is it (a) or (b)? How do we differentiate between the two?
3. What is a small file? A file whose size is smaller than block size (dfs.blocksize)?
4. Does each file take a new data block on disk? Or is it the metadata associated with each new file that is the problem?
5. The effects are more GC, declining execution speeds, etc. How can we "quantify" the effects of a high block count?
Created 01-26-2017 09:29 PM
1. Yes, it could. I personally don't like the threshold; it is not a great indicator of a small-file issue.
2. The number reported by the DN covers all of its replicas. It could mean a lot of small files or just a lot of data. At the default settings it may mean the DN heap could use a boost, although I usually end up bumping it sooner than that.
3. Yes.
4. Yes. Each file takes up one or more blocks. The NN has to track each block and its replicas in its memory, so a lot of small files can chew through the NN heap quickly. The DN heap is less concerned with the metadata associated with a block; it is driven more by the blocks being read, written, or replicated.
5. I'd worry less about the block count and more about the heap (a quick way to check it is sketched below).
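To make that concrete, here is a minimal sketch for checking the NameNode's actual heap usage from the command line. It assumes the default Hadoop 2.x NameNode web port (50070, as in the UI URL quoted later in this thread); the /jmx servlet exposes the standard JVM memory MBean as JSON:
# Query the NameNode's JMX servlet for the JVM heap MBean (JSON output).
# Replace "namenode" with your NameNode host; 50070 is the Hadoop 2.x default HTTP port.
curl -s 'http://namenode:50070/jmx?qry=java.lang:type=Memory'
The "HeapMemoryUsage" section of the response reports "used" and "max" in bytes, which is the same figure the dfshealth.html overview page shows as heap memory used.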
Created on 01-27-2017 01:09 AM - edited 01-27-2017 01:17 AM
We do not like this "magic number" either, but we find it useful.
I think you should at least investigate your cluster when you have that warning in order to check that you do not have the "too many small files" issue.
Even if we are not satisfied with the threshold configured, it is still useful as a reminder (and should only be considered as such).
Having too many small files can also hurt performance, since MapReduce instantiates one separate mapper per block to read (if you use that data in jobs).
By the way, to investigate this I often use the "fsck" utility. When run against a path, it gives you the block count, the total size, and the average block size. This tells you whether a given part of your HDFS storage has too many small files.
When you have 200,000 blocks under a path with an average block size of 2 MB, that is a good indicator of too many small files (see the example below).
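For illustration, a minimal sketch of that check (the path /user/hive/warehouse is just a hypothetical example; substitute any directory you want to inspect):
# Run fsck on a path and keep only the summary lines of interest.
# A high block count combined with a low average block size points to a small-files problem.
hdfs fsck /user/hive/warehouse | grep -E 'Total size|Total files|Total blocks'
The "Total blocks (validated)" line includes the average block size, so you can compare it directly against dfs.blocksize.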
Created 05-23-2017 07:38 AM
Thanks, everyone, for your input. I have done some research on the topic and would like to share my findings.
1. Any static number is a magic number. I propose setting the block count threshold to: heap memory (in GB) x 1 million x comfort percentage (say 50%).
Why? Rule of thumb: 1 GB of heap for 1M blocks (Cloudera [1]).
The actual amount of heap memory required by the NameNode turns out to be much lower:
heap needed = (number of blocks + number of inodes (files + folders)) x object size (150-300 bytes [1,2])
For 1 million *small* files: heap needed = (1M + 1M) x 300 B ≈ 572 MB <== much smaller than the rule of thumb. (A back-of-the-envelope sketch follows below.)
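A minimal back-of-the-envelope sketch of both formulas, assuming a 16 GB NameNode heap and 300-byte objects (the numbers are placeholders; plug in your own):
# Proposed block count threshold: heap (GB) x 1,000,000 x comfort percentage.
heap_gb=16
comfort_pct=50
echo "block count threshold: $(( heap_gb * 1000000 * comfort_pct / 100 ))"   # 8,000,000
# Estimated heap actually needed: (blocks + inodes) x bytes per object.
blocks=1000000
inodes=1000000
bytes_per_object=300
echo "heap needed (MB): $(( (blocks + inodes) * bytes_per_object / 1024 / 1024 ))"   # ~572 MB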
2. A high block count may indicate both.
The NameNode UI shows the heap capacity used.
For example,
http://namenode:50070/dfshealth.html#tab-overview
9,847,555 files and directories, 6,827,152 blocks = 16,674,707 total filesystem object(s).
Heap Memory used 5.82 GB of 15.85 GB Heap Memory. Max Heap Memory is 15.85 GB.
** Note: the heap memory used (5.82 GB) is still higher than the estimate of 16,674,707 objects x 300 bytes ≈ 4.65 GB.
To find small files, run:
hdfs fsck <path> -blocks | grep "Total blocks (validated):"
It returns something like:
Total blocks (validated): 2402 (avg. block size 325594 B) <== the average block size is well under 1 MB
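To narrow down where the small files live, here is a minimal sketch that runs the same check per subdirectory (the path /data is a hypothetical example, and the awk expression assumes the standard hdfs dfs -ls output, whose last field is the path):
# For each subdirectory of /data, print its block count and average block size.
# Directories whose average block size is far below dfs.blocksize are accumulating small files.
hdfs dfs -ls /data | awk 'NR > 1 {print $NF}' | while read -r dir; do
    echo "== $dir =="
    hdfs fsck "$dir" | grep 'Total blocks (validated):'
done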
3. Yes. A file is small if its size < dfs.blocksize.
4.
* Each file takes at least one new block on disk, but the block only occupies as much space as the file's actual contents, so a small file results in a small block.
* For every new file, an inode-type object is created (~150 B), which adds pressure on the NameNode's heap memory.
Small files pose problems for both the NameNode and the DataNodes (a quick way to count them under a path is sketched after this list):
NameNode:
- Lowers the ceiling on the number of files, since the NameNode must keep metadata for each file in memory.
- Lengthens restart time, as it must read the metadata of every file from a cache on local disk.
DataNodes:
- A large number of small files means a large amount of random disk I/O. HDFS is designed for large files and benefits from sequential reads.
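Related to points 3 and 4, here is a minimal sketch that counts files smaller than the configured dfs.blocksize under a path (the path /data is a hypothetical example; the awk fields assume the standard hdfs dfs -ls -R output, where column 5 is the file size):
# Read the configured block size, then count files under /data smaller than it.
blocksize=$(hdfs getconf -confKey dfs.blocksize)
hdfs dfs -ls -R /data | awk -v bs="$blocksize" '
    /^-/ { total++; if (($5 + 0) < (bs + 0)) small++ }   # lines starting with "-" are files; $5 is the size
    END { printf "%d of %d files are smaller than %d bytes\n", small, total, bs }'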
[1] https://www.cloudera.com/documentation/enterprise/5-8-x/topics/admin_nn_memory_config.html
[2] https://martin.atlassian.net/wiki/pages/viewpage.action?pageId=26148906
Created 05-24-2017 02:17 PM
Thanks for your reply.
Any links or docs on storage pools would be helpful.