Hadoop Data Node: why is there a "magic" number threshold for data blocks?

Contributor

1. A threshold of 500,000 or 1M seems like a "magic" number. Shouldn't it be a function of the node's memory (Java Heap Size of DataNode in Bytes)?

Other interesting related questions:

 

2. What does a high block count indicate?
a. too many small files?
b. running out of capacity?
Is it (a) or (b)? How do we differentiate between the two?

 

3. What is a small file? A file whose size is smaller than block size (dfs.blocksize)?

 

4. Does each file take a new data block on disk? Or is it the metadata associated with each new file that is the problem?

 

5. The effects are more GC, declining execution speeds, etc. How can the effects of a high block count be "quantified"?

1 ACCEPTED SOLUTION

Contributor

Thanks, everyone, for your input. I have done some research on the topic and am sharing my findings.

 

1. Any static number is a magic number. I propose the block count threshold be: heap memory (in GB) x 1 million x comfort percentage (say 50%).

 

Why? Rule of thumb: 1 GB of heap per 1M blocks (Cloudera [1]).

The actual amount of heap memory required by the NameNode turns out to be much lower:
Heap needed = (number of blocks + number of inodes (files + folders)) x object size (150-300 bytes [1,2])

For 1 million *small* files: heap needed = (1M + 1M) x 300 B ≈ 572 MB <== much smaller than the rule of thumb.
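A rough back-of-the-envelope sketch of both formulas (the heap size and comfort percentage are example values, and the variable names are mine, not Cloudera Manager configuration keys):

# Proposed DN block count threshold from the NameNode heap, per the rule of thumb [1]
heap_gb=16          # example NameNode heap size, in GB
comfort_pct=50      # stay at ~50% of the rule-of-thumb capacity
echo "Proposed threshold: $(( heap_gb * 1000000 * comfort_pct / 100 )) blocks"

# Lower-bound heap estimate: (blocks + inodes) x ~300 bytes per object [1,2]
blocks=1000000
inodes=1000000
echo "Heap needed: ~$(( (blocks + inodes) * 300 / 1024 / 1024 )) MB"   # ~572 MB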

 

2. A high block count may indicate both.
The NameNode UI shows the heap capacity used.

For example,
http://namenode:50070/dfshealth.html#tab-overview
9,847,555 files and directories, 6,827,152 blocks = 16,674,707 total filesystem object(s).
Heap Memory used 5.82 GB of 15.85 GB Heap Memory. Max Heap Memory is 15.85 GB.

 

** Note, the heap memory used (5.82 GB) is still higher than 16,674,707 objects x 300 bytes ≈ 4.65 GB.
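The same figures can also be pulled without the UI from the NameNode's JMX servlet; a minimal sketch, assuming the same namenode:50070 address as in the example above:

# Files and blocks tracked by the NameNode (FSNamesystem metrics)
curl -s 'http://namenode:50070/jmx?qry=Hadoop:service=NameNode,name=FSNamesystem' \
  | grep -E '"(FilesTotal|BlocksTotal)"'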

 

To find small files, run:
hdfs fsck <path> -blocks | grep "Total blocks (validated):"
It would return something like:
Total blocks (validated): 2402 (avg. block size 325594 B) <== which is smaller than 1 MB

 

3. Yes. A file is small if its size < dfs.blocksize.
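One quick way to count such files under a path is sketched below; /data is an example path and 128 MB is assumed as dfs.blocksize, so substitute your own values:

# Count files whose length is below dfs.blocksize (assumed 128 MB here)
hdfs dfs -ls -R /data \
  | awk -v bs=$((128*1024*1024)) '$1 ~ /^-/ && $5 + 0 < bs {n++} END {print n + 0, "files smaller than dfs.blocksize"}'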

 

4.
* Each file takes at least one new data block on disk, though the block size is close to the file size, so it is a small block.
* For every new file, an inode-type object is created (~150 B), which puts stress on the NameNode's heap memory.


Small files pose problems for both the NameNode and the DataNodes:
NameNode:
- They pull the ceiling on the number of files down, since metadata for each file must be kept in memory.
- Restarts take a long time, since the metadata of every file must be read from a cache on local disk.

DataNodes:
- A large number of small files means a large amount of random disk I/O. HDFS is designed for large files and benefits from sequential reads.

 

[1] https://www.cloudera.com/documentation/enterprise/5-8-x/topics/admin_nn_memory_config.html
[2] https://martin.atlassian.net/wiki/pages/viewpage.action?pageId=26148906


8 REPLIES

Champion

1. Yes, it could. I personally don't like the threshold; it is not a great indicator of a small-file issue.

 

2. The number reported by the DN is for all of the replicas. It could mean a lot of small files or just a lot of data. At the defaults, it could mean that the DN heap could use a boost, although I always end up bumping it sooner.

 

3. Yes.

 

4. Yes. Each file takes up one or more blocks. The NN has to track each block and its replicas in its memory, so a lot of small files can chew through the NN heap quickly. The DN heap is less concerned with the metadata associated with a block than with the blocks being read, written, or replicated.

 

5. I'd worry less about the block count and more about the heap.

Super Collaborator

We also do not like this "magic number" but we find it useful.

 

I think you should at least investigate your cluster when you have that warning in order to check that you do not have the "too many small files" issue.

Even if we are not satisfied with the threshold configured, it is still useful as a reminder (and should only be considered as such).

 

Having too many small files can also be a bad thing performance-wise, because MapReduce instantiates one separate mapper per block to read (if you use that data in jobs).

 

By the way, for investigating this I often use the "fsck" utility. When used with a path, it gives you the block count, the total size, and the average block size. This will let you know where in your HDFS storage you have too many small files.

When you have 200,000 blocks under a path with an average block size of 2 MB, that is a good indicator of having too many small files.
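A minimal sketch of that per-path check, assuming the paths of interest sit under an example /data directory:

# Report the block count and average block size for each path under /data
for dir in $(hdfs dfs -ls /data | tail -n +2 | awk '{print $NF}'); do
  echo "== ${dir}"
  hdfs fsck "${dir}" | grep "Total blocks (validated):"
done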

 

 

 


Champion
@BellRizz This is awesome. I hadn't thought to look at it from that angle. It makes sense though: the NN, based on its heap, is expected to handle only X number of blocks, and so you can set the threshold on the DNs accordingly.

You should add "times 3" to your calculation though, as the NN calculation only accounts for single replicas. The NN stores the metadata for a single block, which includes the locations of the other two replicas, but that is already part of the 300 bytes. So when the rule of thumb says 1 GB for 1 million blocks, there will actually be 3 million total block replicas at a default replication factor of 3. And maybe also divide the number by the number of DNs.

A NN heap of 5 GB should handle upwards of 5 million blocks, which is actually 15 million block replicas in total. A 10-node cluster should then set the DN block threshold to 1.5 million.
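A quick sketch of that arithmetic, using the example values from this thread:

# Per-DataNode block threshold from NameNode heap, replication factor, and cluster size
nn_heap_gb=5       # example NameNode heap
replication=3      # default HDFS replication factor
datanodes=10       # example cluster size

unique_blocks=$(( nn_heap_gb * 1000000 ))           # ~1M blocks per GB of NN heap [1]
total_replicas=$(( unique_blocks * replication ))   # 15,000,000 block replicas
echo "Per-DN block threshold: $(( total_replicas / datanodes ))"   # 1,500,000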

Contributor
A NN heap of 5 GB should handle upwards of 5 million blocks, which is actually 15 million block replicas in total. A 10-node cluster should then set the DN block threshold to 1.5 million.

-- Does this hold good for a heterogeneous cluster where a few data nodes have 40 TB of space and others have 80 TB? I am sure having a DataNode block threshold of 500,000 is not a good practice. This will cause the smaller DataNodes to fill up faster than the larger DataNodes and send alerts at an early stage.

Champion
@naveen this is a threshold within Cloudera Manager and has no effect on how HDFS writes and replicates the data. This threshold just triggers Warning and Critical alerts in CM once it is exceeded.

A cluster with different-sized nodes will always have the smaller DNs fill up faster, as HDFS doesn't know better. You could use storage pools to manage it and keep the distribution in check.

Contributor

 

Thanks for your reply.

Any links or docs on storage pools would be helpful.

Champion
https://hadoop.apache.org/docs/stable/hadoop-project-dist/hadoop-hdfs/ArchivalStorage.html
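That page documents the hdfs storagepolicies command; a minimal sketch (the path and policy below are examples, and the chosen policy must match the storage types configured on your DataNodes):

# List the built-in storage policies, then pin an example path to one of them
hdfs storagepolicies -listPolicies
hdfs storagepolicies -setStoragePolicy -path /data/cold -policy COLD
hdfs storagepolicies -getStoragePolicy -path /data/cold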

I recommend opening a new topic if you have any other questions on storage pools. That way this discussion can stay on topic.