Support Questions

GrazittiAPI · ‎03-04-2021

Hello,

I'm somewhat confused about the HDFS block count threshold configuration.

As a rule of thumb, I read that the threshold should be set like this:

1 GB of NameNode Java heap ~= 1M blocks, so with our setting of 10 GB the threshold would be 10M blocks. But then I read that the replication factor (x3 by default) isn't included?

So does this mean I have to set the threshold to 3.3M blocks if I set the NameNode Java heap size to 10 GB?

thx Martin

PabitraDas · ‎03-04-2021

Hello @uxadmin, please note that the block count threshold configuration applies to DataNodes only.

It is a DataNode health test that checks whether the DataNode has too many blocks, because too many blocks on a DataNode may affect its performance.

There's no hard limit on the number of blocks that can be written to a DN, as block size is merely a logical concept, not a physical layout. However, the block count alert serves as an early warning of a growing small-files problem. While a DN can handle a lot of blocks in general, going too high will cause performance issues. Processing may also slow down if you keep a lot of tiny files on HDFS (depending on your use case, of course), so it is worth looking into.

You can find the block count threshold in the HDFS configuration by navigating to CM > HDFS > Configuration > DataNode Block Count Thresholds.

When the block count on a DN goes above the threshold, CM triggers an alert, so adjust the threshold based on the block counts of your DNs. You can see the block count of each DN by navigating to CM > HDFS > WebUI > Active NN > DataNodes tab > Block counts column under the Datanode section.
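To make the comparison concrete, here is a minimal Python sketch of what the alert effectively checks. The host names and block counts are hypothetical placeholders (you would copy the real numbers from the DataNodes tab), and `datanodes_over_threshold` is an illustrative helper, not part of CM or Hadoop.

```python
def datanodes_over_threshold(block_counts, warning_threshold):
    """Return (host, count) pairs above the threshold, highest count first."""
    over = [(host, count) for host, count in block_counts.items()
            if count > warning_threshold]
    return sorted(over, key=lambda pair: pair[1], reverse=True)


# Hypothetical per-DataNode block counts, e.g. read off the DataNodes tab.
sample_counts = {
    "dn1.example.com": 420_000,
    "dn2.example.com": 610_000,
    "dn3.example.com": 505_000,
}

for host, count in datanodes_over_threshold(sample_counts, 500_000):
    print(f"{host}: {count} blocks")
```

With a threshold of 500000, only the two DataNodes above that count are listed.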

Hope this helps.

uxadmin · ‎03-05-2021

Hello @PabitraDas,

Thanks for the clarification, but may I ask one more thing? The block count is a DN issue but also an NN issue, since the NN has to keep the file info in memory, right? So, as far as I understand it, we should not exceed a certain limit, otherwise the cluster will suffer performance issues.
So my question is: what would be a feasible formula for setting this threshold, and which values influence it (like the number of NN disks? memory?)

br

PabitraDas · ‎03-05-2021

Hello @uxadmin, thank you for the follow-up question. The NameNode keeps the metadata for every file and block written to HDFS, so a higher block count means more metadata to hold and, in turn, more NameNode heap. As a rule of thumb, we suggest 1 GB of NameNode heap for every 1 million blocks in HDFS. Similarly, a DataNode needs roughly 1 GB of heap for every 1 million blocks it stores to operate smoothly.

As I said earlier, there is no hard limit on the number of blocks a DN can store, but a very high block count usually indicates that small files are accumulating in HDFS. Check the average block size in HDFS to see whether you are hitting the small-files problem.

fsck shows the average block size. If the value is very low (e.g. ~1 MB), you are probably hitting the small-files problem and it is worth investigating; otherwise there is no need to review the number of blocks.
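The rule of thumb above can be written down as a small Python sketch. The function name and the round-up-to-whole-GB behaviour are my own assumptions for illustration; the only figure taken from this thread is the ratio of roughly 1 GB of heap per 1 million blocks.

```python
import math

# Rule of thumb from this thread (not a hard limit): ~1 GB heap per 1M blocks.
BLOCKS_PER_GB_HEAP = 1_000_000


def estimate_heap_gb(block_count, blocks_per_gb=BLOCKS_PER_GB_HEAP):
    """Estimate heap in whole GB for a block count, rounding up (minimum 1 GB)."""
    return max(1, math.ceil(block_count / blocks_per_gb))


print(estimate_heap_gb(10_000_000))  # 10
print(estimate_heap_gb(2_500_000))   # 3
```

Treat the result as a starting point for sizing rather than an exact requirement.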

[..]
$ hdfs fsck /
..
...
Total blocks (validated): 2899 (avg. block size 11475601 B) <<<<<
[..]
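If you want to check this automatically, the summary line can be parsed with a few lines of Python. This is only a sketch: the regular expression assumes the line looks like the one shown above, and the 1 MiB cutoff is an assumption based on the "~1 MB" example in this reply, not an official limit.

```python
import re

# Matches the fsck summary line shown above; whitespace after the colon may vary.
FSCK_BLOCKS_LINE = re.compile(
    r"Total blocks \(validated\):\s*(\d+)\s*\(avg\. block size (\d+) B\)"
)


def parse_fsck_summary(text):
    """Return (total_blocks, avg_block_size_bytes), or None if the line is absent."""
    match = FSCK_BLOCKS_LINE.search(text)
    if match is None:
        return None
    return int(match.group(1)), int(match.group(2))


def looks_like_small_files(avg_block_bytes, cutoff_bytes=1024 * 1024):
    """True if the average block size is at or below the (assumed) cutoff."""
    return avg_block_bytes <= cutoff_bytes


sample = "Total blocks (validated): 2899 (avg. block size 11475601 B)"
total_blocks, avg_bytes = parse_fsck_summary(sample)
print(total_blocks, avg_bytes, looks_like_small_files(avg_bytes))
```

For the sample line the average block size is about 11 MB, so the small-files check comes back false.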

In short, there is no hard limit on the DN block count, but a rising block count on a DN is an early indicator of a small-files problem in the cluster. And of course, more small files mean more heap is needed on both the NN and the DN.

In a perfect world where every file is written in full 128 MiB blocks (the HDFS default block size), 1 TB of DN storage holds 8,192 blocks (1024*1024/128). By that calculation, a DN with 23 TB holds 188,416 blocks. Realistically, though, not every file is created with a 128 MiB block size and not every file fills an entire block. So in a normal CDH installation we keep a minimum of 500,000 as the warning threshold for DN block counts. Depending on your use case and how files are written to HDFS, the block count may still reach that threshold over time. A more suitable threshold can be derived from the DataNode disk capacity used for storing blocks.
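The perfect-world arithmetic can be reproduced with a short Python sketch. The function name is illustrative, and it assumes capacity is given in whole TiB and that every block is completely full, which real clusters will not satisfy.

```python
MIB_PER_TIB = 1024 * 1024


def max_full_blocks(capacity_tib, block_size_mib=128):
    """Number of completely full blocks that fit into the given capacity."""
    return capacity_tib * MIB_PER_TIB // block_size_mib


print(max_full_blocks(1))   # 8192
print(max_full_blocks(23))  # 188416
```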

Say you have allocated ten 2 TB disks (/data/1/dfs/dn through /data/10/dfs/dn) for block storage on a DataNode, so 20 TB is available for blocks. If you are writing files with an average block size of 10 MB, that DN can hold at most 2,097,152 blocks (20 TB / 10 MB). A value of 1M (1000000) is therefore a good warning threshold.

Hope this helps. If you have any further questions, feel free to ask.
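The same estimate as a Python sketch, starting from a list of disk sizes and an observed average block size. Both function names are illustrative, and the 0.5 fraction is my own assumption, chosen only because 1M is roughly half of the 2,097,152 maximum in the example; it is not a Cloudera recommendation.

```python
def max_blocks_on_datanode(disk_sizes_tb, avg_block_mb):
    """Upper bound on blocks a DataNode can hold at the observed average block size."""
    total_mb = sum(disk_sizes_tb) * 1024 * 1024
    return int(total_mb // avg_block_mb)


def warning_threshold(max_blocks, fraction=0.5):
    """Warning threshold as a fraction of the maximum (the fraction is an assumption)."""
    return int(max_blocks * fraction)


disks_tb = [2] * 10  # ten 2 TB disks, as in the example above
capacity_blocks = max_blocks_on_datanode(disks_tb, avg_block_mb=10)
print(capacity_blocks)                     # 2097152
print(warning_threshold(capacity_blocks))  # 1048576
```

Rounding the second number to 1000000 gives the threshold suggested in the reply.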

Cheers!


uxadmin · ‎03-05-2021

Hello @PabitraDas,

That's the information I needed, thanks a lot!

br

Cloudera Community

Block Count threshold configuration