
Cluster Block Count - which is the real number?

Rising Star

Greetings, I would like to clear up my understanding of how the block count is measured for the cluster.

A bit of background: we started receiving high block count warnings in Cloudera Manager (6.3), which led to some investigation and cleanup. I am currently trying to lower the block count in our DEV environment, but I am a bit confused.

In Cloudera Manager, when I navigate to the HDFS service and look at the Health Tests, I see that there are "...1,608,301 total blocks in the cluster." However, when I run:

$ sudo -u hdfs hdfs fsck / -files -blocks -locations

the summary at the end states that:

Replicated Blocks:
Total size: 53194573887 B (Total open files size: 1557592 B)
Total files: 569244 (Files currently being written: 202)
Total blocks (validated): 553524 (avg. block size 96101 B) (Total open file blocks (not validated): 193)
Minimally replicated blocks: 553524 (100.0 %)

Now this is fine: from what I understand, fsck reports the logical block count, i.e. without the replication factor of 3, so the numbers add up roughly (553,524 × 3 ≈ 1,660,572, close to the 1,608,301 reported by CM). However, when I perform the same comparison in our PROD environment, I get the following. Cloudera Manager shows "... 449,966 total blocks in the cluster.", yet the fsck command returns:

Replicated Blocks:
Total size: 317645298827 B (Total open files size: 2529389 B)
Total files: 389375 (Files currently being written: 142)
Total blocks (validated): 368223 (avg. block size 862643 B) (Total open file blocks (not validated): 130)
Minimally replicated blocks: 368223 (100.0 %)

Could someone please explain the discrepancy between the numbers in this case?
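
(For reference, this is roughly how I pull just the totals out of the fsck output; the grep pattern simply matches the summary lines quoted above:)

$ sudo -u hdfs hdfs fsck / | grep -E 'Total blocks|Total files|Average block replication'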

Thank you, kind regards,

Gyuszi

6 Replies

Moderator

Hello @matagyula,

 

Thank you for reaching out to the community!

Please check whether the article https://community.cloudera.com/t5/Customer/DataNode-Block-Count-Threshold-alerts-are-displayed-in/ta... helps with this issue.


Madhuri Adipudi, Technical Solutions Manager


Rising Star

Hi @Madhur,

 

Thank you for your prompt response. Unfortunately I do not have a subscription with Cloudera, so I am unable to access the Knowledge Base. So far we have managed to get by with the free version of CDH 6.3 and the help of the community, on and off these forums 🙂

 

Kind regards,

Gyuszi Kovacs

Expert Contributor

@matagyula That does appear to be a discrepancy. There are a few things we can check for this.

 

1) Did you get the block numbers from the NameNode UI in both cases? If the information came from an alert, it may be out of date as old alerts are preserved.

2) In the PROD environment, are all of the DataNodes showing as online? You can get this information from the command line with the following (see also the sketch after this list):

$ hdfs dfsadmin -report

This should also include a block count, but note that the dfsadmin report counts replicas and identifies incompletely replicated blocks as missing.

3) Is the replication factor the same in PROD as it is in the other environment?
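
If you want to cross-check the NameNode's own figure without going through CM or alerts, you can also query the NameNode JMX endpoint directly. A quick sketch, assuming the default Hadoop 3 / CDH 6 NameNode HTTP port of 9870 (substitute your NameNode host, and the port if yours differs):

$ curl -s 'http://<namenode-host>:9870/jmx?qry=Hadoop:service=NameNode,name=FSNamesystem' | grep -i blocksTotal

BlocksTotal is the NameNode's logical block count (no replication factor applied), so it should line up with the fsck total rather than with the per-DataNode replica counts.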

 

The simplest explanation is that one or more DataNodes have been excluded from the count, but if the count came from an alert it may be inaccurate due to timing.

 

Regards,

Ryan Blough, COE

Cloudera Inc.

Rising Star

@rblough - thank you very much for your prompt reply.

 

1) I got the block numbers from the HDFS status page in Cloudera Manager. Based on your question, I checked the numbers on the NameNode UI. For the DEV environment all three DataNodes are online, showing 1,613,019 blocks each (CM shows 1,613,104). For the PROD environment the NameNode UI shows 477,464 blocks on each of the three DataNodes.

 

2) Yes, all of the DataNodes are showing as online. dfsadmin -report confirms this, as do the NameNode UI and CM. Incidentally, the report did not include the total block count, just the number of missing or under-replicated blocks, and everything sits at zero.

 

3) The replication factor is set to 3 in both environments.

 

Kind regards,

Gyuszi Kovacs

Expert Contributor

@matagyula I suggest we attempt to get more information out of fsck in the PROD environment. This has two parts:


1) Use the options to get more detailed output about which blocks go where, and include snapshots.

$ hdfs fsck / -files -blocks -locations -includeSnapshots

 

This will break the results down into files, which blocks belong to which files, and where those blocks are located. Note: this will be a longer fsck and will induce a heavier load, so it is not recommended during peak load times. If the full output is too much to sift through, see the per-directory sketch after this list.

 

2) Check the user who is running the fsck. We recommend running as the hdfs user, or another admin-level user.
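
As a supplement to 1), if you want to see where the blocks are concentrated without reading the whole listing, you can run fsck per top-level directory and compare the totals. A rough sketch (the loop just feeds each top-level path back into fsck; the NR>1 skips the "Found N items" header):

$ for d in $(hdfs dfs -ls / | awk 'NR>1 {print $NF}'); do echo -n "$d: "; hdfs fsck "$d" 2>/dev/null | grep 'Total blocks'; done

Note that per-directory runs will not include snapshot-only blocks unless you also add -includeSnapshots.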

 

Edit: hdfs fsck also ignores open files by default. Depending on your prod cluster's usage patterns and data structure, it is possible for a very large number of blocks to be open at once. You can add an option to include these in the count:

 

$ hdfs fsck / -openforwrite

I recommend running this separately, before the heavier multi-option version above.
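
If you just want a quick count of how many files are currently open, you can filter on the OPENFORWRITE marker that fsck prints next to such files; for example:

$ hdfs fsck / -openforwrite | grep -c OPENFORWRITE

Comparing the "Total blocks (validated)" line from the -openforwrite run against a plain fsck run then tells you how many blocks belong to open files.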

Rising Star

@rblough - thank you for the continued support.

 

2) The command is being run as the hdfs user.

 

1) The detailed output showed that there are 603,723 blocks in total. Looking at the HDFS UI, the DataNodes report having 586,426 blocks each.

 

3) hdfs fsck / -openforwrite says that there are 506,549 blocks in total.

 

The discrepancy in the block count still seems to be there. Below are the summaries of the different fsck outputs.

 

hdfs fsck / -files -blocks -locations -includeSnapshots

Status: HEALTHY
Number of data-nodes: 3
Number of racks: 1
Total dirs: 64389
Total symlinks: 0

Replicated Blocks:
Total size: 330079817503 B (Total open files size: 235302 B)
Total files: 625308 (Files currently being written: 129)
Total blocks (validated): 603723 (avg. block size 546740 B) (Total open file blocks (not validated): 122)
Minimally replicated blocks: 603723 (100.0 %)
Over-replicated blocks: 0 (0.0 %)
Under-replicated blocks: 0 (0.0 %)
Mis-replicated blocks: 0 (0.0 %)
Default replication factor: 3
Average block replication: 3.0
Missing blocks: 0
Corrupt blocks: 0
Missing replicas: 0 (0.0 %)
Blocks queued for replication: 0

Erasure Coded Block Groups:
Total size: 0 B
Total files: 0
Total block groups (validated): 0
Minimally erasure-coded block groups: 0
Over-erasure-coded block groups: 0
Under-erasure-coded block groups: 0
Unsatisfactory placement block groups: 0
Average block group size: 0.0
Missing block groups: 0
Corrupt block groups: 0
Missing internal blocks: 0
Blocks queued for replication: 0
FSCK ended at Wed Sep 30 12:23:06 CEST 2020 in 23305 milliseconds

hdfs fsck / -openforwrite

Status: HEALTHY
Number of data-nodes: 3
Number of racks: 1
Total dirs: 63922
Total symlinks: 0

Replicated Blocks:
Total size: 329765860325 B
Total files: 528144
Total blocks (validated): 506549 (avg. block size 651004 B)
Minimally replicated blocks: 506427 (99.975914 %)
Over-replicated blocks: 0 (0.0 %)
Under-replicated blocks: 0 (0.0 %)
Mis-replicated blocks: 0 (0.0 %)
Default replication factor: 3
Average block replication: 2.9992774
Missing blocks: 0
Corrupt blocks: 0
Missing replicas: 0 (0.0 %)
Blocks queued for replication: 0

Erasure Coded Block Groups:
Total size: 0 B
Total files: 0
Total block groups (validated): 0
Minimally erasure-coded block groups: 0
Over-erasure-coded block groups: 0
Under-erasure-coded block groups: 0
Unsatisfactory placement block groups: 0
Average block group size: 0.0
Missing block groups: 0
Corrupt block groups: 0
Missing internal blocks: 0
Blocks queued for replication: 0
FSCK ended at Wed Sep 30 12:28:06 CEST 2020 in 11227 milliseconds