Support Questions


HDFS Missing blocks (with replication factor 1)

Contributor

Hello community! 

 

I recently added 4 more DNs to my Hadoop cluster; there are now 46 DNs up and running.

I've been running the balancer for 5 days, and today a message appeared at the top of the NameNode web console (my_name_node_url:50070): "There are 1169 missing blocks. The following files may be corrupted:", followed by a list of some of those corrupted files.

After I saw that message, I ran the command "hdfs dfsadmin -report" and the result was:

 

Configured Capacity: 1706034579673088 (1.52 PB)
Present Capacity: 1683943231506526 (1.50 PB)
DFS Remaining: 559797934331658 (509.13 TB)
DFS Used: 1124145297174868 (1022.40 TB)
DFS Used%: 66.76%
Under replicated blocks: 1169
Blocks with corrupt replicas: 0
Missing blocks: 1169
Missing blocks (with replication factor 1): 21062
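
As a side note, the block-health counters in that report can be checked programmatically. A minimal sketch, assuming the report has been saved to a file; the sample lines below are the ones quoted above:

```shell
# Save the live report first (needs a running cluster):
#   hdfs dfsadmin -report > report.txt
# Sample of the relevant lines, copied from the report above:
cat > report.txt <<'EOF'
Under replicated blocks: 1169
Blocks with corrupt replicas: 0
Missing blocks: 1169
Missing blocks (with replication factor 1): 21062
EOF
# Print each block-health counter as "name -> value":
awk -F': ' '/[Bb]locks/ {print $1 " -> " $2}' report.txt
```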

 

For storage-capacity reasons, a group of developers decided to ignore my advice and set the replication factor to 1 for some files.

What does "Missing blocks: 1169" mean?

Is the "Missing blocks (with replication factor 1)" message telling me that those 21062 blocks, belonging to files with replication factor 1, cannot be recovered?

 

I'll be very grateful if anyone can clarify this concept.

 

Thanks!

 

Guido.

1 ACCEPTED SOLUTION

Contributor

Hello!

 

I tracked the missing blocks and, fortunately, they belonged to a decommissioned DN, so I decided to remove them.

That's it!

 

Thanks for your help!

 

Guido.


5 REPLIES

Champion
Yes, "Missing blocks (with replication factor 1)" means that those files are now corrupt and unrecoverable. The 1169 blocks are listed as both missing and under-replicated, which means they need to be re-replicated from the other replicas of those blocks on the cluster.

By default the minimum replication factor is 1 and the default replication factor is 3. This means that if there are only 2 replicas of a block, it will eventually be re-replicated, but not immediately. I believe the default is to replicate them after 1 hour (CDH) or 8 hours (Apache Hadoop). This provides some leeway for node outages without flooding the cluster with replication operations. The cluster should recover; please post if it does not.
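
One way to watch whether that re-replication is actually happening (a sketch, not the only way) is to grep fsck output for under-replicated entries. The parsing step is illustrated below on a sample fsck line; the path is made up:

```shell
# On a live cluster (commented out; needs a NameNode):
#   hdfs fsck / -files -blocks | grep 'Under replicated'
# Sample fsck output line; /data/a/part-00000 is a made-up path:
cat > fsck.txt <<'EOF'
/data/a/part-00000:  Under replicated BP-1:blk_1073741825_1001. Target Replicas is 3 but found 2 replica(s).
EOF
# Extract just the affected file path (everything before the first colon):
awk -F: '/Under replicated/ {print $1}' fsck.txt
```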

Contributor

Thanks @mbigelow for clarifying this.

I've run the command (hdfs dfsadmin -report) as I did yesterday, and the output is the same.

 

Configured Capacity: 1706034579673088 (1.52 PB)
Present Capacity: 1683460189648203 (1.50 PB)
DFS Remaining: 558365599904940 (507.83 TB)
DFS Used: 1125094589743263 (1023.27 TB)
DFS Used%: 66.83%
Under replicated blocks: 1169
Blocks with corrupt replicas: 0
Missing blocks: 1169
Missing blocks (with replication factor 1): 21062

  

I've a couple of questions that maybe you can help me with.

 

1) Is there a way to get rid of that message in the NameNode web console? 

 

2) Is there a way to find out/list the missing files instead of the missing blocks?

 

3) The under-replicated block count is holding steady at 1169; isn't CDH supposed to handle this?

An important thing I forgot to mention is that HBase is present in the cluster, with 5 RegionServers. Maybe this question fits better in a new post, but as far as I know HBase and the HDFS balancer don't like each other, so I'm wondering whether this situation is the reason CDH is not replicating the under-replicated blocks.

 

Thanks again!

 

Guido.

 

 

Champion
1. Yes, remove the corrupt files. Try the normal way first: hdfs dfs -rm ... If that doesn't work, use hdfs fsck with -move or -delete. The first moves the affected files to /lost+found; the latter removes them from the cluster. But to do either, you need to know which files are affected.

2. Use the command hdfs fsck <path> -list-corruptfileblocks -files -locations
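
A small follow-up sketch on turning that listing into file paths: each output line of -list-corruptfileblocks pairs a block ID with the file it belongs to, so the unique paths can be pulled out with awk. The block IDs and paths below are made up:

```shell
# On the cluster, save the listing first:
#   hdfs fsck / -list-corruptfileblocks > corrupt.txt
# Made-up sample lines in the same "blockID path" shape:
cat > corrupt.txt <<'EOF'
blk_1073741825 /data/logs/part-00000
blk_1073741826 /data/logs/part-00000
blk_1073741901 /data/tmp/file.csv
EOF
# Unique file paths behind the corrupt blocks:
awk '{print $2}' corrupt.txt | sort -u
```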

3. Oh, I didn't notice this was in the HBase board. Can you expand on HBase's role in this issue? That will affect the above answers (you don't want to be deleting HBase files through HDFS). HBase has its own version of fsck; please run that and provide the output.

The balancer will not handle missing or under-replicated blocks; it only deals with existing blocks.
HDFS should repair itself, but if this has to do with corrupt regions in HBase, then HDFS likely won't, as HBase is more aware of the actual data.

Here is a Cloudera doc on the topic of HBase and corrupt regions.

https://www.cloudera.com/documentation/enterprise/5-4-x/topics/admin_hbck_poller.html

Contributor

Thanks @mbigelow!

 

I took a deep dive into those corrupt blocks and realized they don't belong to HBase tables; they are just plain HDFS files.

I think I more or less understand what is happening; please feel free to correct me if I'm wrong.

 

1) Under replicated blocks: 1169
2) Blocks with corrupt replicas: 0
3) Missing blocks: 1169
4) Missing blocks (with replication factor 1): 21062

 

1) I ran "hdfs fsck / -list-corruptfileblocks" to find out which files these blocks belong to. Then I listed those files, and all of them had a replication factor of 1.

The default replication factor in the cluster is 3, so no matter how long I wait for HDFS to automatically handle these under-replicated blocks, they will always be listed as under-replicated. Am I right? The cluster also has lots of other files with replication factor 1 that were not listed as "under replicated"; I can't understand why.
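
One thing worth noting here (a sketch, with made-up paths): an RF=1 file whose single replica still exists can be protected going forward by raising its replication with setrep, but a missing RF=1 block has no surviving copy to replicate from, so setrep cannot bring it back:

```shell
# Only helps files whose single replica still exists; a missing RF=1
# block has no surviving copy to replicate from. Made-up path:
#   hdfs dfs -setrep -w 3 /data/important/a.csv
# Intact RF=1 files can be found from -ls (replication is field 2):
#   hdfs dfs -ls /data/important | awk '$2 == 1 {print $NF}'
# Demonstration on a made-up -ls listing:
cat > ls.txt <<'EOF'
-rw-r--r--   1 guido hadoop   1048576 2018-03-01 10:00 /data/important/a.csv
-rw-r--r--   3 guido hadoop   2097152 2018-03-01 10:05 /data/important/b.csv
EOF
awk '$2 == 1 {print $NF}' ls.txt
```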

 

2) Agreed, nothing to add.

 

3) These blocks are missing from the entire cluster; they are dead, and there's no way to get them back without a backup. They are the same as the under-replicated ones in 1). My question here is: why aren't these files counted under "Missing blocks (with replication factor 1)"? Or maybe they are, but then why are there still "under replicated" blocks?

 

4) Not much to add; once 3) is clarified, I'll understand this better.

 

Thanks again! 

 

 
