Created 10-10-2014 07:59 AM
In my HDFS status summary in Cloudera Manager, I see the following messages about missing and underreplicated blocks:
The 2 corrupt files are the following:
hdfs@sandy-quad-1:~$ hdfs fsck -list-corruptfileblocks
14/10/10 16:51:59 WARN ssl.FileBasedKeyStoresFactory: The property 'ssl.client.truststore.location' has not been set, no TrustStore will be loaded
Connecting to namenode via http://sandy-quad-1.sslab.lan:50070
The list of corrupt files under path '/' are:
blk_1074173133  /user/history/done/2014/10/07/000001/job_1412322902461_1076-1412674739294-bart-hadoop%2Dmapreduce%2Dclient%2Djobclient%2D2.3.0%2Dcdh5.1.2%2Dt-1412674771395-10-1-SUCCEEDED-root.bart-1412674749886.jhist
blk_1074173134  /user/history/done/2014/10/07/000001/job_1412322902461_1076_conf.xml
The filesystem under path '/' has 2 CORRUPT files
What is the best way to fix these two corrupt files and also fix the underreplicated block problem?
Created 10-10-2014 08:09 AM
I was able to remove the corrupt files using
hdfs@sandy-quad-1:~$ hdfs fsck / -delete
Now I still need to find out how to fix the 'Under-Replicated Blocks' problem...
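To confirm that the delete actually cleared the corruption, re-running a plain fsck is a quick check (the exact output wording can differ slightly between versions):
hdfs fsck /
A healthy run ends with "The filesystem under path '/' is HEALTHY", and the summary block near the end also reports the current number of under replicated blocks.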
Created 10-20-2014 01:51 PM
Created 10-20-2015 09:52 PM
Can you please explain in detail?
Created 10-10-2016 08:31 AM
Providing some additional detail for later reference.
Manikumar's notes above only pertain to under replicated blocks, not to the missing blocks in the original problem statement.
Missing blocks are those where the Namenode determines that _all_ copies of a block are gone from the cluster,
while under replicated blocks are those where the Namenode determines that only some of the copies are missing.
As mentioned above, under replicated blocks should be recovered automatically by HDFS: the Namenode coordinates the increase in replication for a block through the Datanodes.
Under replicated blocks often occur after a hardware failure, and it can take some time to replicate all of the affected blocks to another disk or Datanode.
There are a couple of methods to monitor under replicated blocks.
1) For clusters with Cloudera Manager installed:
Click on the "Charts" link at the top of the screen
Click on "Chart Builder"
Use the following query: "select under_replicated_blocks;"
This will display a plot over time of the under replicated blocks.
If this value is decreasing, just continue to monitor the value until it drops to 0, and make sure that all Datanodes are healthy and available.
2) For clusters without Cloudera Manager:
The Namenode tracks the under replicated blocks through its web UI in two ways:
http://namenode.example.com:50070/dfshealth.html#tab-overview and look for "Under-Replicated" or
http://namenode.example.com:50070/jmx and look for "UnderReplicatedBlocks"
* The ports and locations will vary for your cluster.
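The same UnderReplicatedBlocks figure can also be pulled from a shell with curl; the bean name below is the usual NameNode FSNamesystem bean, but verify it against your Hadoop version and adjust the host and port as above:
curl -s 'http://namenode.example.com:50070/jmx?qry=Hadoop:service=NameNode,name=FSNamesystem' | grep -i UnderReplicatedBlocks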
Running the balancer will not change the replication of blocks. The Namenode asks Datanodes to transfer blocks based on each node's disk utilization
compared to the average disk utilization of the cluster. The balancer is typically limited in throughput so that balancing can run as a background task, while normal recovery of
under replicated blocks happens at an unrestricted rate.
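For reference, a minimal balancer run looks like the following; the 10 percent threshold and the bandwidth value are only illustrative, not recommendations for your cluster:
hdfs balancer -threshold 10
hdfs dfsadmin -setBalancerBandwidth 104857600   # ~100 MB/s per Datanode, applied without a restart
Again, this only evens out disk usage across Datanodes; it does not create the missing replicas of under replicated blocks.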
If the under replicated blocks are not decreasing, but staying steady, then more investigation is necessary.
Here are some questions to ask:
Is this a small cluster (roughly 3 to 10 nodes)? If so:
- Is the default replication greater than the number of alive Datanodes?
- Is the value of mapreduce.client.submit.file.replication lower than the number of Datanodes configured?
When a MapReduce job runs, it attempts to copy its submission files to the cluster with mapreduce.client.submit.file.replication copies (the default is 10).
If this is larger than the number of nodes in the cluster, then you will always have under replicated blocks; see the quick checks just below.
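A quick way to check these values from a shell (the /etc/hadoop/conf path below is the common default and may differ on your installation; if the property is absent from mapred-site.xml, the shipped default of 10 applies):
hdfs getconf -confKey dfs.replication
grep -A 1 mapreduce.client.submit.file.replication /etc/hadoop/conf/mapred-site.xml
hdfs dfsadmin -report | grep -Ei 'live datanodes|datanodes available'   # how many Datanodes are actually alive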
Is the cluster larger? If so:
- Is the network unhealthy?
If the Datanodes are frequently out of touch with the cluster, the Namenode may wrongly mark blocks as under replicated.
http://namenode.example.com:50070/dfshealth.html#tab-datanode will have information regarding last time that the Namenode was contacted by the Datanode.
Work with your networking team to validate the environment, and make sure that any top-of-rack switches or other networking hardware are healthy and not oversubscribed.
- Are there racks configured in the cluster? Is one rack entirely down?
This will cause under replicated blocks that might be impossible to resolve.
HDFS will not store all three block replicas within one rack. If you have only two racks, and one is down, then under replication will be impossible to resolve until the rack is healthy again.
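To see how the Namenode is currently mapping Datanodes to racks (a rack with no live nodes listed here points straight at the problem):
hdfs dfsadmin -printTopology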
Is the problem limited to specific files?
The default replication configured through Cloudera Manager, or through hdfs-site.xml in non-Cloudera Manager installations, is only a default.
Individual users can choose a different replication factor when a file is created.
This is unusual, but it does happen.
The following command will show all files that are not open. Look for "Target Replicas is X but found Y replica(s)"
hdfs fsck / -files
If X is larger than the number of available nodes, or different from the default replication, then you can change the replication of that file:
hdfs dfs -setrep 3 /path/to/strangefile
(Also note that "hdfs dfs -ls -R /" will show the desired replication for each file,
and "hdfs fsck / -blocks -files -locations" provides a very detailed view of all of the blocks in your cluster. Any of these commands may take a long time on a large cluster.)
Created 07-04-2017 12:54 PM
You mentioned that you still need to fix the 'Under-Replicated Blocks'.
This is what I found with Google to fix it:
$ su - <$hdfs_user>
# Collect the paths of every file reported as under replicated
$ hdfs fsck / | grep 'Under replicated' | awk -F':' '{print $1}' > /tmp/under_replicated_files
# Reset each of them to a replication factor of 3
$ for hdfsfile in `cat /tmp/under_replicated_files`; do echo "Fixing $hdfsfile :" ; hadoop fs -setrep 3 $hdfsfile; done