Created 11-01-2016 09:09 PM
CentOS 6.6
CDH 5.1.2
Due to space pressure, I need to reduce replication factor of existing files from 3 to 2.
I ran a command like the following:
[hdfs]$ hdfs dfs -setrep -R -w 2 /path/of/files
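To confirm the new target replication has at least been recorded in the NameNode metadata (separately from whether the excess replicas have actually been deleted yet), the stat format string can print each file's replication factor. The path below is the same placeholder path as above:

```shell
# Print replication factor and file name ("%r %n") for each file.
# /path/of/files is a placeholder; substitute the real directory.
hdfs dfs -stat '%r %n' /path/of/files/*
```

If this prints 2 for every file while fsck still reports over-replicated blocks, the excess replicas simply have not been deleted yet.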
A warning appeared: "waiting time may be long for DECREASING the number of replications."
Tens of minutes later I am still waiting, and fsck still shows over-replication.
[hdfs]$ hdfs fsck /path/of/files
16/11/02 12:04:42 WARN ssl.FileBasedKeyStoresFactory: The property 'ssl.client.truststore.location' has not been set, no TrustStore will be loaded
Connecting to namenode via http://namenode1:50070
FSCK started by hdfs (auth:SIMPLE) from /192.168.88.38 for path /path/of/files at Wed Nov 02 12:04:43 HKT 2016
....Status: HEALTHY
 Total size:                    129643323 B
 Total dirs:                    1
 Total files:                   4
 Total symlinks:                0
 Total blocks (validated):      4 (avg. block size 32410830 B)
 Minimally replicated blocks:   4 (100.0 %)
 Over-replicated blocks:        4 (75.0 %)
 Under-replicated blocks:       0 (0.0 %)
 Mis-replicated blocks:         0 (0.0 %)
 Default replication factor:    3
 Average block replication:     3
 Corrupt blocks:                0
 Missing replicas:              0 (0.0 %)
 Number of data-nodes:          6
 Number of racks:               1
FSCK ended at Wed Nov 02 12:04:43 HKT 2016 in 1 milliseconds

The filesystem under path '/path/of/files' is HEALTHY
Is this normal? How long should the wait be?
Created on 11-01-2016 11:02 PM - edited 11-01-2016 11:03 PM
Answering my question...
The source code of org.apache.hadoop.hdfs.server.blockmanagement.BlockManager says:

  ...
  if (numCurrentReplica > expectedReplication) {
    if (num.replicasOnStaleNodes() > 0) {
      // If any of the replicas of this block are on nodes that are
      // considered "stale", then these replicas may in fact have
      // already been deleted. So, we cannot safely act on the
      // over-replication until a later point in time, when
      // the "stale" nodes have block reported.
      return MisReplicationResult.POSTPONE;
    }
  ...
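Reduced to a self-contained sketch (the class, method, and parameter names here are simplified illustrations, not the exact Hadoop API), the quoted decision looks like this:

```java
// Sketch of BlockManager's over-replication decision: excess replicas are
// only processed once no replica sits on a "stale" DataNode, i.e. one that
// has not yet sent a block report since the NameNode (re)started.
public class OverReplicationCheck {

    enum Result { OK, POSTPONE, PROCESS_OVER_REPLICATED }

    static Result check(int numCurrentReplica, int expectedReplication,
                        int replicasOnStaleNodes) {
        if (numCurrentReplica > expectedReplication) {
            if (replicasOnStaleNodes > 0) {
                // A stale node's replicas may already be deleted, so it is
                // not safe to schedule deletions yet.
                return Result.POSTPONE;
            }
            // All nodes have block-reported: safe to delete excess replicas.
            return Result.PROCESS_OVER_REPLICATED;
        }
        return Result.OK;
    }

    public static void main(String[] args) {
        // 3 replicas, target 2, one replica on a stale node: postponed.
        System.out.println(check(3, 2, 1)); // prints POSTPONE
        // Same block after the stale node reports: excess is processed.
        System.out.println(check(3, 2, 0)); // prints PROCESS_OVER_REPLICATED
    }
}
```

This is why the over-replication can sit unresolved indefinitely: the decision is re-evaluated only after the "stale" nodes block-report.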
So the key point is whether the DataNodes are "stale". I don't know how to force the nodes to send block reports other than by restarting them, so I restarted all the DataNodes and the over-replicated blocks were gone.
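For readers who would rather not restart DataNodes: newer Hadoop releases (2.7 and later, so likely not available on the CDH 5.1.2 in this thread, which is based on Hadoop 2.3) can force a block report from the command line. The hostname and port below are placeholders:

```shell
# Ask one DataNode to send a full block report to the NameNode.
# Available in Hadoop 2.7+ only; datanode1:50020 is a placeholder
# for the DataNode's IPC address.
hdfs dfsadmin -triggerBlockReport datanode1:50020
```

Once every node holding a replica of the affected blocks has reported, the NameNode should re-evaluate the postponed over-replication and schedule the excess replicas for deletion.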
Created 11-01-2016 09:20 PM
The setrep command just completed. However, fsck still shows over-replication.