Created 11-01-2016 09:09 PM
CentOS 6.6
CDH 5.1.2
Due to space pressure, I need to reduce replication factor of existing files from 3 to 2.
I ran a command like the following:
[hdfs]$ hdfs dfs -setrep -R -w 2 /path/of/files
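To confirm the new target replication has at least been recorded in the NameNode metadata (separately from whether the excess replicas have actually been deleted yet), the stat format string can print each file's replication factor. The path below is the same placeholder path as above:

```shell
# Print replication factor and file name ("%r %n") for each file.
# /path/of/files is a placeholder; substitute the real directory.
hdfs dfs -stat '%r %n' /path/of/files/*
```

If this prints 2 for every file while fsck still reports over-replicated blocks, the excess replicas simply have not been deleted yet.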
A warning appeared: "waiting time may be long for DECREASING the number of replications."
Tens of minutes later I am still waiting, and fsck still shows over-replication.
[hdfs]$ hdfs fsck /path/of/files
16/11/02 12:04:42 WARN ssl.FileBasedKeyStoresFactory: The property 'ssl.client.truststore.location' has not been set, no TrustStore will be loaded
Connecting to namenode via http://namenode1:50070
FSCK started by hdfs (auth:SIMPLE) from /192.168.88.38 for path /path/of/files at Wed Nov 02 12:04:43 HKT 2016
....Status: HEALTHY
 Total size:                    129643323 B
 Total dirs:                    1
 Total files:                   4
 Total symlinks:                0
 Total blocks (validated):      4 (avg. block size 32410830 B)
 Minimally replicated blocks:   4 (100.0 %)
 Over-replicated blocks:        4 (75.0 %)
 Under-replicated blocks:       0 (0.0 %)
 Mis-replicated blocks:         0 (0.0 %)
 Default replication factor:    3
 Average block replication:     3
 Corrupt blocks:                0
 Missing replicas:              0 (0.0 %)
 Number of data-nodes:          6
 Number of racks:               1
FSCK ended at Wed Nov 02 12:04:43 HKT 2016 in 1 milliseconds

The filesystem under path '/path/of/files' is HEALTHY
Is this normal? How long should the wait be?
Created on 11-01-2016 11:02 PM - edited 11-01-2016 11:03 PM
Answering my question...
The source code of org.apache.hadoop.hdfs.server.blockmanagement.BlockManager says:

  ...
  if (numCurrentReplica > expectedReplication) {
    if (num.replicasOnStaleNodes() > 0) {
      // If any of the replicas of this block are on nodes that are
      // considered "stale", then these replicas may in fact have
      // already been deleted. So, we cannot safely act on the
      // over-replication until a later point in time, when
      // the "stale" nodes have block reported.
      return MisReplicationResult.POSTPONE;
    }
  ...
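Reduced to a self-contained sketch (the class, method, and parameter names here are simplified illustrations, not the exact Hadoop API), the quoted decision looks like this:

```java
// Sketch of BlockManager's over-replication decision: excess replicas are
// only processed once no replica sits on a "stale" DataNode, i.e. one that
// has not yet sent a block report since the NameNode (re)started.
public class OverReplicationCheck {

    enum Result { OK, POSTPONE, PROCESS_OVER_REPLICATED }

    static Result check(int numCurrentReplica, int expectedReplication,
                        int replicasOnStaleNodes) {
        if (numCurrentReplica > expectedReplication) {
            if (replicasOnStaleNodes > 0) {
                // A stale node's replicas may already be deleted, so it is
                // not safe to schedule deletions yet.
                return Result.POSTPONE;
            }
            // All nodes have block-reported: safe to delete excess replicas.
            return Result.PROCESS_OVER_REPLICATED;
        }
        return Result.OK;
    }

    public static void main(String[] args) {
        // 3 replicas, target 2, one replica on a stale node: postponed.
        System.out.println(check(3, 2, 1)); // prints POSTPONE
        // Same block after the stale node reports: excess is processed.
        System.out.println(check(3, 2, 0)); // prints PROCESS_OVER_REPLICATED
    }
}
```

This is why the over-replication can sit unresolved indefinitely: the decision is re-evaluated only after the "stale" nodes block-report.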
So the key point is whether the DataNodes are "stale". I don't know how to force the nodes to send block reports other than by restarting them, so I restarted all the DataNodes and the over-replicated blocks were gone.
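For readers who would rather not restart DataNodes: newer Hadoop releases (2.7 and later, so likely not available on the CDH 5.1.2 in this thread, which is based on Hadoop 2.3) can force a block report from the command line. The hostname and port below are placeholders:

```shell
# Ask one DataNode to send a full block report to the NameNode.
# Available in Hadoop 2.7+ only; datanode1:50020 is a placeholder
# for the DataNode's IPC address.
hdfs dfsadmin -triggerBlockReport datanode1:50020
```

Once every node holding a replica of the affected blocks has reported, the NameNode should re-evaluate the postponed over-replication and schedule the excess replicas for deletion.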
Created 11-01-2016 09:20 PM
The setrep command just completed. However, fsck still shows over-replication.