Created 01-11-2016 07:21 PM
I have a requirement to periodically restart all cluster nodes at the machine level. Assume I've run an fsck before starting to confirm that all blocks are fully replicated. The question is: as I restart each node in turn, will the NameNode notice that blocks on that node are under-replicated and put them on the replication queue? If so, will it automatically remove those blocks from the queue when the DataNode comes back online and reports its blocks to the NN? Note that this is a hardware restart, so the Ambari rolling restart doesn't do the job.
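For reference, a pre-flight check along these lines reads the same counters that fsck and the NameNode UI report. This is a minimal sketch, assuming the client classpath carries the cluster's core-site.xml and hdfs-site.xml; the class name and exit-code convention are illustrative.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.hdfs.DistributedFileSystem;

public class ReplicationCheck {
    public static void main(String[] args) throws Exception {
        // Picks up fs.defaultFS from the configuration on the classpath.
        FileSystem fs = FileSystem.get(new Configuration());
        if (!(fs instanceof DistributedFileSystem)) {
            throw new IllegalStateException("Default filesystem is not HDFS");
        }
        DistributedFileSystem dfs = (DistributedFileSystem) fs;
        long under = dfs.getUnderReplicatedBlocksCount();
        long missing = dfs.getMissingBlocksCount();
        System.out.println("Under-replicated blocks: " + under);
        System.out.println("Missing blocks: " + missing);
        // Proceed with the next restart only when both counters are zero.
        System.exit(under == 0 && missing == 0 ? 0 : 1);
    }
}
```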
Created 01-11-2016 07:30 PM
I think it depends on the staleness properties in your hdfs-site configuration. If the server takes a long time to reboot, the NameNode will mark the DataNode as stale. Take a look at those properties.
Created 01-11-2016 08:10 PM
The three staleness properties control how long it takes before a node that has not been heard from is regarded as stale, and whether reads and writes to such nodes should be avoided. I don't think that's what we're looking for.
What I'm asking is whether I need to prevent the NameNode from replicating blocks that live on nodes that are only temporarily offline. I found the property dfs.namenode.replication.interval, which is described as controlling "the periodicity with which the NN computes replication work for DataNodes." It sounds like bumping it up temporarily might work. Opinion?
Created 01-11-2016 08:45 PM
A background thread in the NameNode scans a replication queue and schedules work on specific DataNodes to repair under- (or over-) replicated blocks based on the items in that queue. The queue is populated by a different background thread that monitors the heartbeat status of every DataNode. If the heartbeat monitor detects that a DataNode has entered the "dead" state, it removes its record of the replicas living on that DataNode. If this leaves a block under-replicated, that block is submitted to the replication queue.
Under a typical configuration, a DataNode is considered dead approximately 10 minutes after the NameNode receives its last heartbeat. This is governed by the configuration properties dfs.heartbeat.interval and dfs.namenode.heartbeat.recheck-interval, so if these have been tuned for some reason, my 10-minute assumption no longer holds.
```xml
<property>
  <name>dfs.heartbeat.interval</name>
  <value>3</value>
  <description>Determines datanode heartbeat interval in seconds.</description>
</property>
<property>
  <name>dfs.namenode.heartbeat.recheck-interval</name>
  <value>300000</value>
  <description>
    This time decides the interval to check for expired datanodes.
    With this value and dfs.heartbeat.interval, the interval of deciding
    the datanode is stale or not is also calculated.
    The unit of this configuration is millisecond.
  </description>
</property>
```
Until that time has passed, the NameNode will not queue replication work associated with that DataNode.
Bottom line: barring any unusual configuration tunings, it's a race for the node to restart in less than 10 minutes. Replication work will not get queued unless the node fails to restart within that time limit.
Whatever your plans for implementing this restart, I recommend testing before a full production roll-out.
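To make that race explicit in an automated restart loop, one option is to gate each reboot on the NameNode reporting the DataNode live again before moving on. A rough sketch, assuming the restarts are driven from an external orchestrator; the hostname matching and 10-second poll interval are placeholders to adapt:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.hdfs.DistributedFileSystem;
import org.apache.hadoop.hdfs.protocol.DatanodeInfo;
import org.apache.hadoop.hdfs.protocol.HdfsConstants;

public class WaitForDataNode {
    public static void main(String[] args) throws Exception {
        String host = args[0]; // hostname of the machine just restarted
        DistributedFileSystem dfs =
            (DistributedFileSystem) FileSystem.get(new Configuration());
        while (true) {
            boolean live = false;
            // Ask only for the DataNodes the NameNode currently considers live.
            for (DatanodeInfo dn :
                    dfs.getDataNodeStats(HdfsConstants.DatanodeReportType.LIVE)) {
                if (dn.getHostName().equalsIgnoreCase(host)) {
                    live = true;
                    break;
                }
            }
            if (live) {
                break;
            }
            Thread.sleep(10_000L); // poll every 10 seconds
        }
        System.out.println(host + " is live again; safe to restart the next node.");
    }
}
```

Pairing this with a check that the under-replicated block count has returned to zero gives a stronger signal than liveness alone.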
Created 01-12-2016 06:56 PM
One last detail: if the time runs out and the blocks go on the queue for replication, what happens when the node comes back online and reports its blocks? Are they stricken from the queue? What if they've already been replicated?
Created 01-11-2016 08:58 PM
@Peter Coates, the default DataNode heartbeat expiry is 630 seconds. If the total DN downtime (including startup time) is longer, the NN will trigger re-replication.
The heartbeat expiry interval is calculated from `dfs.namenode.heartbeat.recheck-interval` (default 300,000 ms, i.e. 300 seconds) and `dfs.heartbeat.interval` (default 3 seconds) as follows:

heartbeatExpireInterval = 2 * heartbeatRecheckInterval + 10 * heartbeatInterval

With the defaults, in seconds: 2 * 300 + 10 * 3 = 630.
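For a sanity check against a specific hdfs-site.xml, a small sketch like the following mirrors that arithmetic. Note that the NameNode works in milliseconds internally: dfs.heartbeat.interval is specified in seconds and dfs.namenode.heartbeat.recheck-interval in milliseconds.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hdfs.HdfsConfiguration;

public class HeartbeatExpiry {
    public static void main(String[] args) {
        // HdfsConfiguration loads hdfs-default.xml and hdfs-site.xml.
        Configuration conf = new HdfsConfiguration();
        long heartbeatSec = conf.getLong("dfs.heartbeat.interval", 3);
        long recheckMs =
            conf.getLong("dfs.namenode.heartbeat.recheck-interval", 300000);
        // Same formula, carried out in milliseconds:
        long expireMs = 2 * recheckMs + 10 * 1000 * heartbeatSec;
        // Defaults: 2 * 300000 + 10 * 1000 * 3 = 630000 ms = 630 seconds.
        System.out.println("DataNode heartbeat expiry: "
            + (expireMs / 1000) + " seconds");
    }
}
```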