What is the best way of handling corrupt or missing blocks?
You can use the command 'hdfs fsck / -list-corruptfileblocks' to list corrupt or missing blocks (note that 'hdfs fsck / -delete' does not list them - it deletes the affected files), and then follow the article above to fix them.
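As a sketch, assuming a recent Hadoop distribution, these are the usual fsck invocations for this step. The exact wording of the report varies slightly between Hadoop versions, and the '-delete' form is destructive, so treat it as a last resort:

```shell
# List every file that currently has a corrupt or missing block
# (read-only; safe to run on a live cluster at any time).
hdfs fsck / -list-corruptfileblocks

# Filter the full health report down to the problem lines; the
# report wording varies slightly between Hadoop versions.
hdfs fsck / | grep -iE 'corrupt|missing'

# DESTRUCTIVE: delete every file that has a corrupt block. Run this
# only after confirming the files can be restored from elsewhere.
hdfs fsck / -delete
```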
"The next step would be to determine the importance of the file: can it simply be removed and copied back into place, or does it contain sensitive data that needs to be regenerated?
If it's easy enough just to replace the file, that's the route I would take."
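If replacement is the chosen route, the fsck listing can be turned into restore commands. This is a sketch only: the BACKUP_ROOT location is a hypothetical example, and the assumed listing format (a block ID followed by the file path) should be checked against your own cluster before running anything this generates:

```shell
# Turn fsck's corrupt-file listing into restore commands: remove each
# damaged file, then re-copy it from a backup location. BACKUP_ROOT is
# an example path; verify the listing format on your cluster first.
BACKUP_ROOT=/backups
hdfs fsck / -list-corruptfileblocks 2>/dev/null |
  awk '$1 ~ /^blk_/ {print $NF}' | sort -u |
  while read -r path; do
    echo "hdfs dfs -rm $path"
    echo "hdfs dfs -put ${BACKUP_ROOT}${path} $path"
  done
```

The commands are echoed rather than executed so they can be reviewed before being run.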
To identify "corrupt" or "missing" blocks, run 'hdfs fsck /path/to/file' from the command line. Other tools also exist.
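For a single file, fsck can also show the block-level detail behind the summary. A minimal sketch, using standard fsck flags; '/path/to/file' stands in for the file being checked:

```shell
# Show, for one file, every block it is made of, each replica's
# location, and per-block health.
hdfs fsck /path/to/file -files -blocks -locations

# Count how many blocks fsck reports for the file; "blk_" appears
# once per block in the -blocks output.
hdfs fsck /path/to/file -files -blocks | grep -c 'blk_'
```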
HDFS will attempt to recover from this situation automatically. By default there are three replicas of every block in the cluster, so if HDFS detects that one replica of a block has become corrupt or damaged, it will create a new replica of that block from a known-good replica and mark the damaged one for deletion.
The known-good state is determined by checksums which are recorded alongside the block by each DataNode.
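Those stored checksums are also exposed at the file level through 'hdfs dfs -checksum', which can be used to compare a file against a known-good copy. A sketch, assuming the backup path shown is a hypothetical example:

```shell
# Print the file-level checksum HDFS derives from the per-block
# checksums stored by the DataNodes.
hdfs dfs -checksum /path/to/file

# Compare against a known-good copy (backup path is an example);
# identical checksum output means identical file contents.
a=$(hdfs dfs -checksum /path/to/file | awk '{print $NF}')
b=$(hdfs dfs -checksum /backups/path/to/file | awk '{print $NF}')
[ "$a" = "$b" ] && echo "checksums match"
```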
The chances of two replicas of the same block becoming damaged are very small indeed. HDFS can - and does - recover from this situation because it has a third replica, with its checksum, from which further replicas can be created.
The chances of three replicas of the same block becoming damaged are so remote that it would suggest a significant failure somewhere else in the cluster. If this situation does occur, and all three replicas are damaged, then 'hdfs fsck' will report that block as "corrupt" - i.e. HDFS cannot self-heal the block from any of its replicas.
Rebuilding the data behind a corrupt block is a lengthy process (like any data recovery process). If this situation does arise, a deep investigation of the health of the cluster as a whole should also be undertaken.