
What is the procedure for re-replication of lost blocks in the event of a disk or DataNode failure?


I would like to know the procedure, and which Java functions are in charge, in the process of data re-replication when there is a disk or DataNode failure. Which process or functions guide the system? Who is the conductor of this process?

1 ACCEPTED SOLUTION


Commenting to clarify that some of the advice above is not wrong, but it can be dangerous.

Starting with HDP 2.2 and later, the DataNode is more strict about where it expects block files to be. I do not recommend moving block files or folders on DataNodes around manually, unless you really know what you are doing.

@jovan karamacoski, to answer your original question - the NameNode drives the re-replication (specifically the BlockManager class within the NameNode). The ReplicationMonitor thread wakes up periodically and computes re-replication work for DataNodes.

The re-replication logic has multiple triggers, such as block reports, heartbeat timeouts, decommissioning, etc.
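To see the result of that work from the client side, here is a minimal sketch (not the NameNode internals; the path /data/example.txt is just a placeholder) that compares a file's target replication factor with the replica locations the NameNode actually reports for each block:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ReplicationCheck {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();      // picks up core-site.xml / hdfs-site.xml
        FileSystem fs = FileSystem.get(conf);
        Path file = new Path("/data/example.txt");     // hypothetical path

        FileStatus status = fs.getFileStatus(file);
        short target = status.getReplication();        // desired replication factor

        // One BlockLocation per block; getHosts() lists the DataNodes currently holding a replica.
        for (BlockLocation block : fs.getFileBlockLocations(status, 0, status.getLen())) {
            int actual = block.getHosts().length;
            System.out.printf("offset=%d target=%d actual=%d%s%n",
                    block.getOffset(), target, actual,
                    actual < target ? "  <-- under-replicated, NameNode will schedule copies" : "");
        }
        fs.close();
    }
}

Any block reported with fewer hosts than the target is exactly what the ReplicationMonitor picks up on its next pass.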


19 REPLIES

Master Guru

The NameNode keeps a list of all files, blocks, and block replicas in memory, essentially a gigantic hash table. DataNodes send it block reports to give it an overview of all the blocks in the system. Periodically, the NameNode checks whether every block has the desired replication level. If not, it schedules either block deletion (if the replication level is too high, which can happen if a node crashed and was later re-added to the cluster) or block copies.
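As a rough illustration of that loop, here is a toy sketch (deliberately simplified, not the actual BlockManager code) of the decision the NameNode makes for each block once the block reports are in:

import java.util.HashMap;
import java.util.Map;
import java.util.Set;

// Toy model of the NameNode's periodic replication check. Real HDFS does this
// inside the NameNode's BlockManager/ReplicationMonitor, not in client code.
class ToyReplicationMonitor {
    // blockId -> DataNodes currently reported as holding a replica (from block reports)
    private final Map<Long, Set<String>> blockReplicas = new HashMap<>();
    // blockId -> desired replication factor for the owning file
    private final Map<Long, Integer> targetReplication = new HashMap<>();

    void checkAllBlocks() {
        for (Map.Entry<Long, Set<String>> entry : blockReplicas.entrySet()) {
            long blockId = entry.getKey();
            int live = entry.getValue().size();
            int target = targetReplication.getOrDefault(blockId, 3);
            if (live < target) {
                scheduleCopy(blockId, target - live);        // too few replicas: copy from a live node
            } else if (live > target) {
                scheduleDeletion(blockId, live - target);    // too many: e.g. a dead node rejoined
            }
        }
    }

    private void scheduleCopy(long blockId, int missing)    { /* handed to a DataNode via its heartbeat reply */ }
    private void scheduleDeletion(long blockId, int excess) { /* a DataNode is told to drop its replica */ }
}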


@Benjamin Leonhardi

I know this procedure in outline, but I cannot find the detailed procedure for an under-replicated situation (the under-replication can be caused by intentional or unintentional folder or data replacement).

Master Guru

Not sure what you mean. Do you want to know WHY blocks get under-replicated? There are different possibilities for a block to vanish, but by and large it's simple:

a) The block replica doesn't get written in the first place.

This happens when there is a network or node failure during a write. HDFS will still report the write of a block as successful as long as at least one of the block replica writes succeeded. So if, for example, the third designated DataNode dies during the write, the write is still successful, but the block will be under-replicated. The write path doesn't try to repair this itself; it depends on the NameNode to schedule a copy later on (see the sketch after (b) below).

b) The block replicas get deleted later.

That can have lots of different reasons: a node dies, a drive dies, you delete a block file on the drive. Blocks, after all, are simple bog-standard Linux files named blk_xxxx, where xxxx is the block ID. They can also get corrupted (HDFS runs CRC checks regularly, and corrupted blocks are replaced with a healthy copy). And many more...
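To make (a) concrete, here is a minimal sketch (hypothetical path, default configuration assumed) that requests three replicas when creating a file. The create call succeeds even if a DataNode in the write pipeline fails mid-write; the NameNode notices the shortfall from the block reports afterwards and schedules the missing copies:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class WriteWithReplication {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        Path file = new Path("/tmp/pipeline-demo.txt");    // hypothetical path

        // Ask for 3 replicas and a 128 MB block size. The write is reported as
        // successful as long as at least one replica lands; any shortfall is
        // repaired later by the NameNode, not by this client.
        short replication = 3;
        long blockSize = 128L * 1024 * 1024;
        try (FSDataOutputStream out = fs.create(file, true, 4096, replication, blockSize)) {
            out.writeUTF("some payload");
        }
        fs.close();
    }
}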

So perhaps you should be a bit more specific with your question?


@Benjamin Leonhardi

I give a better idea of my intention in the other reply (read above to get my point).


@jovan karamacoski

If you want to manually force the blocks to replicate to fix under-replication, you can use the method outlined in this article.
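For example, one way to force re-replication of specific files through the public FileSystem API is to temporarily raise their replication factor and lower it again afterwards (a sketch, not necessarily the exact method in the article; the folder path and factors are placeholders):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.LocatedFileStatus;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.fs.RemoteIterator;

public class ForceReplication {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        Path folder = new Path("/data/MyFolder");          // hypothetical folder

        // Replication is a per-file property, so walk the folder recursively.
        // Raising the factor makes the NameNode schedule extra copies for every
        // block of the file; lowering it back removes the surplus replicas.
        RemoteIterator<LocatedFileStatus> files = fs.listFiles(folder, true);
        while (files.hasNext()) {
            Path file = files.next().getPath();
            fs.setReplication(file, (short) 4);
            // ... wait for the new copies to appear (check with hdfs fsck), then:
            // fs.setReplication(file, (short) 3);
        }
        fs.close();
    }
}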


@emaxwell

The idea is as follows. I have a folder with data in Rack1/Disk1/MyFolder. I want to manually delete this folder and place a copy in Rack2/DiskX/MyFolder. Is this possible? Your suggestion is useful for fixing the under-replication manually, but for the entire filesystem (as I understand it); my intention is to manipulate only part of the data, a single folder or a bunch of files. Is there any function to manipulate the location of folders manually?


An HDFS rebalance should optimize how the files are distributed across your network. Is there a particular reason why you want to manually determine where the replicas are stored?


@emaxwell

Yes, I have a reason. The content of the folder is important. I would like that content to reside on a newer disk with higher read/write speed and on a machine with a better network interface (higher throughput). I would like to decrease the access time if possible (a decrease of a few milliseconds in access time is of great importance to me).


@jovan karamacoski

You can enable multiple tiers of storage and specify where files should be stored. Check out the following link:

http://hortonworks.com/blog/heterogeneous-storages-hdfs/

If you really need to control which nodes the data goes to as well, you can set up the faster storage tier only on the faster nodes. This is not recommended because it will lead to an imbalance on the cluster, but it is possible to do.
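Here is a minimal sketch of the storage-policy approach (assuming Hadoop 2.6 or later, with SSD volumes tagged as [SSD] in dfs.datanode.data.dir; the path and the ONE_SSD policy are placeholders for whatever fits your data):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hdfs.DistributedFileSystem;

public class PinFolderToSsd {
    public static void main(String[] args) throws Exception {
        // Storage policies are HDFS-specific, so go through DistributedFileSystem.
        DistributedFileSystem dfs =
                (DistributedFileSystem) FileSystem.get(new Configuration());
        Path folder = new Path("/data/MyFolder");          // hypothetical folder

        // ONE_SSD keeps one replica on SSD and the rest on DISK; ALL_SSD would put
        // every replica on SSD-tagged volumes. Data already in the folder is only
        // moved after running the HDFS mover (hdfs mover -p /data/MyFolder).
        dfs.setStoragePolicy(folder, "ONE_SSD");
        dfs.close();
    }
}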