Support Questions
Find answers, ask questions, and share your expertise

What is the procedure for re-replication of lost blocks in the event of a disk or DataNode failure?

I would like to know the procedure, and which Java functions are in charge, in the process of data re-replication when there is a disk or DataNode failure. Which process or function guides the system? Which component is the conductor of this process?

1 ACCEPTED SOLUTION

Commenting to clarify that some of the advice above is not wrong, but it can be dangerous.

Starting with HDP 2.2 and later, the DataNode is more strict about where it expects block files to be. I do not recommend moving block files or folders on DataNodes around manually, unless you really know what you are doing.

@jovan karamacoski, to answer your original question - the NameNode drives the re-replication (specifically the BlockManager class within the NameNode). The ReplicationMonitor thread wakes up periodically and computes re-replication work for DataNodes.

The re-replication logic has multiple triggers like block reports, heartbeat timeouts, decommission etc.
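To make that flow concrete, here is a toy sketch in Python (illustrative only — the real logic is Java inside the NameNode's BlockManager, and the names below are invented, not HDFS's API) of the kind of scan a ReplicationMonitor-style thread performs: find blocks below the replication target and order the work by urgency.

```python
# Illustrative sketch (NOT HDFS source): a monitor-style pass that
# computes re-replication work for under-replicated blocks.

TARGET_REPLICATION = 3  # hypothetical target, like dfs.replication

def compute_replication_work(block_replicas):
    """Given {block_id: live_replica_count}, return (block, copies_needed)
    pairs, most urgent first (fewest surviving replicas)."""
    needed = {b: TARGET_REPLICATION - n
              for b, n in block_replicas.items() if n < TARGET_REPLICATION}
    # Blocks down to a single live replica are the most urgent to copy.
    return sorted(needed.items(), key=lambda item: -item[1])

work = compute_replication_work({"blk_1": 3, "blk_2": 1, "blk_3": 2})
# blk_2 needs two more copies, blk_3 needs one; blk_1 is fully replicated.
```

In the real NameNode the equivalent work is then handed to DataNodes as replication commands piggybacked on heartbeat responses.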


19 REPLIES

The NameNode has a list of all the file blocks and block replicas in memory: a gigantic hashtable. DataNodes send block reports to it to give it an overview of all the blocks in the system. Periodically, the NameNode checks whether all blocks have the desired replication level. If not, it schedules either block deletion (if the replication level is too high, which can happen if a node crashed and was re-added to the cluster) or block copies.
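A minimal toy model of that periodic check, in Python for illustration (the real logic lives in the NameNode's Java code; the function and names here are made up):

```python
# Toy model (not HDFS code) of the namenode's periodic replication check:
# compare each block's live replica count to the target and decide
# whether to schedule extra copies or deletions.

def schedule_actions(replica_counts, target=3):
    actions = {}
    for block, live in replica_counts.items():
        if live < target:
            actions[block] = ("copy", target - live)    # under-replicated
        elif live > target:
            actions[block] = ("delete", live - target)  # over-replicated
    return actions

# e.g. a node crashed and was later re-added, leaving one extra replica
# of blk_100, while blk_101 lost a replica:
print(schedule_actions({"blk_100": 4, "blk_101": 2, "blk_102": 3}))
```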

@Benjamin Leonhardi

I know this abbreviated procedure, but I cannot find the detailed procedure for an under-replicated situation (the under-replication can be caused by intentional or unintentional folder or data replacement).

Not sure what you mean. Do you want to know WHY blocks get under-replicated? There are different possibilities for a block to vanish, but by and large it's simple:

a) The block replica doesn't get written in the first place

This happens during a network or node failure in the middle of a write. HDFS will still report the write of a block as successful as long as at least one of the block replica writes succeeded. So if, for example, the third designated DataNode dies during the write process, the write is still successful, but the block will be under-replicated. The write process doesn't care; it depends on the NameNode to schedule a copy later on.
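Those write semantics can be sketched as follows — a toy model in Python, not the real DFSClient pipeline; the function and return shape are invented for illustration:

```python
# Toy model of the write semantics described above: a block write is
# reported successful if at least one replica lands; any missing
# replicas are left for the namenode to re-replicate later.

def write_block(datanode_results):
    """datanode_results: one boolean per designated datanode in the
    write pipeline, True if that replica was written successfully."""
    successful = sum(datanode_results)
    if successful == 0:
        raise IOError("block write failed on every datanode")
    return {"ok": True,
            "replicas": successful,
            "needs_rereplication": successful < len(datanode_results)}

# Third datanode died mid-write: still a successful write overall,
# but the block is flagged as under-replicated.
print(write_block([True, True, False]))
```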

b) The block replicas get deleted later.

That can have lots of different reasons: a node dies, a drive dies, you delete a block file on the drive. Blocks, after all, are simple bog-standard Linux files with a name like blk_xxxx, where xxxx is the block id. They can also get corrupted (HDFS does CRC checks regularly, and blocks that are corrupted will be replaced with a healthy copy). And many more...

So perhaps you should be a bit more specific with your question?

@Benjamin Leonhardi

I give a better idea of my intention in the other reply (read it to get my point).

@jovan karamacoski

If you want to manually force the blocks to replicate to fix under-replication, you can use the method outlined in this article.

@emaxwell

The idea is as follows. I have a folder with data in Rack1/Disk1/MyFolder. I want to manually delete this folder and place a copy in Rack2/DiskX/MyFolder. Is this possible? Your suggestion is useful when I fix the under-replication manually, but for the entire filesystem (as I understand it); my intention is to manipulate only part of the information, a single folder or a bunch of files. Is there any function to manipulate the location of folders manually?

An HDFS rebalance should optimize how the files are distributed across your cluster. Is there a particular reason why you want to manually determine where the replicas are stored?

@emaxwell

Yes, I have a reason. The content of the folder is important. I would like that content to reside on a newer disk with higher read/write speed, and on a machine with a better network interface (higher throughput). I would like to decrease the access time if possible (a few milliseconds' decrease in access time is of great importance to me).

@jovan karamacoski

You can enable multiple tiers of storage and specify where files should be stored to control this. Check out the following link:

http://hortonworks.com/blog/heterogeneous-storages-hdfs/

If you really need to control which nodes the data goes to as well, you can set up the faster storage only on the faster nodes. This is not recommended, because it will lead to an imbalance in the cluster, but it is possible.

Hello jovan,

Yes, you can simply move a folder. DataNodes are beautifully simple that way. We just did it on our cluster: stop HDFS, copy the folder to a new location, and change the location in the Ambari configuration.

Just try it with a single drive on a single node (using Ambari groups); you can run hadoop fsck / to check for under-replicated blocks after the test. A single drive will not lead to inconsistencies in any case.

In general, DataNodes do not care where the blocks are, as long as they still find the files with the right block id in the data folders.

You can theoretically do it on a running cluster, but you need to use Ambari groups, do it one server at a time, and make sure you do it quickly so the NameNode doesn't start to schedule a large number of replica additions because of the missing DataNode (HDFS waits a bit before it fixes under-replication, in case a DataNode is just rebooting).

@Benjamin Leonhardi

Well, the first solution, stopping HDFS, is not applicable in my case because this has to be done on a live system.

The second one, using Ambari groups, is again a manual task.

Let's now think about automation. For example, I want to move the folder, but I need to set up some mechanism, some function that will automate this process by prioritizing the folders and setting a kind of pricing for the servers.

Do you have an idea how this could be done (keeping in mind that the NameNode has to be informed about the relocation, so that it is not disturbed by this movement)?

Is there any backward mechanism in the DataNodes that can send information about block locations to the NameNode? (I am asking about the possibility of a backward mechanism because I am aware that the NameNode is in charge of block placement through the BlockPlacementPolicy mechanism.)
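On the backward mechanism: block reports are exactly that. Each DataNode periodically tells the NameNode which block replicas it actually holds, and the NameNode diffs this against what it expected that node to have. A toy sketch of that diff (illustrative Python, not HDFS internals; the function name is invented):

```python
# Sketch of how a block report lets the namenode learn where replicas
# actually are: diff what the datanode reports against what the
# namenode expected that datanode to hold.

def process_block_report(expected, reported):
    """expected/reported: sets of block ids for one datanode."""
    gained = reported - expected  # replicas the namenode didn't know about
    lost = expected - reported    # replicas that vanished from this node
    return gained, lost

gained, lost = process_block_report({"blk_1", "blk_2"}, {"blk_2", "blk_3"})
# The namenode would record blk_3's new location and treat blk_1's
# replica on this node as gone (possibly triggering re-replication).
```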

@jovan karamacoski

I think you might want to contact us for a services engagement. I strongly suspect that what you want to achieve and what you are asking about are not compatible.

On Hadoop, normally some files will be hot, not specific blocks. And files will be, by definition, widely distributed across nodes. So moving specific "hot" drives will not make you happy. Also, especially if you write, having some nodes with more network bandwidth than others doesn't sound like a winning combination, since the slow nodes will be a bottleneck and it's all linked together. That's how HDFS works.

If you want some files to be faster, you might want to look at HDFS storage tiering. Using that, you could put "hot" data on fast storage like SSDs. You could also look at node labels to put specific applications on fast nodes with lots of CPU, etc. But moving single drives? That will not make you happy. By definition, HDFS will not care: one balancer run later and all your careful planning is gone.

Oh, and lastly, there is no online move of DataNode data. You always need to stop a DataNode, change the storage layout, and start it again. It will then send an updated block report to the NameNode.

@Benjamin Leonhardi

Well, I think the best place for further discussion is a private chat or something similar. What is your suggestion?

LinkedIn? There is only one Benjamin Leonhardi there.


I am following you, but how can I see who I am following in this community?


@Arpit Agarwal

Thank you for the exact answer to my question; this is the particular answer I needed. I just need to find out whether there is a possibility to insert another trigger for re-replication, and if I cannot find a way to set a new trigger, I will try to tweak the reports somehow.

Hi @jovan karamacoski, are you able to share what your overall goal is? The NameNode detects DataNode failures in ~10 minutes and queues re-replication work. Disk failures can take longer and we are planning to make improvements in this area soon.
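For reference, the roughly 10-minute figure follows from the NameNode's dead-node timeout formula. With the usual HDFS defaults (dfs.namenode.heartbeat.recheck-interval = 300000 ms, dfs.heartbeat.interval = 3 s) it works out to 10.5 minutes, as this small calculation shows:

```python
# How the namenode's dead-datanode timeout is derived:
#   2 * dfs.namenode.heartbeat.recheck-interval + 10 * dfs.heartbeat.interval

recheck_interval_ms = 300_000  # dfs.namenode.heartbeat.recheck-interval (default)
heartbeat_interval_s = 3       # dfs.heartbeat.interval (default)

timeout_s = 2 * (recheck_interval_ms / 1000) + 10 * heartbeat_interval_s
print(timeout_s / 60)  # minutes before a datanode is declared dead
```

Lowering the recheck interval shortens detection time, at the cost of more aggressive (possibly spurious) re-replication when nodes merely reboot.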

The re-replication logic is complex. If you think your changes will be broadly useful please consider filing a bug in Apache HDFS Jira and submitting the changes as a patch. Best, Arpit.

@Arpit Agarwal

I am trying to find a solution that will be part of my PhD. I want to create a solution for one paradigm of 5G networks. At this moment I am learning and picking up pieces of the puzzle. I am afraid that my idea will be stolen 🙂 because I think it would solve one big issue in 5G networks. I am open to sharing my idea somehow, but... I don't know how to be protected in the end. That's why I cannot disclose my full idea.
