What is the procedure for re-replication of lost blocks after a disk or DataNode failure?

avatar

I would like to know the procedure for data re-replication after a disk or DataNode failure, and which Java functions are in charge of it. Which process or functions guide the system? Who is the conductor of this process?

1 ACCEPTED SOLUTION

avatar

Commenting to clarify that some of the advice above is not wrong but it can be dangerous.

Starting with HDP 2.2 and later, the DataNode is more strict about where it expects block files to be. I do not recommend moving block files or folders on DataNodes around manually, unless you really know what you are doing.

@jovan karamacoski, to answer your original question - the NameNode drives the re-replication (specifically the BlockManager class within the NameNode). The ReplicationMonitor thread wakes up periodically and computes re-replication work for DataNodes.

The re-replication logic has multiple triggers like block reports, heartbeat timeouts, decommission etc.
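To make the flow above concrete, here is a toy model of one pass of a periodic monitor that computes re-replication work. It is only a sketch: ToyBlockManager, remove_datanode and compute_replication_work are hypothetical names made up for illustration, not the real BlockManager API, and the real logic also considers racks, load, pending transfers and more.

```python
from dataclasses import dataclass, field


@dataclass
class ToyBlockManager:
    """Hypothetical, heavily simplified stand-in for the NameNode's bookkeeping."""
    target_replication: int = 3
    # block id -> set of DataNode names that currently hold a replica
    locations: dict = field(default_factory=dict)

    def remove_datanode(self, node):
        """Trigger: the node missed its heartbeats or was decommissioned."""
        for holders in self.locations.values():
            holders.discard(node)

    def compute_replication_work(self, live_nodes):
        """One monitor pass: return (block, source, target) copy tasks."""
        work = []
        # Blocks with the fewest surviving replicas are handled first.
        ordered = sorted(self.locations.items(), key=lambda kv: (len(kv[1]), kv[0]))
        for block, holders in ordered:
            missing = self.target_replication - len(holders)
            if missing <= 0 or not holders:
                continue  # fully replicated, or no replica left to copy from
            source = sorted(holders)[0]
            candidates = sorted(set(live_nodes) - holders)
            for target in candidates[:missing]:
                work.append((block, source, target))
        return work


bm = ToyBlockManager(locations={
    "blk_1": {"dn1", "dn2", "dn3"},
    "blk_2": {"dn2", "dn3", "dn4"},
})
bm.remove_datanode("dn1")  # simulate a dead DataNode
work = bm.compute_replication_work(live_nodes=["dn2", "dn3", "dn4"])
print(work)
```

In this toy run only blk_1 lost a replica, so the single task is to copy it from a surviving holder to dn4, the one live node that does not have it yet.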

19 REPLIES

avatar
Master Guru

Hello jovan,

Yes, you can simply move a folder. DataNodes are beautifully simple that way. We just did it on our cluster: stop HDFS, copy the folder to a new location, and change the location in the Ambari configuration.

Just try it with a single drive on a single node (using Ambari groups). You can run hadoop fsck / after the test to check for under-replicated blocks. A single drive will not lead to inconsistencies in any case.

In general, DataNodes do not care where the blocks are, as long as they still find the files with the right block ID in the data folders.

You can theoretically do it on a running cluster, but you need to use Ambari groups, do it one server at a time, and do it quickly so the NameNode doesn't start to schedule a large number of replica additions because of the missing DataNode (HDFS waits a bit before it fixes under-replication, in case a DataNode is just rebooting).

avatar

@Benjamin Leonhardi

Well, the first solution requires stopping HDFS, which is not applicable in my case because it has to be done on a live system.

The second one uses Ambari groups, which again is a manual task.

Let's now think about automation. For example, I want to move the folder, but I need some mechanism, some function, that will automate this process by prioritizing the folders and setting some kind of pricing for the servers.

Do you have any idea how this could be done (keeping in mind that the NameNode has to be informed about the move so that it is not disturbed by it)?

Is there any backward mechanism in the DataNodes that can send information about block locations to the NameNode? (I am asking about a backward mechanism because I am aware that the NameNode is in charge of block placement through the BlockPlacementPolicy mechanism.)

avatar
Master Guru

@jovan karamacoski

I think you might want to contact us for a services engagement. I strongly suspect that what you want to achieve and what you are asking about are not compatible.

On Hadoop, normally some files will be hot, not specific blocks. And files will by definition be widely distributed across nodes. So moving specific "hot" drives will not make you happy. Also, especially for writes, having some nodes with more network bandwidth than others doesn't sound like a winning combination, since slow nodes will be a bottleneck and it's all linked together. That's how HDFS works.

If you want some files to be faster, you might want to look at HDFS storage tiering. Using that, you could put "hot" data on fast storage like SSDs. You could also look at node labels to put specific applications on fast nodes with lots of CPU, etc. But moving single drives? That will not make you happy. By definition, HDFS will not care. One balancer run later and all your careful planning is gone.

Oh, and lastly, there is no online move of DataNodes. You always need to stop a DataNode, change the storage layout, and start it again. It will then send the updated block report to the NameNode.

avatar
@Benjamin Leonhardi

Well, I think the best place for further discussion is a private chat or something similar. What is your suggestion?

avatar
Master Guru

LinkedIn? There is only one Benjamin Leonhardi there.

avatar
New Contributor

I am following you, but how can I see who I am following in this community?

avatar

@Arpit Agarwal

Thank you for the exact answer to my question. This is the particular answer I needed. I just need to find out whether it is possible to insert another trigger for re-replication, and if I cannot find a way to set a new trigger, I will try to tweak the reports somehow.

avatar

Hi @jovan karamacoski, are you able to share what your overall goal is? The NameNode detects DataNode failures in ~10 minutes and queues re-replication work. Disk failures can take longer and we are planning to make improvements in this area soon.
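For reference, the ~10 minute figure can be reproduced from two settings. The sketch below assumes the stock defaults for dfs.heartbeat.interval (3 seconds) and dfs.namenode.heartbeat.recheck-interval (300000 ms) and the commonly documented formula of 2 x recheck interval + 10 x heartbeat interval; the function name is made up for illustration, and your version or cluster configuration may differ.

```python
def dead_node_timeout_seconds(heartbeat_interval_s=3, recheck_interval_ms=300_000):
    """Time without heartbeats before a DataNode is considered dead.

    Assumed formula: 2 * recheck interval + 10 * heartbeat interval.
    Defaults mirror the assumed stock values of dfs.heartbeat.interval and
    dfs.namenode.heartbeat.recheck-interval.
    """
    return 2 * recheck_interval_ms / 1000 + 10 * heartbeat_interval_s


timeout_s = dead_node_timeout_seconds()
print(timeout_s, "seconds =", timeout_s / 60, "minutes")
```

With the assumed defaults this gives 630 seconds, i.e. 10.5 minutes. Lowering the recheck interval shortens detection, at the cost of more re-replication traffic when a node is only briefly unreachable.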

The re-replication logic is complex. If you think your changes will be broadly useful please consider filing a bug in Apache HDFS Jira and submitting the changes as a patch. Best, Arpit.

avatar

@Arpit Agarwal

I am trying to find a solution that will be part of my PhD. I want to create a solution for one paradigm of 5G networks. At the moment I am learning and collecting pieces of the puzzle. I am afraid that my idea will be stolen 🙂 because I think it would solve one big issue in 5G networks. I am open to sharing my idea somehow, but I don't know how to protect it in the end. That's why I cannot disclose my full idea.