
Drive failure for Datanode

I have a couple of worker datanodes, each with multiple drive mount points for HDFS. One of these mount points failed, and the failure took place while the cluster was offline. To avoid any problems, the cluster was brought back online without starting the ambari-agent or the other services on the node with the failed mount point.

I was wondering what the best way is to reintegrate this node into the cluster. Will there be any issues or data loss if only the failed mount point is replaced and the ambari-agent and other services on the node are then started? Or is there a particular approach to follow?




In our environment (in fact, in any environment) disk failures are very common. We usually do not shut down the affected node; instead, we note the node details so the faulty drive can be replaced later. When a drive fails, the Namenode identifies the faulty drive and the missing blocks, and it usually takes care of those blocks by copying them from another datanode or drive. In my opinion, you don't really have to worry about a faulty disk on a datanode: you can bring the node back and integrate it with the cluster, so that the cluster can use the remaining good disks on that node.


If the node is brought back online with a new drive in place of the failed one and the services are started, will it cause any issues for the existing data on the cluster, which has changed in the meantime?


No. When you bring the node back with a new drive, the NN sees a new working drive and new blocks start being allocated to it. In any of the above operations, I don’t see any reason for data on other drives or on other nodes to be corrupted or deleted.

This happens all the time in production environments.


Each block of data has 3 replicas across the nodes by default (depending on your configuration). In your particular case, when you brought the cluster back up, the Namenode was expecting x blocks of data to be on the node that is still shut down. Regardless of whether the ambari-agent is running, a datanode sends a block report to the Namenode when it starts up. If the Namenode "sees" in that block report that blocks which existed before you shut down the cluster are missing, it will simply replicate those blocks from a healthy datanode to other nodes.

So in this example a block needs to have 3 replicas across the cluster. If, after receiving the block reports from all datanodes, the Namenode sees that certain blocks do not comply with that rule, it will replicate those blocks to other healthy nodes automatically.
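To make the replica-counting idea concrete, here is a minimal sketch in Python. It is a toy model, not how the Namenode is actually implemented: the function name `find_under_replicated`, the node names, and the block IDs are all made up for illustration, and a replication factor of 3 is assumed.

```python
from collections import defaultdict


def find_under_replicated(block_reports, expected_blocks, replication=3):
    """Return {block_id: missing_replica_count} for blocks below the target.

    block_reports: dict mapping a datanode name to the set of block IDs
                   it reported (hypothetical, simplified block report).
    expected_blocks: iterable of block IDs the Namenode knows about.
    """
    live_replicas = defaultdict(int)
    for blocks in block_reports.values():
        for block_id in blocks:
            live_replicas[block_id] += 1

    return {
        block_id: replication - live_replicas[block_id]
        for block_id in expected_blocks
        if live_replicas[block_id] < replication
    }


# Hypothetical cluster: dn3 lost the drive that held its copy of blk_2.
reports = {
    "dn1": {"blk_1", "blk_2"},
    "dn2": {"blk_1", "blk_2"},
    "dn3": {"blk_1"},
}
under = find_under_replicated(reports, ["blk_1", "blk_2"])
print(under)  # {'blk_2': 1}
```

In this model blk_2 is one replica short, so it would be copied from dn1 or dn2 to another healthy node. On a real cluster, `hdfs fsck /` reports under-replicated and missing blocks, which is a reasonable thing to check before and after bringing the node back.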

As for starting the node back up, you are safe to do so, as HDFS is prepared to "deal" with this kind of situation.

One thing to check is whether you set up local directories on the failed mount point for any of the services that were running on the node. Make sure that is not the case; if it isn't, you are fine to start the services and the ambari-agent again.
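That check can be sketched as a small Python helper. The function `dirs_on_mount` and all paths below are hypothetical; the property names are the usual YARN NodeManager directory settings, but you would fill in whatever directories your own services are configured with. The comparison is purely on path components, so it does not resolve symlinks or nested mounts.

```python
from pathlib import PurePosixPath


def dirs_on_mount(configured_dirs, failed_mount):
    """Return the configured directories that sit on or under failed_mount."""
    mount = PurePosixPath(failed_mount)
    affected = []
    for d in configured_dirs:
        path = PurePosixPath(d)
        # Compare whole path components, so /grid/10 does not match /grid/1.
        if path == mount or mount in path.parents:
            affected.append(d)
    return affected


# Hypothetical configuration values copied from the node's service configs.
service_dirs = {
    "yarn.nodemanager.local-dirs": [
        "/grid/0/hadoop/yarn/local",
        "/grid/1/hadoop/yarn/local",
    ],
    "yarn.nodemanager.log-dirs": ["/grid/0/hadoop/yarn/log"],
}
failed = "/grid/1"  # hypothetical failed mount point

for prop, dirs in service_dirs.items():
    hits = dirs_on_mount(dirs, failed)
    if hits:
        print(f"{prop}: {hits}")
```

Any property that gets printed points at a directory on the failed mount point, which would need to be recreated on the replacement drive (or removed from the configuration) before the services are started.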