
Replace faulty disk on a worker machine


We have a worker machine in the Ambari cluster that has a faulty disk.

The disk device name is /dev/sdb, so we need to replace the faulty disk with a new disk.

What are the steps that need to be done on the worker machine before replacing the disk, and after replacing the disk?

Michael-Bronson
1 ACCEPTED SOLUTION

Expert Contributor

@Michael Bronson Assuming you are talking about a DataNode: if you can replace this disk trivially, that is, the operation is simply to pull the disk out of a JBOD, then you can shut down the DataNode, replace the disk, format it, and mount it back. HDFS will detect that it has lost a set of blocks (it has probably already done so, since the disk is faulty and no I/O is happening to it) and re-replicate them. You can check whether you have any under-replicated blocks in your cluster. You can replace the disk and things will return to normal. There is, however, a small hitch: the new disk will not have the same amount of data as the other disks. If you are running Hadoop 3.0 (still in beta and not production ready), you can run the diskBalancer tool, which will move data from the other disks onto the new disk. Generally, this will not be an issue.
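For example, you can check for under-replicated blocks from any node with an HDFS client, and (on Hadoop 3.0 only) generate and execute a diskBalancer plan for the rebuilt DataNode. The hostname below is just a placeholder, and the plan file location is printed by the -plan command:

# check for under-replicated blocks
hdfs dfsadmin -report | grep -i 'under replicated'
hdfs fsck / | grep -i 'under.replicated'

# Hadoop 3.0 only: move data from the other disks onto the new one
hdfs diskbalancer -plan worker01.example.com        # prints the location of the generated plan file
hdfs diskbalancer -execute <plan-file>.plan.json    # use the plan file printed above
hdfs diskbalancer -query worker01.example.com       # check progress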

If the disk replacement in your machines is not as straightforward as described above, you can ask Ambari to put this machine into a maintenance state. That will tell HDFS not to re-replicate all the blocks after the 10-minute window (by default) after which a machine is declared dead. You can do that and then perform the operation.
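For reference, that dead-node window comes from two HDFS settings; with the defaults it works out to roughly 2 x 300 s + 10 x 3 s, i.e. about 10.5 minutes. You can confirm the values in your cluster with:

hdfs getconf -confKey dfs.namenode.heartbeat.recheck-interval   # default 300000 (milliseconds)
hdfs getconf -confKey dfs.heartbeat.interval                    # default 3 (seconds)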

Just so that you are aware, HDFS supports a notion of failed volumes. So if you have a DataNode with a large number of disks, say 8, you can set the failed-volume tolerance to something like 2. This makes sure the node keeps working even in the face of two disks failing. If you do that, you can replace disks during a scheduled maintenance window with downtime.
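The relevant property is dfs.datanode.failed.volumes.tolerated in hdfs-site.xml (default 0, meaning a single failed disk stops the DataNode); in an Ambari-managed cluster you would change it through the HDFS configuration page rather than editing the file by hand. You can check the current value with:

hdfs getconf -confKey dfs.datanode.failed.volumes.tolerated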

Please let me know if you have any more questions or need more help on this.




Thank you for the answer. So can we summarize the steps:

1. shut down the worker machine (stop all components)

2. replace the faulty disk with a new disk (same size)

3. start up the worker machine

4. create an ext4 file system on the new disk - sdb - by mkfs (rough commands for steps 4 and 5 are sketched below)

5. start all worker components (this will create the relevant folders under the sdb disk)

Please let me know if my steps are correct.
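Something like this is what I have in mind for steps 4 and 5, assuming the disk was mounted at /grid/sdb before - the mount point and ownership below are only examples and must match the existing dfs.datanode.data.dir layout on the other disks:

lsblk                          # confirm /dev/sdb is the new, empty disk
mkfs.ext4 /dev/sdb             # step 4: create the ext4 file system
mkdir -p /grid/sdb             # example mount point - reuse the same one as before
mount /dev/sdb /grid/sdb
blkid /dev/sdb                 # take the new UUID and update the /etc/fstab entry
chown hdfs:hadoop /grid/sdb    # typical HDP ownership - match the other data disks
# then start the DataNode / NodeManager components from Ambari (step 5)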

Second, about "Ambari to put this machine into a maintenance state" - how do we set that in Ambari, and at which of my steps does it need to be set?

Michael-Bronson

Expert Contributor
@Michael Bronson

The steps described by you look good. If you have Ambari running against this cluster, you should be able to find an option called "Maintenance Mode" in the menus.

Here is some documentation about that:

https://docs.hortonworks.com/HDPDocuments/Ambari-2.6.0.0/bk_ambari-operations/content/setting_mainte...

It is not needed for replacing your disks, but it will avoid spurious alerts in your system.
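In the Ambari web UI this is usually under Hosts > (select the worker host) > Host Actions > Turn On Maintenance Mode; you would turn it on before step 1 and off again after step 5. If you prefer the REST API, a call along these lines should work - the credentials, server, cluster name, and hostname below are placeholders, so verify against your Ambari version:

curl -u admin:admin -H 'X-Requested-By: ambari' -X PUT \
  -d '{"RequestInfo":{"context":"Turn On Maintenance Mode"},"Body":{"Host":{"maintenance_state":"ON"}}}' \
  http://ambari-server.example.com:8080/api/v1/clusters/MyCluster/hosts/worker01.example.com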