Replace faulty disk on the worker machine
Labels: Apache Ambari, Apache Hadoop
Created 11-30-2017 06:03 PM
We have a worker machine in the Ambari cluster that has a faulty disk. The disk device name is /dev/sdb, so we need to replace the faulty disk with a new disk.
What are the steps that need to be done on the worker machine before replacing the disk, and after replacing the disk?
Created 11-30-2017 06:46 PM
@Michael Bronson Assuming you are talking about a DataNode: if you can replace this disk trivially, that is, the operation is simply to pull the disk out of a JBOD, then you can shut down the DataNode, replace the disk, format and mount the new one, and bring the node back up. HDFS will detect that it has lost a set of blocks (it has probably done so already, since the disk is faulty and no I/O is happening to it) and re-replicate them. You can check whether you have any under-replicated blocks in your cluster; replace the disk and things will return to normal. There is, however, a small hitch: the new disk will not hold the same amount of data as the other disks. If you are running Hadoop 3.0 (it is still in beta and not production ready), you can run the disk balancer tool, which will move data from the other disks onto the new one. Generally, this will not be an issue.
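For example, a minimal sketch of the check and the rebalancing step (the host name and plan path are placeholders; the disk balancer part assumes Hadoop 3.0 with dfs.disk.balancer.enabled set to true in hdfs-site.xml):

```bash
# Look for under-replicated blocks reported by the NameNode
hdfs fsck / | grep -i 'under-replicated'

# Cluster-wide capacity/health summary
hdfs dfsadmin -report | head -n 20

# Hadoop 3.0 only: spread data from the old disks onto the new one.
# -plan writes a JSON plan under /system/diskbalancer/<date>/ on HDFS;
# pass that plan file to -execute.
hdfs diskbalancer -plan worker1.example.com
hdfs diskbalancer -execute /system/diskbalancer/<date>/worker1.example.com.plan.json
```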
If the disk replacement on your machines is not as straightforward as I described, you can ask Ambari to put this machine into a maintenance state. That tells HDFS not to re-replicate all the blocks after the window (10 minutes by default) after which a machine is declared dead. You can do that and then perform the operation.
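If you want to script that, maintenance mode can also be toggled through the Ambari REST API. A sketch, assuming Ambari on port 8080, a cluster named MyCluster, admin credentials, and a placeholder host name:

```bash
# Turn maintenance mode ON for the worker before pulling the disk
curl -u admin:admin -H 'X-Requested-By: ambari' -X PUT \
  -d '{"RequestInfo":{"context":"Disk replacement"},"Body":{"Hosts":{"maintenance_state":"ON"}}}' \
  http://ambari.example.com:8080/api/v1/clusters/MyCluster/hosts/worker1.example.com

# Turn it back OFF once the new disk is formatted and mounted
curl -u admin:admin -H 'X-Requested-By: ambari' -X PUT \
  -d '{"RequestInfo":{"context":"Disk replaced"},"Body":{"Hosts":{"maintenance_state":"OFF"}}}' \
  http://ambari.example.com:8080/api/v1/clusters/MyCluster/hosts/worker1.example.com
```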
Just so that you are aware, HDFS supports a notion of failed volumes. If you have a DataNode with a large number of disks, say 8, you can set the failed-volume tolerance to something like 2. This makes sure the node keeps working even with two failed disks. If you do that, you can replace disks during a scheduled maintenance window with downtime.
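The setting in question is dfs.datanode.failed.volumes.tolerated in hdfs-site.xml (default 0, meaning any volume failure stops the DataNode); the value here is just the example tolerance from above:

```xml
<!-- Allow the DataNode to keep running with up to 2 failed data volumes -->
<property>
  <name>dfs.datanode.failed.volumes.tolerated</name>
  <value>2</value>
</property>
```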
Please let me know if you have any more questions or need more help on this.
Created 11-30-2017 07:05 PM
Thank you for the answer. So can we summarize the steps as:
1. Shut down the worker machine (stop all components).
2. Replace the faulty disk with a new disk (same size).
3. Start up the worker machine.
4. Create an ext4 file system on the new disk, sdb (with mkfs), and mount it back at its original mount point.
5. Start all worker components (this will create the relevant folders under the sdb disk).
Please let me know if my steps are correct; a sketch of the disk-side commands is below.
Second, about "Ambari to put this machine into a maintenance state": how do I set that in Ambari, and at which of my steps does it need to be set?
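Something like this for step 4, a minimal sketch of the disk-side commands (/grid/sdb is a placeholder; use the path from your dfs.datanode.data.dir):

```bash
# Create the filesystem on the replacement disk
mkfs.ext4 /dev/sdb

# Recreate the mount point and mount the new disk
mkdir -p /grid/sdb
mount /dev/sdb /grid/sdb

# The new disk has a new UUID, so update the /etc/fstab entry accordingly
blkid /dev/sdb
```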
Created 12-01-2017 06:19 PM
The steps you described look good. If you have Ambari running against this cluster, you should be able to find an option called "Maintenance Mode" in the host menus.
Here is some documentation about that:
It is not needed in order to replace your disks, but it will avoid spurious alerts in your system.
