Support Questions

Find answers, ask questions, and share your expertise

What are the steps an operator should take to replace a disk in a data node? Correction - NameNode

Contributor

A partner I am working with is looking for instructions to replace a disk in a DataNode host.

They could find the instructions for replacing a disk on a DataNode here - http://www.cloudera.com/content/www/en-us/documentation/manager/5-0-x/Cloudera-Manager-Managing-Clus...

But they could not find anything on the steps an operator should take to replace a disk in a NameNode.

Looking for some steps, or a pointer to a doc that might have them.

Thanks.

1 ACCEPTED SOLUTION

Master Mentor

@vsomani@hortonworks.com

NameNode disk failure. There are a couple of ifs:

Scenario 1 - HA + RAID 10

If HA is in place, fail over to the standby (assuming the active NN's disk failed). If RAID 10 is also configured for the NN, you are safe and have enough time to replace the failed disk.

"When a single disk in a RAID 10 disk array fails, the disk array status changes to Degraded. The disk array remains functional because the data on the Failed disk is also stored on the other member of its mirrored pair.When ever a disk fails, replace it as soon as possible. If a hot spare disk is available, the controller can rebuild the data on the disk automatically. If a hot spare disk is not available, you will need to replace the failed disk and then initiate a rebuild. "

Scenario 2 - No HA, no RAID, but an NN backup is in place and "dfs.namenode.name.dir" is writing to multiple disks.

You are safe, as the NN metadata is written to multiple disks, so you can remove the failed disk's location from Ambari and let the operator recover from the disk failure.
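
For reference, redundant metadata directories are a comma-separated list in hdfs-site.xml (in Ambari, under the HDFS NameNode directories config). The paths below are only illustrative:

<property>
  <name>dfs.namenode.name.dir</name>
  <!-- Each directory holds a full copy of the fsimage and edit logs;
       losing one disk does not lose the metadata (paths are examples) -->
  <value>/hadoop/hdfs/namenode,/mnt/disk2/hdfs/namenode</value>
</property>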

Scenario 3 - Bad design: no HA, no RAID, dfs.namenode.name.dir writing to a single disk.

The cluster is down. Back up everything that you can from the NN. Let the operator replace the disk. Restore the backup and then start the troubleshooting process.
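
If the NN is still reachable, a metadata backup can be as simple as the following sketch. It assumes a scratch directory /backup/nn exists; fetchImage pulls the latest fsimage from the NameNode to the local directory.

# Put HDFS in safe mode and flush the latest namespace to disk
hdfs dfsadmin -safemode enter
hdfs dfsadmin -saveNamespace

# Pull the most recent fsimage from the NameNode to a backup location
hdfs dfsadmin -fetchImage /backup/nn

# Also copy the raw metadata directory if the disk is still readable
# (path is illustrative; use your dfs.namenode.name.dir value)
cp -r /hadoop/hdfs/namenode /backup/nn/namenode-copy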

Good discussion here.


5 REPLIES


@vsomani@hortonworks.com

Steps to replace a disk in slave nodes, or to perform maintenance on slave node servers, remain the same irrespective of Hadoop distribution. We don't have dedicated steps in our docs AFAIK, but below should be the steps.

1. Decommission the DataNode and all services running on it, i.e. NodeManager, HBase RegionServer, DataNode, etc. Below are references for the same.

http://docs.hortonworks.com/HDPDocuments/Ambari-2.1.2.0/bk_Ambari_Users_Guide/content/_decommissioni...

http://docs.hortonworks.com/HDPDocuments/HDP2/HDP-2.3.2/bk_Sys_Admin_Guides/content/ch_slave_nodes.h...

2. Replace the disks or perform any other server maintenance tasks.

3. Recommission the node.

4. Start all service components on the node.

5. Run fsck for HDFS to ensure that HDFS is in a healthy state (see the sketch below). The fsck report might show a few over-replicated blocks, which will be fixed automatically.
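
A minimal health check after recommissioning might look like this (run as the HDFS superuser; a healthy run ends with "The filesystem under path '/' is HEALTHY"):

# Summary of missing / corrupt / under- and over-replicated blocks
hdfs fsck /

# Cluster-wide DataNode and capacity report, to confirm the node rejoined
hdfs dfsadmin -report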


@Neeraj Should we keep this answer or remove it? Looks like @vsomani@hortonworks.com changed the question. I have created an article out of it: http://community.hortonworks.com/articles/3131/replacing-disk-on-datanode-hosts.html


Contributor

Thanks Neeraj.

In this case, the partner has HA but no RAID, so they'll just need to fail over to the standby NN.
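
For reference, a manual failover in an HA pair can be triggered with hdfs haadmin. This is a sketch assuming the NameNode service IDs are nn1 (active, failed disk) and nn2 (standby) - check yours with -getServiceState:

# Confirm which NameNode is active and which is standby
hdfs haadmin -getServiceState nn1
hdfs haadmin -getServiceState nn2

# Fail over from nn1 to nn2 so nn1's disk can be replaced
hdfs haadmin -failover nn1 nn2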
