Best way to reboot HDFS/Yarn node - decommission first?


What's the best way to reboot a CDH 5.3.1 node? Will decommissioning and recommissioning the node help? Or is it sufficient to just stop all services on the node, reboot it, and restart?

 

The nodes I need to reboot are HDFS and YARN nodes, and some are HDFS NameNodes. We are running in HA mode, so is it safe to assume the order in which the active/standby NameNodes are restarted doesn't matter?

 

FWIW, we are doing this because there is a kernel bug in Ubuntu 12 that causes ext4 filesystem errors on our YARN nodes:

 

Feb 11 13:17:11 abacus105 kernel: [8638490.380039] EXT4-fs warning (device sdd5): ext4_da_update_reserve_space:362: ino 6553807, allocated 1 with only 0 reserved metadata blocks (releasing 4 blocks with reserved 38 data blocks)

 

These errors occur across all YARN cache filesystems on affected hosts but not on the HDFS data filesystems. The errors cause a kernel stack trace to be dumped but don't seem to be causing any data integrity problems. Kernels at version 3.13 do not seem to be affected.

1 REPLY

Re: Best way to reboot HDFS/Yarn node - decommission first?

For the DataNodes, if the reboot means the node will only be out of action for under 10m30s, then you are fine to just stop the HDFS and YARN services and reboot the node. The 10m30s timeout is the default for a NameNode to consider a DataNode dead before it begins to schedule block copying across the cluster (to maintain a valid replication factor for each file).
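
For reference, that 10m30s comes from two hdfs-site.xml properties (the figures below are the stock Hadoop 2 defaults, so double-check them against your own configuration):

    dfs.namenode.heartbeat.recheck-interval = 300000   # 5 minutes, in milliseconds
    dfs.heartbeat.interval                  = 3        # seconds

    dead-node timeout = 2 * recheck-interval + 10 * heartbeat-interval
                      = 2 * 300s + 10 * 3s
                      = 630s = 10m30s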

If the maintenance window will be longer than that, you can temporarily raise the timeout to a larger value and undo the change after the maintenance work is done.
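
As a rough sketch (assuming you edit hdfs-site.xml by hand; on a Cloudera Manager managed cluster you would normally put this in the HDFS advanced configuration snippet / safety valve instead), bumping the recheck interval to one hour would look like:

    <!-- hdfs-site.xml on the NameNodes: temporary value for the maintenance window -->
    <property>
      <name>dfs.namenode.heartbeat.recheck-interval</name>
      <!-- 3600000 ms = 1 hour, so the dead-node timeout becomes roughly 2 hours -->
      <value>3600000</value>
    </property>

The NameNodes need to pick the new value up (in practice a restart of the NameNode roles, one at a time in HA), and remember to revert the change once the maintenance is finished.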

For the NameNode roles, it is best to reboot the standby first. Once it is back up and healthy, perform a failover and then reboot the other node. All of this assumes you want to keep the cluster online the whole time.
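
If you drive the failover from the command line, something along these lines should work (nn1 and nn2 are placeholders for whatever NameNode IDs you have configured under dfs.ha.namenodes, so check those first):

    hdfs haadmin -getServiceState nn1   # confirm which NameNode is currently active
    hdfs haadmin -getServiceState nn2
    hdfs haadmin -failover nn1 nn2      # make nn2 active before rebooting nn1

With automatic failover (ZKFC) enabled you can also trigger the failover from Cloudera Manager, or simply stop the active NameNode role and let the failover controller promote the standby.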



Regards,
Gautam Gopalakrishnan