As a Hadoop Admin, it’s our responsibility to perform Hadoop cluster maintenance frequently.
Let’s see what we can do to keep our big elephant happy! ;)
1. FileSystem Checks
We should check the health of HDFS periodically by running the fsck command:
sudo -u hdfs hadoop fsck /
This command contacts the Namenode and recursively checks every file under the provided path.
Below is sample output of the fsck command:
sudo -u hdfs hadoop fsck /
FSCK started by hdfs (auth:SIMPLE) from /10.0.2.15 for path / at Wed Apr 06 18:47:37 UTC 2016
Total size: 1842803118 B
Total dirs: 4612
Total files: 11123
Total symlinks: 0 (Files currently being written: 4)
Total blocks (validated): 11109 (avg. block size 165883 B) (Total open file blocks (not validated): 1)
Minimally replicated blocks: 11109 (100.0 %)
Over-replicated blocks: 0 (0.0 %)
Under-replicated blocks: 11109 (100.0 %)
Mis-replicated blocks: 0 (0.0 %)
Default replication factor: 3
Average block replication: 1.0
Corrupt blocks: 0
Missing replicas: 22232 (66.680664 %)
Number of data-nodes: 1
Number of racks: 1
FSCK ended at Wed Apr 06 18:46:54 UTC 2016 in 1126 milliseconds
The filesystem under path '/' is HEALTHY
We can schedule a weekly cron job on an edge node that runs fsck and emails the output to the Hadoop Admin.
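A minimal sketch of such a wrapper script; the schedule, script path and recipient address are placeholders, and it assumes mail is configured on the edge node:

```shell
#!/bin/bash
# fsck_report.sh -- run weekly from cron, e.g.:
#   0 6 * * 0  /opt/scripts/fsck_report.sh
# Assumptions: mail(1) works on this host; recipient is a placeholder.
REPORT=$(mktemp)
if command -v hdfs >/dev/null 2>&1; then
  sudo -u hdfs hdfs fsck / > "$REPORT" 2>&1
else
  echo "hdfs command not found" > "$REPORT"
fi
mail -s "Weekly HDFS fsck report" admin@example.com < "$REPORT" || cat "$REPORT"
rm -f "$REPORT"
```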
2. HDFS Balancer utility
Over time, data becomes unbalanced across the Datanodes in the cluster. This can happen because of maintenance activity on a specific Datanode, power failures, hardware failures, kernel panics, unexpected reboots, etc. Because of data locality, the Datanodes holding more data get churned harder, and an unbalanced cluster can directly affect your MapReduce job performance.
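The balancer utility fixes this by moving blocks from over-utilized to under-utilized Datanodes. A typical invocation looks like the sketch below; the 10% threshold is the default, shown here as an assumption about your tolerance:

```shell
# Run the HDFS balancer. -threshold is the maximum allowed deviation
# (in percent) of each Datanode's utilization from the cluster average.
if command -v hdfs >/dev/null 2>&1; then
  sudo -u hdfs hdfs balancer -threshold 10
else
  echo "hdfs command not found; run this on a cluster node"
fi
```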
4. Decommissioning Datanodes
Even though HDFS is fault tolerant, it’s a bad idea to simply stop one or more Datanode daemons, or even to shut them down gracefully. The better approach is to decommission them: add the IP address of the Datanode machine you want to remove to the exclude file referenced by the dfs.hosts.exclude property, then run the command below.
sudo -u hdfs hdfs dfsadmin -refreshNodes
After this, the Namenode will start replicating that Datanode’s blocks to the other Datanodes in the cluster. Once the decommission process is complete, it’s safe to shut down the Datanode daemon. You can track the progress of the decommission process on the Namenode Web UI.
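For reference, a sketch of the relevant hdfs-site.xml entry; the exclude-file path shown is an example, not a required location — use whatever path your cluster already configures:

```xml
<!-- hdfs-site.xml: the path below is an example, not a required location -->
<property>
  <name>dfs.hosts.exclude</name>
  <value>/etc/hadoop/conf/dfs.exclude</value>
</property>
```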
4.1 For YARN:
Add the IP address of the NodeManager machine to the file specified by the yarn.resourcemanager.nodes.exclude-path property, then run the command below.
sudo -u yarn yarn rmadmin -refreshNodes
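The corresponding yarn-site.xml entry looks like the sketch below; again, the path is an example, not a required location:

```xml
<!-- yarn-site.xml: the path below is an example, not a required location -->
<property>
  <name>yarn.resourcemanager.nodes.exclude-path</name>
  <value>/etc/hadoop/conf/yarn.exclude</value>
</property>
```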
5. Datanode Volume Failures
The Namenode Web UI shows information about Datanode volume failures. We should check this information periodically, or set up some kind of automated monitoring using Nagios, Ambari Metrics (if you are using the Hortonworks Hadoop distribution), JMX monitoring (http://<namenode-host>:50070/jmx), etc. Multiple disk failures on a single Datanode can cause the Datanode daemon to shut down. (Please check the dfs.datanode.failed.volumes.tolerated property and set it accordingly in hdfs-site.xml.)
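As a sketch of how such a check could be scripted against the JMX endpoint: the host/port and the VolumeFailuresTotal metric under the FSNamesystem bean are assumptions you should verify against your own NameNode’s /jmx output:

```shell
# Sketch: alert if the NameNode reports any failed Datanode volumes.
# Host/port and metric name are assumptions; verify against your /jmx output.
NN_JMX="http://namenode-host:50070/jmx?qry=Hadoop:service=NameNode,name=FSNamesystem"
FAILURES=$(curl -s --max-time 10 "$NN_JMX" \
  | grep -o '"VolumeFailuresTotal" *: *[0-9]*' \
  | grep -o '[0-9]*$')
if [ -n "$FAILURES" ] && [ "$FAILURES" -gt 0 ]; then
  echo "WARNING: $FAILURES failed Datanode volume(s)"
fi
```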
6. Database Backups
If you have multiple Hadoop ecosystem components installed, you should schedule a backup script to take database dumps, for example:
1. Hive metastore database
2. Ambari DB
3. Ranger DB
Create a simple shell script with the backup commands and schedule it for a weekend; add logic to send an email once the backups are done.
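A minimal sketch of such a script, assuming the component databases are MySQL-backed and credentials come from a .my.cnf file; the database names and backup directory are placeholders:

```shell
#!/bin/bash
# Sketch: dump each component database to a dated file.
# Database names, backup directory and the MySQL assumption are
# placeholders -- adjust for your environment.
BACKUP_DIR="${BACKUP_DIR:-$HOME/db-backups}"
DATE=$(date +%d%m%Y)
mkdir -p "$BACKUP_DIR"
for DB in hive ambari ranger; do
  if command -v mysqldump >/dev/null 2>&1; then
    mysqldump --single-transaction "$DB" > "$BACKUP_DIR/${DB}_${DATE}.sql"
  fi
done
# Optionally notify the admin when done:
# mail -s "DB backups completed $DATE" admin@example.com < /dev/null
```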
7. HDFS Metadata backup
The fsimage holds the metadata of your Hadoop filesystem, and if it gets corrupted for some reason, your cluster is unusable. It’s therefore very important to keep periodic backups of the filesystem fsimage.
You can schedule a shell script that runs the command below to back up the fsimage:
hdfs dfsadmin -fetchImage fsimage.backup.ddmmyyyy
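A sketch of such a script; note that -fetchImage saves the most recent fsimage into a local directory, so the script below creates a dated one. The backup root is an assumption:

```shell
#!/bin/bash
# Sketch: back up the current fsimage. -fetchImage expects a local
# directory to save into; the backup root here is an assumption.
BACKUP_DIR="${BACKUP_DIR:-$HOME/fsimage-backups}"
DEST="$BACKUP_DIR/$(date +%d%m%Y)"
mkdir -p "$DEST"
if command -v hdfs >/dev/null 2>&1; then
  hdfs dfsadmin -fetchImage "$DEST"
fi
```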
8. Purging older log files
In production clusters, if we don’t clean up older Hadoop log files, they can eat your entire disk, and daemons can crash with a “no space left on device” error. Always get older log files cleaned up via a cleanup script!
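A sketch of such a cleanup, assuming rotated logs live under /var/log/hadoop and a 30-day retention; both are assumptions to adjust for your cluster:

```shell
#!/bin/bash
# Sketch: delete rotated Hadoop logs older than RETENTION_DAYS.
# The log directory and retention period are assumptions.
LOG_DIR="${LOG_DIR:-/var/log/hadoop}"
RETENTION_DAYS=30
if [ -d "$LOG_DIR" ]; then
  find "$LOG_DIR" -type f -name "*.log.*" -mtime +"$RETENTION_DAYS" -print -delete
fi
```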
Please comment if you have any feedback/questions/suggestions. Happy Hadooping!! :-)