Support Questions
Find answers, ask questions, and share your expertise

How to plan disaster recovery in Hadoop cluster


Super Collaborator
A Disaster Recovery strategy for Hadoop solution would be to set up another cluster that serves as the backup. With two clusters, there are two approaches to have synched data in both clusters:
  • Fork the ETL process to write to both clusters at ingest
  • Have one active cluster from which data is copied over to a backup cluster periodically
To move data from one cluster to another in a reliable manner, you can use the Hadoop distcp tool. Distcp runs a map reduce job to transfer data from one cluster to another. The job spawns multiple mappers only that do the copying - no reducers. Each file is copied over by a single mapper. The Distcp utility splits up the number of files that need to be copied, and equally distributes the responsibilty to copy files across the mappers spawned.

This ensures that the copying of data is done in an efficient parallelized manner.

If the two clusters are setup to run the same version of Hadoop, then distcp can use the hdfs protocol to transfer data directly from datanodes of one cluster to another cluster. The command to do this would look like this:
$ hadoop distcp hdfs://namenode1:8020/source/folder / hdfs://namenode2:8020/dest/folder

Distcp can be used to transfer data between clusters that run different versions. In that case, you would use the HTTP protocol (WebHDFS) to communicate to one of the clusters. The command would look like:
$ hadoop distcp hdfs://namenode1:8020/source/folder / webhdfs://namenode2:50070/dest/folder

Disaster Recovery in Hadoop cluster refers to the event of recovering all or most of your important data stored on a Hadoop Cluster in case of disasters like hardware failures,data loss ,applications error. There should be minimal or no downtime in cluster.

Disaster can be handled through various techniques :

1) Data loss must be preveneted by writing metadata stored on namenode to a different NFS mount. However High Availability introduced in the latest version of Hadoop is a disaster management technique.

2) HDFS snapshots can also be used in case of recovery.

3) You can enable Trash feature in case of accidental deletion because file deleted first goes to trash folder in HDFS.

4) Hadoop distcp tool can also be used for cluster data copying building a mirror cluster in case of any hardware failure

; ;