A disaster recovery strategy for a Hadoop solution is to set up a second cluster that serves as the backup. With two clusters, there are two approaches to keeping the data in both clusters in sync:
Distcp runs the copy as a MapReduce job. This ensures that the data is copied in an efficient, parallelized manner.
If the two clusters are set up to run the same version of Hadoop, distcp can use the hdfs protocol to transfer data directly from the datanodes of one cluster to the other cluster. The command to do this would look like this:
$ hadoop distcp hdfs://namenode1:8020/source/folder hdfs://namenode2:8020/dest/folder
Distcp can also be used to transfer data between clusters that run different versions of Hadoop. In that case, you would use the HTTP protocol (WebHDFS) to communicate with one of the clusters. The command would look like this:
$ hadoop distcp hdfs://namenode1:8020/source/folder webhdfs://namenode2:50070/dest/folder
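The choice between the two protocols can be captured in a small wrapper. The following is only a sketch: build_distcp_cmd is a hypothetical helper name, the host names and ports are the ones from the commands above, and the function prints the command rather than running it, so it can be checked before anything is copied.

```shell
#!/usr/bin/env bash
# Hypothetical helper: prints (does not run) a distcp command for a cluster pair.
# Arguments: <same_version: yes|no> <source_path> <dest_path>
# namenode1/namenode2 and the ports are taken from the examples above.
build_distcp_cmd() {
  local same_version="$1" src_path="$2" dest_path="$3"
  local src="hdfs://namenode1:8020${src_path}"
  local dest
  if [ "$same_version" = "yes" ]; then
    # Same Hadoop version on both sides: use the native hdfs protocol.
    dest="hdfs://namenode2:8020${dest_path}"
  else
    # Different versions: reach the other cluster over HTTP (WebHDFS).
    dest="webhdfs://namenode2:50070${dest_path}"
  fi
  echo "hadoop distcp ${src} ${dest}"
}
```

Calling build_distcp_cmd no /source/folder /dest/folder prints the WebHDFS form of the command. Once it looks right, the printed line can be run by hand on a node that has the Hadoop client installed.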
Disaster recovery in a Hadoop cluster refers to recovering all or most of the important data stored on the cluster after a disaster such as a hardware failure, data loss, or an application error. There should be minimal or no downtime in the cluster.
Disasters can be handled through various techniques:
1) Loss of metadata can be prevented by having the namenode also write its metadata to a separate NFS mount. High Availability, introduced in Hadoop 2, is a further disaster management technique.
2) HDFS snapshots can be used to recover data.
3) You can enable the Trash feature to guard against accidental deletion, because a deleted file first goes to a trash folder in HDFS.
4) The Hadoop distcp tool can be used to copy cluster data and build a mirror cluster in case of a hardware failure.
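The snapshot and Trash techniques in points 2 and 3 might look roughly like this on the command line. This is a sketch under assumptions: the function names, the directory, and the snapshot name are hypothetical, the Trash path shown is the default per-user location, and Trash only keeps deleted files when fs.trash.interval is set to a value greater than zero. With DRY_RUN=1 (the default here) the commands are printed instead of executed.

```shell
#!/usr/bin/env bash
# Sketch only: directory, snapshot, and user names passed in are hypothetical.
# With DRY_RUN=1 (the default here) commands are printed instead of executed.
DRY_RUN="${DRY_RUN:-1}"

run() {
  if [ "$DRY_RUN" = "1" ]; then
    echo "$*"
  else
    "$@"
  fi
}

# Allow snapshots on a directory (needs HDFS superuser rights), then take one.
protect_dir() {
  local dir="$1" snap="$2"
  run hdfs dfsadmin -allowSnapshot "$dir"
  run hdfs dfs -createSnapshot "$dir" "$snap"
}

# Copy a file back out of a snapshot; snapshots are exposed read-only
# under <dir>/.snapshot/<snapshot name>.
restore_from_snapshot() {
  local dir="$1" snap="$2" file="$3"
  run hdfs dfs -cp "$dir/.snapshot/$snap/$file" "$dir/$file"
}

# Move a deleted file back from the user's trash folder
# (assumes the default location /user/<user>/.Trash/Current).
restore_from_trash() {
  local user="$1" path="$2"
  run hdfs dfs -mv "/user/$user/.Trash/Current$path" "$path"
}
```

For example, protect_dir /data/important s1 prints the two commands that enable and create a snapshot named s1. Setting DRY_RUN=0 runs them against the cluster instead.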