
What is the best backup and recovery solution for a full Hortonworks production Hadoop deployment (not using third-party backup software)?



The Hadoop deployment has the following services:
HDFS, which sits on an Isilon (NAS storage)
Ambari Metrics

Rising Star


The article is very general and does not provide concrete, specific backup and recovery tools.


I did see this article, but it carries a disclaimer about production use:

Disclaimer: 1. This article is solely my personal take on disaster recovery in a Hadoop cluster

2. Disaster Recovery is specialized subject in itself. Do not Implement something based on this article in production until you have a good understanding on what you are implementing

Does Hortonworks have a formal document on backup and recovery of its Hadoop environments?

If so, where can I find it?

Many thanks

Expert Contributor

This info might be more helpful to guide you down the road of DR. With HDP in production, you must combine different technologies being offered and tailor these together as your own solution. I've read through many solutions, and the info below is the most critical in my opinion. Remember, preventing data loss is better than recovering from it!

Read these slides first:

1. VM Snapshots

  • If you're not using VMs, then switch over
  • Ambari nightly VM snapshots
  • NameNode VM snapshots

2. Lockdown critical directories:

fs.protected.directories - under the HDFS config in Ambari

Protect critical directories from deletion. Critical datasets can be deleted by accident, and such catastrophic errors should be avoided by adding appropriate protections. For example, the /user directory is the parent of all user-specific sub-directories. An attempt to delete the entire /user directory is very likely to be unintentional. To protect against accidental data loss, mark the /user directory as protected. This prevents attempts to delete it unless the directory is already empty.
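To make the setting a bit more concrete, here is a minimal sketch that checks whether a path appears in the comma-separated fs.protected.directories value. The helper name is_protected and the sample value are my own assumptions for illustration; only `hdfs getconf -confKey` is a stock Hadoop command, and it is shown in a comment so the sketch runs without a cluster.

```shell
#!/bin/sh
# Sketch: check whether a path is listed in fs.protected.directories.
# is_protected is a hypothetical helper, not part of Hadoop.

# Usage: is_protected PATH "dir1,dir2,..."
# Returns 0 if PATH exactly matches one of the comma-separated entries.
is_protected() {
    target=$1
    list=$2
    old_ifs=$IFS
    IFS=,
    for dir in $list; do
        if [ "$dir" = "$target" ]; then
            IFS=$old_ifs
            return 0
        fi
    done
    IFS=$old_ifs
    return 1
}

# On a cluster node the live value can be read with:
#   protected=$(hdfs getconf -confKey fs.protected.directories)
# A sample value (an assumption) is used here so the sketch runs anywhere.
protected="/user,/apps/hive/warehouse"

if is_protected /user "$protected"; then
    echo "/user is protected"
fi
```

The check is an exact string match on each entry, so a value written with spaces after the commas would need trimming first.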


3. Backups

Backups can be automated using tools like Apache Oozie and Apache Falcon (Falcon is being deprecated in HDP 3.0; switch to the workflow editor plus DistCp).



Using Snapshots

HDFS snapshots can be combined with DistCp to create the basis for an online backup solution. Because a snapshot is a read-only, point-in-time copy of the data, it can be used to back up files while HDFS is still actively serving application clients. Backups can even be automated using tools like Apache Falcon and Apache Oozie.
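A rough sketch of the snapshot-plus-DistCp idea, assuming a source directory, a backup cluster URI, and a date-based snapshot name that are all made up for illustration. By default it only prints the commands; set DRY_RUN=0 on a real cluster to run them. The allowSnapshot step needs HDFS superuser rights.

```shell
#!/bin/sh
# Sketch: snapshot a directory, then copy the snapshot to a backup cluster.
# SRC_DIR, BACKUP_URI and SNAP_NAME are hypothetical example values.
# Commands are only printed unless DRY_RUN=0 is set.

SRC_DIR=${SRC_DIR:-/tmp/important-dir}
BACKUP_URI=${BACKUP_URI:-hdfs://backup-nn:8020/backups/important-dir}
SNAP_NAME=${SNAP_NAME:-backup-$(date +%Y%m%d)}

# Print the command in dry-run mode, otherwise execute it.
run() {
    if [ "${DRY_RUN:-1}" = "1" ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

backup_with_snapshot() {
    # One-time step: an HDFS superuser marks the directory snapshottable.
    run hdfs dfsadmin -allowSnapshot "$SRC_DIR"
    # Take a read-only, point-in-time snapshot.
    run hdfs dfs -createSnapshot "$SRC_DIR" "$SNAP_NAME"
    # Copy from the frozen snapshot path, not from the live directory.
    run hadoop distcp -update "$SRC_DIR/.snapshot/$SNAP_NAME" "$BACKUP_URI"
}

backup_with_snapshot
```

Copying from the .snapshot path means the backup sees a consistent view even while clients keep writing to the live directory.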



“Accidentally” remove the important file:

sudo -u hdfs hdfs dfs -rm -r -skipTrash /tmp/important-dir/important-file.txt

Recover the file from the snapshot:

hdfs dfs -cp /tmp/important-dir/.snapshot/first-snapshot/important-file.txt /tmp/important-dir

hdfs dfs -cat /tmp/important-dir/important-file.txt

HDFS Snapshots Overview

A snapshot is a point-in-time, read-only image of the entire file system or a subtree of the file system.

HDFS snapshots are useful for:

  • Protection against user error: With snapshots, if a user accidentally deletes a file, the file can be restored from the latest snapshot that contains the file.
  • Backup: Files can be backed up using the snapshot image while the file system continues to serve HDFS clients.
  • Test and development: Files in an HDFS snapshot can be used to test new programs without affecting the HDFS file system that is concurrently supporting HDFS clients.
  • Disaster recovery: Snapshots can be replicated to a remote recovery site for disaster recovery.
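For the user-error case, a sketch of restoring a file from the newest snapshot that still contains it might look like the following. The helper restore_from_snapshot is hypothetical, and it assumes snapshot names sort chronologically (for example backup-20240131) and that paths contain no spaces. It only defines the function; nothing runs until you call it on a cluster node.

```shell
#!/bin/sh
# Sketch: restore a file from the newest snapshot that still contains it.
# restore_from_snapshot is a hypothetical helper, not part of Hadoop.
# Assumes snapshot names sort chronologically and paths have no spaces.

# Usage: restore_from_snapshot SNAPSHOTTABLE_DIR RELATIVE_FILE
# Example: restore_from_snapshot /tmp/important-dir important-file.txt
restore_from_snapshot() {
    dir=$1
    file=$2
    # Keep only directory lines from the listing, take the path column,
    # and sort in reverse so the newest snapshot name comes first.
    for snap in $(hdfs dfs -ls "$dir/.snapshot" | awk '/^d/ {print $NF}' | sort -r); do
        if hdfs dfs -test -e "$snap/$file"; then
            # Copy the file back into the live directory.
            hdfs dfs -cp "$snap/$file" "$dir/$file"
            return $?
        fi
    done
    echo "no snapshot of $dir contains $file" >&2
    return 1
}
```

The copy fails if the destination file already exists, which is usually what you want when recovering a deleted file.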

DistCp Overview

Hadoop DistCp (distributed copy) can be used to copy data between Hadoop clusters or within a Hadoop cluster. DistCp can copy just files from a directory or it can copy an entire directory hierarchy. It can also copy multiple source directories to a single target directory.

DistCp:


  • Uses MapReduce to implement its I/O load distribution, error handling, and reporting.
  • Has built-in support for multiple file system types. It can work with HDFS, Amazon S3, Cassandra, and others. DistCp also supports copying between different HDFS versions.
  • Can generate a significant workload on the cluster if a large volume of data is being transferred.
  • Has many command options. Use hadoop distcp -help to get online command help information.
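Since a large copy can load the cluster, here is a small sketch that builds a DistCp command line with the map count and per-map bandwidth capped. The helper distcp_cmd, the host names, and the limits are assumptions for illustration. The function only prints the command so it can be reviewed before being run.

```shell
#!/bin/sh
# Sketch: build a DistCp command line that caps its load on the cluster.
# distcp_cmd, the host names and the limits are hypothetical examples.

# Usage: distcp_cmd MAX_MAPS MB_PER_SEC_PER_MAP SOURCE... TARGET
distcp_cmd() {
    maps=$1
    bandwidth=$2
    shift 2
    # -m caps the number of simultaneous map tasks,
    # -bandwidth caps MB/s per map,
    # -pugp preserves user, group and permission on the copies.
    echo hadoop distcp -m "$maps" -bandwidth "$bandwidth" -pugp "$@"
}

# Two source directories copied into a single target directory.
distcp_cmd 10 50 \
    hdfs://prod-nn:8020/data/sales \
    hdfs://prod-nn:8020/data/logs \
    hdfs://backup-nn:8020/backups
```

Reasonable values for the two limits depend on the cluster and the network link, so treat the numbers above as placeholders.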



Interesting info. However, the second slide deck you pointed out, Hortonworks Operational Best Practice, is not directly related to the topic.
