Support Questions
Find answers, ask questions, and share your expertise

How can we use EBS snapshots to rebuild HDFS after a failure? Is there any recommended documentation?

Expert Contributor


This question has been forwarded to the HDFS team. I don't think we have anything in the HDP docs on EBS snapshots, but this Amazon forum post seems to be related:

Expert Contributor

Thank you @dhoyle

The question is a little short on detail, but I infer you are describing a Hadoop cluster running in EC2, with the datanodes and namenode using EBS volumes on each virtual server for storage, and with EBS snapshots of those volumes taken periodically for backup/disaster recovery purposes. That is the scenario I will address.

First, you could simply make the EBS volumes persistent. That way, they will survive most server crashes. If you take the effort to write scripts that associate the correct persistent EBS volumes with the correct VM instances, you'll have a system as recoverable (in a self-contained way) as a physical cluster. If you then use the usual 3x replication and deploy NameNode HA, you'll have an extremely robust data store that can self-recover from pretty much anything except Amazon suffering a data-center-scale disaster. This is without relying on EBS snapshots at all, and typically with much less downtime than a traditional backup-based recovery model.
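The volume-to-instance association scripts mentioned above can be sketched as a small matching step. This is a minimal illustration only, assuming a hypothetical "hdfs-node" tag that you would assign to both each instance and its volumes at provisioning time; the inventory dictionaries stand in for what you would fetch from the EC2 API.

```python
# Sketch: pair persistent EBS volumes with their HDFS node instances
# via a shared (hypothetical) "hdfs-node" tag applied at provisioning.

def match_volumes_to_instances(volumes, instances):
    """Return {instance_id: [volume_id, ...]} keyed by the shared tag."""
    instance_by_tag = {}
    for inst in instances:
        instance_by_tag[inst["tags"]["hdfs-node"]] = inst["id"]
    mapping = {}
    for vol in volumes:
        inst_id = instance_by_tag[vol["tags"]["hdfs-node"]]
        mapping.setdefault(inst_id, []).append(vol["id"])
    return mapping

# Example inventory, as it might come back from the EC2 API:
volumes = [
    {"id": "vol-aaa", "tags": {"hdfs-node": "dn1"}},
    {"id": "vol-bbb", "tags": {"hdfs-node": "dn1"}},
    {"id": "vol-ccc", "tags": {"hdfs-node": "nn"}},
]
instances = [
    {"id": "i-111", "tags": {"hdfs-node": "dn1"}},
    {"id": "i-222", "tags": {"hdfs-node": "nn"}},
]

print(match_volumes_to_instances(volumes, instances))
```

After computing the mapping, your recovery script would attach each volume to its instance (for example with the `aws ec2 attach-volume` CLI) before restarting the HDFS services.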

Perhaps you are trying to guard against data-center-scale disasters, or you wish to avoid the cost of 3x data duplication within EBS. It is true that EBS snapshots go to lower-cost S3 storage, but they are stored within the same region, so they will not assist with a region-wide disaster scenario. The bigger problem is that there's no expectation that the data in the snapshots will be a consistent image of HDFS across all the many EBS volumes. There's no synchronization mechanism, so many snapshots of many volumes, taken over a period of hours, will not produce a true picture of HDFS.

To make EBS snapshots a useful backup of HDFS, you would have to place HDFS into safemode, where reads are allowed but writes are not, and then force EBS snapshots of all the datanode and namenode storage volumes. This is actually feasible, because Amazon cleverly implemented two efficiencies in EBS snapshots:

  • snapshots are incremental, so they only take as long as needed to copy the changes since last snapshot;
  • snapshots' "point in time" is established very quickly, then the actual data copying happens asynchronously over however many hours are needed, while the changed blocks are protected by "copy on write" logic. It's just important to leave enough time for the snapshot to complete before invoking the next one.

This means you won't have to keep HDFS in maintenance mode for an unreasonably long time in order to make a snapshot. But you will have to do daily (or so) "pauses" in the data intake while you make the snapshots. As noted, data analysis (read-only activities) can continue during this time.

However, you'll be on your own for disaster recovery with this model of backup: you'll only be able to restore to the last "point in time", and doing so will require a full shutdown of the cluster. If you instead rely on HDFS's built-in robustness, with NameNode HA, 3x replication, and persistent EBS volumes, it will automatically recover from most failure scenarios, restoring everything up to the moment of failure (except perhaps data blocks actually being written at the instant of a crash), with little or no downtime.