
Snapshots, Backup and DR

Rising Star

I have some questions about HDFS snapshots, which can be used for backup and DR purposes.

  • How do snapshots help with Disaster Recovery? What are the best practices around using snapshots for DR purposes? I am especially trying to understand the cases of data stored directly on HDFS, Hive data, and HBase data.
  • Can a directory be deleted using hdfs dfs -rmr -skipTrash /data/snapshot-dir? Or do all the snapshots have to be deleted first, and snapshotting disabled, before the directory can be deleted?
  • As I understand it, no data is copied for snapshots; only metadata is maintained for the blocks added/modified/deleted. If that’s the case, what happens when the command hdfs dfs -rm /data/snapshot-dir/file1 is run? Will the file be moved to the trash? If so, will the snapshot maintain a reference to the entry in the trash? Will trash eviction have any impact in this case?
  • What happens when one of the sub-directories under the snapshot directory is deleted? For example, if the command hdfs dfs -rmr -skipTrash /data/snapshot-dir/sub-dir is run, can the data be recovered from snapshots?
  • Can snapshots be deleted/archived automatically based on policies, for example time-based ones? In the above example, how long will the sub-dir data be maintained in the snapshot?
  • How do snapshots work with HDFS quotas? For example, assume a directory with a quota of 1 GB and snapshotting enabled. Assume the directory is close to its full quota and a user deletes a large file to store some other dataset. Will the new data be allowed into the directory, or will the operation be stopped because the quota limits have been exceeded?

Apologies if some of the questions don’t make sense. I am still trying to understand these concepts at a ground level.

1 ACCEPTED SOLUTION

Super Guru
@bigdata.neophyte

Answers inline.

  • How do snapshots help with Disaster Recovery? What are the best practices around using snapshots for DR purposes? Especially for data stored directly on HDFS, Hive data, and HBase data.
    • Snapshots by themselves will not be the best option for DR.
  • Can a directory be deleted using hdfs dfs -rmr -skipTrash /data/snapshot-dir? Or do all the snapshots have to be deleted first, and snapshotting disabled, before the directory can be deleted?
    • No, it cannot be deleted directly. As long as a directory still has snapshots it cannot be deleted; the snapshots have to be removed first (see the sketch after this list).
  • As I understand it, no data is copied for snapshots; only metadata is maintained for the blocks added/modified/deleted. If that’s the case, what happens when the command hdfs dfs -rm /data/snapshot-dir/file1 is run? Will the file be moved to the trash? If so, will the snapshot maintain a reference to the entry in the trash? Will trash eviction have any impact in this case?
    • If a file is deleted from a snapshottable directory, it is still available under the .snapshot folder and you can copy the data back. For example:

      [hdfs@node1 ~]$ hdfs dfs -rm -r -skipTrash /tmp/test/anaconda-ks.cfg
      Deleted /tmp/test/anaconda-ks.cfg
      [hdfs@node1 ~]$ hadoop fs -ls /tmp/test/.snapshot/s20160526-022510.203
      -rw-r--r--   3 hdfs hdfs       1155 2016-05-26 02:23 /tmp/test/.snapshot/s20160526-022510.203/anaconda-ks.cfg
      [hdfs@node1 ~]$ hadoop fs -cp /tmp/test/.snapshot/s20160526-022510.203/anaconda-ks.cfg /tmp/test

    • After the copy, the data is back in place in the original file.
  • What happens when one of the sub-directories under the snapshot directory is deleted? For example, if the command hdfs dfs -rmr -skipTrash /data/snapshot-dir/sub-dir is run, can the data be recovered from snapshots?
    • If the subdirectory already existed when the snapshot was taken, then after deleting it the contents are still available under the ".snapshot" folder, e.g. /user/test1/.snapshot/s20160526-025323.341/subdir/ambari.properties.2.
    • If the subdirectory was created after the snapshot was taken, it will not be covered by that snapshot.
  • Can snapshots be deleted/archived automatically based on policies, for example time-based ones? In the above example, how long will the sub-dir data be maintained in the snapshot?
    • There is no built-in policy; you can create a custom script to delete/archive snapshots based on your policies. A snapshot is maintained until we delete it, so the sub-dir data will stay in the snapshot until that snapshot is deleted.
  • How do snapshots work with HDFS quotas? For example, assume a directory with a quota of 1 GB and snapshotting enabled. Assume the directory is close to its full quota and a user deletes a large file to store some other dataset. Will the new data be allowed into the directory, or will the operation be stopped because the quota limits have been exceeded?
    • It will allow you to exceed the mentioned quota; it will just give a warning.
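
As a rough sketch of the cleanup order described in the second answer above (the path and snapshot names here are hypothetical):

# List snapshottable directories and the snapshots they currently hold
hdfs lsSnapshottableDir
hdfs dfs -ls /data/snapshot-dir/.snapshot

# Delete every snapshot of the directory first
hdfs dfs -deleteSnapshot /data/snapshot-dir snap1
hdfs dfs -deleteSnapshot /data/snapshot-dir snap2

# Optionally disable snapshotting, then the directory itself can be removed
hdfs dfsadmin -disallowSnapshot /data/snapshot-dir
hdfs dfs -rm -r -skipTrash /data/snapshot-dir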



3 Replies


Wow, a TON of questions around Snapshots; I'll try to hit on most of them. Sounds like you might have already found these older posts on this topic, http://hortonworks.com/blog/snapshots-for-hdfs/ & http://hortonworks.com/blog/protecting-your-enterprise-data-with-hdfs-snapshots/.

For DR (getting data onto another cluster) you'll need to export these snapshots with a tool like distcp. As you go up into the Hive and HBase stacks, you have some other tools and options in addition to this. My recommendation is to open a dedicated HCC question for each after you do a little research, and we can all jump in to help with anything you don't understand.

As with all things, the best way to find out is to give it a try. As the next bit shows, you cannot delete a snapshot like "normal"; you have to use the special delete snapshot command.

[root@sandbox ~]# hdfs dfs -mkdir testsnaps
[root@sandbox ~]# hdfs dfs -put /etc/group testsnaps/
[root@sandbox ~]# hdfs dfs -ls testsnaps
Found 1 items
-rw-r--r--   3 root hdfs       1196 2016-05-25 14:18 testsnaps/group
[root@sandbox ~]# su - hdfs
[hdfs@sandbox ~]$ hdfs dfsadmin -allowSnapshot /user/root/testsnaps
Allowing snaphot on /user/root/testsnaps succeeded
[hdfs@sandbox ~]$ exit
logout
[root@sandbox ~]# hdfs dfs -createSnapshot /user/root/testsnaps snap1
Created snapshot /user/root/testsnaps/.snapshot/snap1
[root@sandbox ~]# hdfs dfs -ls testsnaps/.snapshot/snap1
Found 1 items
-rw-r--r--   3 root hdfs       1196 2016-05-25 14:18 testsnaps/.snapshot/snap1/group
[root@sandbox ~]# hdfs dfs -rmr -skipTrash /user/root/testsnaps/.snapshot/snap1
rmr: DEPRECATED: Please use 'rm -r' instead.
rmr: Modification on a read-only snapshot is disallowed
[root@sandbox ~]# hdfs dfs -deleteSnapshot /user/root/testsnaps snap1
[root@sandbox ~]# hdfs dfs -ls testsnaps/.snapshot
[root@sandbox ~]# 

There is no auto-delete of snapshots. The rule of thumb is that if you create them (likely with an automated process) then you need a complementary process to delete them, as you can clog up HDFS space if the data directory you are snapshotting actually does change.
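
A minimal sketch of such a cleanup process, assuming the default snapshot naming scheme (sYYYYMMDD-HHMMSS.mmm) and a hypothetical retention period of 7 days:

#!/bin/bash
# Hypothetical time-based snapshot cleanup: delete snapshots older than RETENTION_DAYS.
# Assumes snapshots use the default names of the form sYYYYMMDD-HHMMSS.mmm.
SNAP_DIR=/data/snapshot-dir
RETENTION_DAYS=7
cutoff=$(date -d "-${RETENTION_DAYS} days" +%Y%m%d)

hdfs dfs -ls "${SNAP_DIR}/.snapshot" | grep "${SNAP_DIR}/.snapshot/" | awk '{print $NF}' | while read -r path; do
  snap=$(basename "${path}")
  snap_date=${snap:1:8}                      # sYYYYMMDD-HHMMSS.mmm -> YYYYMMDD
  if [[ "${snap_date}" < "${cutoff}" ]]; then
    hdfs dfs -deleteSnapshot "${SNAP_DIR}" "${snap}"
  fi
done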

Snapshots should not adversely affect your quotas, with the exception I just called out: they hang on to HDFS space for items you have deleted from the actual directory, as long as at least one snapshot still points to them.

Have fun playing around with snapshots & good luck!


Contributor
Answers by @Sagar Shimpi and @Lester Martin look pretty good to me. Some further explanations:
  • How do snapshots help with Disaster Recovery? What are the best practices around using snapshots for DR purposes? Especially for data stored directly on HDFS, Hive data, and HBase data.

If you're using the current distcp for DR (i.e., using distcp to copy data from one cluster to your backup cluster), you have the option to use snapshots to do incremental backups and improve distcp performance/efficiency. More specifically, you can take snapshots on both the source and the backup cluster and use the -diff option of the distcp command. Then, instead of blindly copying all the data, distcp will first compute the difference between the given snapshots and only copy that difference to the backup cluster.
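
A rough sketch of that workflow (the NameNode addresses, paths, and snapshot names s1/s2 are hypothetical; both directories are assumed to already be snapshottable, and the target must still hold an unmodified snapshot with the same name as the "from" snapshot):

# Initial full copy, anchored by a snapshot of the same name on each side
hdfs dfs -createSnapshot /data/src s1
hadoop distcp hdfs://source-nn:8020/data/src/.snapshot/s1 hdfs://backup-nn:8020/data/dst
hdfs dfs -createSnapshot hdfs://backup-nn:8020/data/dst s1

# Later: take a new snapshot on the source and ship only the delta
hdfs dfs -createSnapshot /data/src s2
hadoop distcp -update -diff s1 s2 hdfs://source-nn:8020/data/src hdfs://backup-nn:8020/data/dst
hdfs dfs -createSnapshot hdfs://backup-nn:8020/data/dst s2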

  • As I understand it, no data is copied for snapshots; only metadata is maintained for the blocks added/modified/deleted. If that’s the case, what happens when the command hdfs dfs -rm /data/snapshot-dir/file1 is run? Will the file be moved to the trash? If so, will the snapshot maintain a reference to the entry in the trash? Will trash eviction have any impact in this case?

Yes, if you have not skipped the trash, the file will be moved to the trash, and in the meantime you can still access the file using the corresponding snapshot path.
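
For example (hypothetical paths and snapshot name; this assumes the trash is enabled via fs.trash.interval):

hdfs dfs -createSnapshot /data/snapshot-dir snap1
hdfs dfs -rm /data/snapshot-dir/file1
# The live copy moves into the current user's trash ...
hdfs dfs -ls /user/$(whoami)/.Trash/Current/data/snapshot-dir
# ... while the snapshot still serves the file independently of the trash,
# so trash eviction has no effect on the snapshot copy
hdfs dfs -cat /data/snapshot-dir/.snapshot/snap1/file1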

  • How do snapshots work with HDFS quotas? For example, assume a directory with a quota of 1 GB and snapshotting enabled. Assume the directory is close to its full quota and a user deletes a large file to store some other dataset. Will the new data be allowed into the directory, or will the operation be stopped because the quota limits have been exceeded?

No: if the file belongs to a snapshot (i.e., the file was created before a snapshot was taken), deleting it will not release quota. You may have to delete some old snapshots or increase your quota limit. Also, in some old Hadoop versions you may find that snapshots affect the namespace quota usage in a strange way, i.e., sometimes deleting a file can increase the quota usage. This has been fixed in the latest version of HDP.
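
One way to see this is with the quota counters (hypothetical directory):

# Columns: QUOTA  REM_QUOTA  SPACE_QUOTA  REM_SPACE_QUOTA  DIR_COUNT  FILE_COUNT  CONTENT_SIZE  PATHNAME
hdfs dfs -count -q /data/snapshot-dir
# Deleting a file that is still referenced by a snapshot will not increase
# REM_SPACE_QUOTA; the space is only released once the last snapshot
# referencing that file is deleted.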