Created on 08-19-2019 01:04 PM - edited 09-16-2022 01:45 AM
HDFS snapshots are a great way to back up important data on HDFS. They are easy to implement and help safeguard your data when a user or admin accidentally deletes it. In the article below, we'll walk through some simple examples of using snapshots and some of the gotchas to look out for when implementing them.
Part 1: Understanding Snapshots
First, let's create some files and directories for testing:
echo "Hello World" > file1.txt
echo "How are you" > file2.txt
echo "hdfs snapshots are great" > file3.txt
hdfs dfs -mkdir /tmp/snapshot_dir
hdfs dfs -mkdir /tmp/snapshot_dir/dir1
Next, let's put file1.txt in the directory:
hdfs dfs -put file1.txt /tmp/snapshot_dir/dir1
Creating a snapshot is a simple process. First, a user with superuser permissions must mark the directory as snapshottable. This step only enables snapshots for the directory; it does not create a snapshot.
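The command for this step is not shown above. A minimal sketch, assuming the snapshottable directory is /tmp/snapshot_dir/dir1 (the directory whose .snapshot folder is listed later in this article); the helper name allow_snapshots is hypothetical:

```shell
# Hypothetical helper: mark a directory as snapshottable.
# Must be run as an HDFS superuser. It does not create a snapshot by itself.
allow_snapshots() {
  hdfs dfsadmin -allowSnapshot "$1"
}
```

Calling allow_snapshots /tmp/snapshot_dir/dir1 runs hdfs dfsadmin -allowSnapshot /tmp/snapshot_dir/dir1.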
Next, let's take a look at the size of the directory and the file we loaded:
hdfs dfs -du -h /tmp/snapshot_dir
The output contains two numbers: the first is the size of the file, and the second is the space it consumes across all replicas. HDFS uses 3x replication by default, so the second number is three times the first.
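As a quick check of those two numbers, here is a small local sketch of the arithmetic. It assumes the default replication factor of 3 has not been changed on your cluster; the variable names are illustrative only:

```shell
# Logical size of file1.txt as created earlier: "Hello World" plus a newline.
logical_bytes=$(printf 'Hello World\n' | wc -c | tr -d ' ')

# Assumption: default HDFS replication factor.
replication=3

# Space consumed across all replicas.
raw_bytes=$((logical_bytes * replication))

echo "$logical_bytes $raw_bytes"
```

For this file, the two values work out to 12 and 36 bytes.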
Next, let's create the snapshot. To do this we'll need to identify the directory we want to snapshot as well as a name for the snapshot. I'd recommend a date-based name so you can easily keep track of when the snapshot was taken. Note that a snapshot covers all the files and directories under the directory it is taken in. Also, you can't nest snapshottable directories (a directory can't be made snapshottable if one of its ancestors or descendants already is), so be judicious in your selection.
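The snapshot command itself is not shown above. A minimal sketch, assuming the same /tmp/snapshot_dir/dir1 directory; the helper names and the snap_YYYYMMDD naming scheme are hypothetical, chosen only to illustrate a date-based name:

```shell
# Hypothetical helper: build a date-stamped snapshot name such as snap_20190819.
snapshot_name() {
  echo "snap_$(date +%Y%m%d)"
}

# Hypothetical helper: snapshot a snapshottable directory ($1) under the name ($2).
create_snapshot() {
  hdfs dfs -createSnapshot "$1" "$2"
}
```

For example, create_snapshot /tmp/snapshot_dir/dir1 "$(snapshot_name)" takes a snapshot named after today's date.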
Once that is completed, let's check the size of the directory again. As you can see, the directory size didn't increase. That's because snapshots are point-in-time logical backups of your data: the NameNode records a pointer that links the snapshot to the files' existing blocks on disk, so no data is copied. If a file is deleted, the NameNode drops its logical reference to that data, but the data physically remains. The snapshot acts as a secondary reference that can be used to recover the file and restore the NameNode reference. To test this, let's delete a file:
hdfs dfs -rm /tmp/snapshot_dir/dir1/file1.txt
First, make sure the file is removed:
hdfs dfs -ls /tmp/snapshot_dir/dir1/
Now check the directory size:
hdfs dfs -du -h /tmp/snapshot_dir
Notice that while the file size is 0 because the file no longer logically exists, the second number (the replicated size) is still populated. That's because, although the file doesn't logically exist, it is still physically present. This is important to remember: files deleted from a snapshotted directory aren't physically removed until every snapshot that references them is deleted. This can result in lots of "phantom" data in the system taking up valuable HDFS real estate.
Now let's restore the file. To do this, look in the hidden .snapshot directory that holds the individual snapshots. It is located in the directory where you took the snapshot:
hdfs dfs -ls /tmp/snapshot_dir/dir1/.snapshot
In here you should see the snapshot we took previously. Now we can copy the file out of the snapshot to restore it. Note that the -ptopax flag restores the file with the same timestamps, ownership, permissions, ACLs and XAttrs as the original.
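The copy command is not shown above. A minimal sketch, assuming the snapshot lives under /tmp/snapshot_dir/dir1/.snapshot; the helper name restore_from_snapshot and the snapshot name used in the example call are hypothetical:

```shell
# Hypothetical helper: restore one file from a snapshot into its original directory.
# $1 = snapshottable directory, $2 = snapshot name, $3 = file name
restore_from_snapshot() {
  # -ptopax preserves timestamps, ownership, permissions, ACLs and XAttrs.
  hdfs dfs -cp -ptopax "$1/.snapshot/$2/$3" "$1/"
}
```

For example, restore_from_snapshot /tmp/snapshot_dir/dir1 snap_20190819 file1.txt copies file1.txt back from a snapshot named snap_20190819; substitute the name you gave your own snapshot.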
The file sizes look good, but the disk usage is double what we'd expect: 72 bytes instead of 36. That's because, while we copied the file from the snapshot, the snapshot's copy of the file still remains on disk. We now have two copies of the same file: one referenced directly by the NameNode and the other by the snapshot.
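The 36 and 72 byte figures can be reproduced with simple arithmetic. A small sketch, assuming the 12-byte file1.txt from earlier and the default replication factor of 3; the variable names are illustrative only:

```shell
file_bytes=12    # logical size of file1.txt ("Hello World" plus a newline)
replication=3    # assumption: default HDFS replication factor
copies=2         # blocks held by the snapshot + blocks written by the restore copy

expected=$((file_bytes * replication))
actual=$((file_bytes * replication * copies))

echo "expected=$expected actual=$actual"
```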
To remove this phantom data, we must delete the snapshot:
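A minimal sketch of that step, assuming the snapshot was taken on /tmp/snapshot_dir/dir1; the helper name delete_snapshot and the snapshot name in the example call are hypothetical:

```shell
# Hypothetical helper: delete a named snapshot ($2) from a snapshottable directory ($1).
# Blocks referenced only by that snapshot become eligible for removal.
delete_snapshot() {
  hdfs dfs -deleteSnapshot "$1" "$2"
}
```

For example, delete_snapshot /tmp/snapshot_dir/dir1 snap_20190819 removes a snapshot named snap_20190819. Afterwards, running hdfs dfs -du -h /tmp/snapshot_dir again should show the replicated size back at the expected 36 bytes.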