HDFS Snapshots

 

HDFS Snapshots are a great way to back up important data on HDFS. They're easy to implement, and they help safeguard your data in cases where a user or admin accidentally deletes it. In the article below, we'll walk through some simple examples of using snapshots and some of the gotchas to look out for when implementing them.

 

Part 1: Understanding Snapshots

 

First, let's create some files and directories for testing:

 

 

echo "Hello World" > file1.txt
echo "How are you" > file2.txt
echo "hdfs snapshots are great" > file3.txt

hdfs dfs -mkdir /tmp/snapshot_dir
hdfs dfs -mkdir /tmp/snapshot_dir/dir1

 

 

 

Next, let's put file1.txt in the directory:

 

 

hdfs dfs -put file1.txt /tmp/snapshot_dir/dir1

 

 

 

Creating a snapshot is a really simple process. First, a user with superuser permissions must mark the directory as snapshottable. This enables snapshots on the directory, but it doesn't create any snapshots by itself.

 

 

hdfs dfsadmin -allowSnapshot /tmp/snapshot_dir/dir1
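
To confirm the directory is now snapshottable, you can list the snapshottable directories visible to the current user (a superuser sees all of them):

# Lists all snapshottable directories for the current user
hdfs lsSnapshottableDir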

 

 

 

Next, let's take a look at the size of the directory and the files we loaded:

 

 

hdfs dfs -du -h /tmp/snapshot_dir

 

 

 

[Screenshot: hdfs dfs -du -h output for /tmp/snapshot_dir]

 

The output contains two numbers: the first is the size of the file, and the second is the total space consumed after replication. HDFS replicates each block three times by default, so the second number is three times the first.
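
For example, with the 12-byte file1.txt we loaded and the default replication factor of 3, the output would look something like this (exact sizes depend on your files and replication settings):

12  36  /tmp/snapshot_dir/dir1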

 

Next, let's create the snapshot. To do this, we need to identify the directory we want to snapshot as well as a name for the snapshot. I'd recommend a date-based name like the example below to easily keep track of when the snapshot was taken. Note that a snapshot captures all files and subdirectories under the snapshotted directory. Also, you can't create nested snapshots (a snapshottable directory can't sit above or below another one), so be judicious in your selection.

 

 

hdfs dfs -createSnapshot /tmp/snapshot_dir/dir1 20190819v1
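
As a side note, if you omit the snapshot name, HDFS generates a default one from the current timestamp (in the form s'yyyyMMdd-HHmmss.SSS'):

# Creates a snapshot with an auto-generated, timestamp-based name
hdfs dfs -createSnapshot /tmp/snapshot_dir/dir1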

 

 

 

Once that is completed, let's check the size of the directory again. As you can see, the directory size didn't increase. That's because snapshots are point-in-time logical backups of your data: the NameNode records a reference that links the snapshot to the files on disk. If a deletion happens, the NameNode drops its logical reference to that data, but the data physically remains. The snapshot acts as a secondary reference that can be used to recover the files and restore the NameNode reference. To test this, let's delete a file:

 

 

 

hdfs dfs -rm /tmp/snapshot_dir/dir1/file1.txt
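
As an aside, if trash is enabled on your cluster, -rm moves the file into the user's .Trash directory instead of deleting it outright, and the trashed copy continues to consume space until it expires. To delete immediately, add -skipTrash:

# Bypasses the trash and removes the file right away
hdfs dfs -rm -skipTrash /tmp/snapshot_dir/dir1/file1.txt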

 

 

 

First, make sure the file is removed:

 

 

hdfs dfs -ls /tmp/snapshot_dir/dir1/

 

 

 

Now check the directory size:

 

 

hdfs dfs -du -h /tmp/snapshot_dir

 

 

 

[Screenshot: hdfs dfs -du -h output after deleting file1.txt]

 

Notice that while the file size is 0, because the file no longer logically exists, the second number (the replicated size) is still populated. That's because while the file doesn't logically exist, it is still physically present. This is important to remember: files captured in a snapshot aren't physically deleted until the snapshot holding them is deleted. This can result in lots of "phantom" data taking up valuable HDFS real estate.

 

Now let's restore a file. In this process, you navigate to the hidden .snapshot folder that holds the individual snapshots. It's located at the root of the directory where you enabled snapshots:

 

 

hdfs dfs -ls /tmp/snapshot_dir/dir1/.snapshot
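
The output should show one entry per snapshot, something like this (ownership and timestamps will vary):

drwxr-xr-x   - hdfs hdfs          0 2019-08-19 15:49 /tmp/snapshot_dir/dir1/.snapshot/20190819v1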

 

 

 

In here, you should see the snapshot we took previously. Now we can copy the file out of the snapshot to restore it. Note that the -ptopax flag restores the file with the same timestamps, ownership, permissions, ACLs, and XAttrs as the original.

 

 

hdfs dfs -cp -ptopax /tmp/snapshot_dir/dir1/.snapshot/20190819v1/file1.txt /tmp/snapshot_dir/dir1

 

 

 

Now that the data has been copied, let's take a look at the directory sizes:

 

 

hdfs dfs -du -h /tmp/snapshot_dir
hdfs dfs -ls /tmp/snapshot_dir/dir1/

 

 

 

The file sizes look good, but the system sizes are off the charts. We expect 36 bytes, but it comes in at 72. That's because while we copied the file out of the snapshot, the snapshot's copy of the file still remains on disk... so now we have two copies of the same data: one referenced directly by the NameNode and the other by the snapshot.
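
You can see this divergence between a snapshot and the live directory with the snapshotDiff command, where "." refers to the current state:

# Reports files created, deleted, renamed, or modified since the snapshot
hdfs snapshotDiff /tmp/snapshot_dir/dir1 20190819v1 .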

 

[Screenshot: hdfs dfs -du -h output showing the doubled replicated size]

 

To remove this phantom data, we must delete the snapshot:

 

 

hdfs dfs -deleteSnapshot /tmp/snapshot_dir/dir1 20190819v1

 

 

 

With the snapshot deleted, take a look at the size of the directory:

 

 

hdfs dfs -du -h /tmp/snapshot_dir

 

 

 

[Screenshot: hdfs dfs -du -h output. Phantom data no more!]


Everything is back in order and has the expected number of bytes.
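
If you're finished experimenting, you can also disable snapshots on the directory entirely. Note that this command fails unless all snapshots of the directory have been deleted first:

# Re-disables snapshot creation on the directory
hdfs dfsadmin -disallowSnapshot /tmp/snapshot_dir/dir1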

 

Look for Part II, where we'll examine what to look out for when managing multiple snapshots.
