HDFS Snapshots Basics Part II

jtaras · ‎08-19-2019

HDFS Snapshots

In Part 1 we looked at the basics of HDFS snapshots. In this part we'll look at managing multiple snapshots of a given directory and what to watch out for, primarily the phantom data that lingers when you delete a dataset that one or more snapshots still reference.

Working With Multiple Snapshots

First, let's take a new snapshot of the directory we were previously working on:

hdfs dfs -createSnapshot /tmp/snapshot_dir/dir1 20190819v1

Next, let's add another file to the same base directory that holds file1.txt:

hdfs dfs -put file2.txt /tmp/snapshot_dir/dir1

Let's pretend that 8 hours later you take another snapshot of the same directory. Here we name the snapshot v2:

hdfs dfs -createSnapshot /tmp/snapshot_dir/dir1 20190819v2
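
Date-stamped names such as 20190819v2 are easy to generate in a script. Here is a minimal sketch, assuming the naming convention used in this article (the current date as YYYYMMDD, then "v" and a sequence number). The helper name snapshot_name is hypothetical, and the command is only printed, not executed:

```shell
# Hypothetical helper: build a snapshot name like 20190819v2
# from today's date and a sequence number passed as $1.
snapshot_name() {
  echo "$(date +%Y%m%d)v$1"
}

# Print the command rather than running it, so the sketch works without a cluster.
echo "hdfs dfs -createSnapshot /tmp/snapshot_dir/dir1 $(snapshot_name 2)"
```

If you leave the name off, HDFS should generate a timestamp-based default name, so explicit names mainly help with readability.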

Let's pause and take a look at the size of the directory:

hdfs dfs -du -h /tmp/snapshot_dir

Screen Shot 2019-08-19 at 4.58.02 PM.png

Now let's delete file1.txt:

hdfs dfs -rm /tmp/snapshot_dir/dir1/file1.txt

Take a look at the directory size now:

hdfs dfs -du -h /tmp/snapshot_dir

Screen Shot 2019-08-19 at 4.23.09 PM.png

Only one file (file2.txt) remains in the directory, but the second number still accounts for the physical data of file1.txt, because the snapshots still reference it. Those are the phantom files!

Let's go a little bit further and load another file and take another snapshot. Let's pretend it's 8 hours later and we name the snapshot v3:

hdfs dfs -put file3.txt /tmp/snapshot_dir/dir1
hdfs dfs -createSnapshot /tmp/snapshot_dir/dir1 20190819v3

Next take a look at the directory sizes for reference:

hdfs dfs -du -h /tmp/snapshot_dir
hdfs dfs -du -h /tmp/snapshot_dir/dir1

Screen Shot 2019-08-19 at 4.43.00 PM.png

Screen Shot 2019-08-19 at 4.44.19 PM.png

Note how the second number (the size with replication) in both images above still includes phantom data. The two files currently in the directory have a combined size (with replication) of 111 bytes. The other 36 bytes are phantom data: file1.txt was deleted, but the snapshots taken before the deletion are still live and still reference it. This data will remain until the last snapshot that references the file is deleted.
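
The arithmetic above can be sketched as a tiny helper. This is only an illustration: phantom_bytes is a hypothetical name, and it expects raw byte counts (so run du without -h if you feed it real output).

```shell
# Hypothetical helper: bytes held only by snapshots.
#   $1 = size of the live files with replication, in bytes
#   $2 = space consumed as shown in the second du column, in bytes
phantom_bytes() {
  echo $(( $2 - $1 ))
}

# Figures from this walkthrough: 111 bytes of live data (with replication)
# plus 36 phantom bytes gives 147 bytes consumed in total.
phantom_bytes 111 147
# On Hadoop versions whose du supports the -x flag, "hdfs dfs -du -x <path>"
# excludes snapshots from the calculation and can supply the live figure.
```

A non-zero result means some space is being held only by snapshots.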

Next, let's take a look at the snapshots that we recently created (v2 and v3), along with v1:

hdfs dfs -ls /tmp/snapshot_dir/dir1/.snapshot
hdfs dfs -ls /tmp/snapshot_dir/dir1/.snapshot/20190819v2
hdfs dfs -ls /tmp/snapshot_dir/dir1/.snapshot/20190819v3
hdfs dfs -ls /tmp/snapshot_dir/dir1/.snapshot/20190819v1

Screen Shot 2019-08-19 at 4.48.28 PM.png

Notice how the v2 snapshot also has a reference to file1.txt. Even if we delete the v1 snapshot, the phantom data will still remain. To test this, let's delete the v1 snapshot:

hdfs dfs -deleteSnapshot /tmp/snapshot_dir/dir1 20190819v1

Screen Shot 2019-08-19 at 4.50.51 PM.png

As expected, deleting the first snapshot didn't help delete the phantom data. Now, let's delete the v2 snapshot and see what happens:

hdfs dfs -deleteSnapshot /tmp/snapshot_dir/dir1 20190819v2

Screen Shot 2019-08-19 at 4.51.05 PM.png

As you can see, the phantom data is gone. The last snapshot holding a reference to it was deleted, and the data went with it.
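
Before deleting snapshots to reclaim space, it helps to know which ones still hold the file. Here is a rough sketch that scans a recursive listing of the .snapshot directory for a file name and prints the snapshots containing it. The function name snapshots_holding is hypothetical, the sample listing is made up to mirror this walkthrough, and paths containing spaces are not handled.

```shell
# Hypothetical helper: read "hdfs dfs -ls -R" style lines on stdin and
# print the name of every snapshot that contains the file named in $1.
snapshots_holding() {
  awk -v target="$1" '{
    n = split($NF, parts, "/")          # last field is the full path
    if (parts[n] != target) next        # keep only lines for the target file
    for (i = 1; i < n; i++)
      if (parts[i] == ".snapshot") { print parts[i + 1]; break }
  }'
}

# Illustrative listing (owner, sizes and times are invented) as it might
# have looked before any snapshot was deleted.
snapshots_holding file1.txt <<'EOF'
-rw-r--r--   3 hdfs hdfs         12 2019-08-19 16:20 /tmp/snapshot_dir/dir1/.snapshot/20190819v1/file1.txt
-rw-r--r--   3 hdfs hdfs         12 2019-08-19 16:20 /tmp/snapshot_dir/dir1/.snapshot/20190819v2/file1.txt
-rw-r--r--   3 hdfs hdfs         18 2019-08-19 16:21 /tmp/snapshot_dir/dir1/.snapshot/20190819v2/file2.txt
-rw-r--r--   3 hdfs hdfs         18 2019-08-19 16:21 /tmp/snapshot_dir/dir1/.snapshot/20190819v3/file2.txt
-rw-r--r--   3 hdfs hdfs         19 2019-08-19 16:40 /tmp/snapshot_dir/dir1/.snapshot/20190819v3/file3.txt
EOF
```

On a real cluster the input would come from something like "hdfs dfs -ls -R /tmp/snapshot_dir/dir1/.snapshot" piped into the function. For the sample listing it reports 20190819v1 and 20190819v2, the two snapshots taken before file1.txt was deleted.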

Conclusion

HDFS snapshots can be a very powerful tool, but one must exercise caution when using them. Protection from data deletion is valuable, but it comes at a cost: deleted data keeps consuming space for as long as any snapshot references it. If you implement HDFS snapshotting, you must create a management framework for keeping track of snapshots to ensure proper HDFS space utilization.
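
As one possible starting point for such a framework, here is a dry-run retention sketch. It assumes snapshot names begin with a YYYYMMDD date, as in this article, and only prints the delete commands for snapshots older than a cutoff date. The function name expired_snapshot_commands and the snapshot 20190801v1 are hypothetical.

```shell
# Hypothetical helper: read snapshot names on stdin and print a
# deleteSnapshot command for each one dated before the cutoff.
#   $1 = snapshottable directory
#   $2 = cutoff date as YYYYMMDD
expired_snapshot_commands() {
  dir=$1
  cutoff=$2
  while IFS= read -r name; do
    stamp=$(printf '%s' "$name" | cut -c1-8)
    # Skip names that do not start with eight digits.
    case $stamp in
      [0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9]) ;;
      *) continue ;;
    esac
    if [ "$stamp" -lt "$cutoff" ]; then
      echo "hdfs dfs -deleteSnapshot $dir $name"
    fi
  done
}

# Dry run: only the hypothetical 20190801v1 predates the cutoff.
printf '%s\n' 20190801v1 20190819v2 20190819v3 |
  expired_snapshot_commands /tmp/snapshot_dir/dir1 20190815
```

Printing the commands instead of running them leaves room to review what would be deleted before any space is actually reclaimed.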

kwabstian53 · ‎08-28-2019

Nice piece. Can you snapshot to a location outside the repository?
