
HDFS Snapshots


In Part 1 we looked at the basics of HDFS snapshots. In this section, we'll look at managing multiple snapshots in a given directory and what to watch out for: primarily phantom data, which lingers when you delete a dataset that still has one or more snapshots linked to it.


Working With Multiple Snapshots


First, let's take a new snapshot of the directory we were working on in Part 1 (recall that the directory must already be marked snapshottable with hdfs dfsadmin -allowSnapshot):



hdfs dfs -createSnapshot /tmp/snapshot_dir/dir1 20190819v1




Next, let's add another file to the same base directory that already holds file1.txt:




hdfs dfs -put file2.txt /tmp/snapshot_dir/dir1




Let's pretend that 8 hours later, you take another snapshot of the same directory. Here we name the snapshot v2:




hdfs dfs -createSnapshot /tmp/snapshot_dir/dir1 20190819v2




Let's pause and take a look at the size of the directory:




hdfs dfs -du -h /tmp/snapshot_dir




[Screenshot: output of hdfs dfs -du -h /tmp/snapshot_dir]


Now let's delete file1.txt. (If HDFS trash is enabled, a plain -rm first moves the file to your .Trash directory, which also keeps its bytes alive until the trash is emptied; add -skipTrash if you want to see the snapshot-only retention immediately.)




hdfs dfs -rm /tmp/snapshot_dir/dir1/file1.txt




 Take a look at the directory size now:




hdfs dfs -du -h /tmp/snapshot_dir




[Screenshot: output of hdfs dfs -du -h /tmp/snapshot_dir after deleting file1.txt]


Only one file remains in the directory (file2.txt), but the deleted file's bytes still show up in the second (size-with-replication) number... those phantom files! The deleted file is still readable under the snapshot path (for example, /tmp/snapshot_dir/dir1/.snapshot/20190819v1/file1.txt) and could be restored from there with hdfs dfs -cp.


Let's go a little bit further: load another file and take another snapshot. Pretend it's 8 hours later again, and name the snapshot v3:




hdfs dfs -put file3.txt /tmp/snapshot_dir/dir1
hdfs dfs -createSnapshot /tmp/snapshot_dir/dir1 20190819v3




Next take a look at the directory sizes for reference:




hdfs dfs -du -h /tmp/snapshot_dir
hdfs dfs -du -h /tmp/snapshot_dir/dir1




[Screenshot: output of hdfs dfs -du -h /tmp/snapshot_dir]


[Screenshot: output of hdfs dfs -du -h /tmp/snapshot_dir/dir1]


Note how the size-with-replication column in both listings above still shows phantom data. The two files currently in the directory have a combined size (with replication) of 111 bytes; the other 36 bytes are phantom data from file1.txt, which is still live in the snapshots despite having been deleted. That data will remain UNTIL the last snapshot that references the file is deleted.
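The accounting can be sketched in shell (the 111- and 36-byte figures come from the listing above; treat them as illustrative):

```shell
# The second column of `hdfs dfs -du` is size-with-replication:
# live bytes plus any snapshot-only (phantom) bytes, each multiplied
# by the replication factor.
live_with_rep=111      # file2.txt + file3.txt, with replication
phantom_with_rep=36    # deleted file1.txt, still held by snapshots
echo $((live_with_rep + phantom_with_rep))   # prints 147
```

On recent Hadoop releases, `hdfs dfs -du` also accepts a `-x` flag that excludes snapshot data from the calculation, which is a quick way to separate live bytes from phantom bytes.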


Next, let's take a look at the contents of the snapshots we've created:



hdfs dfs -ls /tmp/snapshot_dir/dir1/.snapshot
hdfs dfs -ls /tmp/snapshot_dir/dir1/.snapshot/20190819v2
hdfs dfs -ls /tmp/snapshot_dir/dir1/.snapshot/20190819v3
hdfs dfs -ls /tmp/snapshot_dir/dir1/.snapshot/20190819v1




[Screenshot: listings of the .snapshot directory and the v1, v2, and v3 snapshots]
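Instead of listing each snapshot by hand, `hdfs snapshotDiff` reports what changed between two snapshots directly. A sketch against our directory (the comments describe the marker format used by the command; the sample filter at the end pulls out deletions):

```shell
# Report changes between the v2 and v3 snapshots of the directory.
# Markers in the output: + created, - deleted, M modified, R renamed.
hdfs snapshotDiff /tmp/snapshot_dir/dir1 20190819v2 20190819v3

# To list only the deletions (such as our file1.txt), filter on '-':
hdfs snapshotDiff /tmp/snapshot_dir/dir1 20190819v2 20190819v3 \
  | awk '$1 == "-" { print $2 }'
```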


Notice how the v2 snapshot also has a reference to file1.txt. That means that even if we delete the v1 snapshot, the phantom data will still remain. To test this, let's delete the v1 snapshot:




hdfs dfs -deleteSnapshot /tmp/snapshot_dir/dir1 20190819v1




[Screenshot: output of hdfs dfs -du -h after deleting the v1 snapshot]


As expected, deleting the first snapshot didn't remove the phantom data. Now let's delete the v2 snapshot and see what happens:




hdfs dfs -deleteSnapshot /tmp/snapshot_dir/dir1 20190819v2




[Screenshot: output of hdfs dfs -du -h after deleting the v2 snapshot]


As you can see, the phantom data is gone. The last snapshot holding a reference to it was deleted, and the data went with it.




HDFS snapshots can be a very powerful tool, but you must exercise caution when using them. Protection from accidental data deletion is great, but it comes at a cost: if you implement HDFS snapshotting, you must build a management framework to keep track of snapshots and ensure proper HDFS space utilization.
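A minimal sketch of such a framework, assuming snapshot names begin with a YYYYMMDD date as in this article (the directory path and retention window are hypothetical; adjust to your environment):

```shell
#!/bin/sh
# Hypothetical retention helper: delete snapshots older than KEEP_DAYS.
# Assumes snapshot names start with a YYYYMMDD date, e.g. 20190819v1.
DIR=/tmp/snapshot_dir/dir1
KEEP_DAYS=7
cutoff=$(date -d "$KEEP_DAYS days ago" +%Y%m%d)   # GNU date syntax

# Print the snapshot names whose leading date is older than the cutoff.
old_snapshots() {
  awk -v cutoff="$1" 'substr($0, 1, 8) < cutoff { print }'
}

# `hdfs dfs -ls` prints a "Found N items" header, so skip the first line,
# take the path column, and reduce it to the bare snapshot name.
hdfs dfs -ls "$DIR/.snapshot" | awk 'NR > 1 { print $NF }' | sed 's#.*/##' \
  | old_snapshots "$cutoff" \
  | while read -r snap; do
      hdfs dfs -deleteSnapshot "$DIR" "$snap"
    done
```

Running something like this on a schedule (cron, Oozie, etc.) keeps phantom data from accumulating unnoticed.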

