Created on 08-19-2019 02:02 PM - edited 09-16-2022 01:45 AM
HDFS Snapshots
In Part 1 we looked at the basics of HDFS snapshots. In the next section we'll look at what happens when managing multiple snapshots in a given directory and what to look out for: primarily phantom data that persists when you delete a dataset that has one or more snapshots linked to it.
Working With Multiple Snapshots
First, let's take a new snapshot of the directory we were previously working on:
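The exact commands are not reproduced in the text, so the following is only a sketch of the kind of HDFS CLI calls this walkthrough relies on. The directory path `/tmp/snapshot_demo` and the helper function names are hypothetical; the `hdfs dfs` subcommands themselves (`-createSnapshot`, `-ls` on the `.snapshot` directory, `-du`, `-deleteSnapshot`) are standard. The helpers only build and print the commands, they do not run them.

```python
import shlex

# Hypothetical snapshottable directory; substitute the one used in Part 1.
SNAPSHOT_DIR = "/tmp/snapshot_demo"


def create_snapshot_cmd(path, name):
    """Command to take a named snapshot of a snapshottable directory."""
    return ["hdfs", "dfs", "-createSnapshot", path, name]


def list_snapshot_cmd(path, name):
    """Command to list the files captured in one snapshot."""
    return ["hdfs", "dfs", "-ls", f"{path}/.snapshot/{name}"]


def disk_usage_cmd(path):
    """Command to show raw size and replicated space consumed under path."""
    return ["hdfs", "dfs", "-du", "-h", path]


def delete_snapshot_cmd(path, name):
    """Command to delete a named snapshot."""
    return ["hdfs", "dfs", "-deleteSnapshot", path, name]


if __name__ == "__main__":
    # Print the commands instead of executing them (dry run).
    for cmd in (
        create_snapshot_cmd(SNAPSHOT_DIR, "v2"),
        list_snapshot_cmd(SNAPSHOT_DIR, "v2"),
        disk_usage_cmd(SNAPSHOT_DIR),
        delete_snapshot_cmd(SNAPSHOT_DIR, "v1"),
    ):
        print(shlex.join(cmd))
```

To actually run one of these on a cluster, the argument list could be passed to `subprocess.run(cmd, check=True)` from a host with the HDFS client configured.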
Note how the replicated size in both images above still shows phantom data being present. The two files written have a combined size (with replication) of 111 bytes. The other 36 bytes are phantom data from the first snapshot, which is still live even though the underlying data was deleted. This data will remain UNTIL the last snapshot that references that file is deleted.
Next, let's take a look at the snapshots that we just created (v2 and v3):
Notice how the v2 snapshot also has a reference to file1.txt. Even if we delete the v1 snapshot, the phantom data will still remain. To test this, let's delete the v1 snapshot:
As you can see, the phantom data is gone. Once the last snapshot holding a reference to it was deleted, the data went with it.
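The accounting behind this behavior can be sketched with a toy model. This is not how the NameNode actually stores snapshots; it only illustrates the rule that a deleted file keeps consuming space until the last snapshot referencing it is removed. The replication factor of 3 (the HDFS default), the individual file sizes, the names `file2.txt` and `file3.txt`, and the order of operations are all assumptions chosen to reproduce the 111-byte and 36-byte figures above.

```python
class SnapshottableDir:
    """Toy model of snapshot space accounting (illustration only)."""

    def __init__(self, replication=3):
        self.replication = replication  # assumed HDFS default
        self.live = {}                  # file name -> raw size in bytes
        self.snapshots = {}             # snapshot name -> {file name: raw size}

    def put(self, name, size):
        self.live[name] = size

    def delete(self, name):
        del self.live[name]

    def create_snapshot(self, snap):
        # A snapshot records which files existed at that moment.
        self.snapshots[snap] = dict(self.live)

    def delete_snapshot(self, snap):
        del self.snapshots[snap]

    def space_consumed(self):
        # Every file that is live OR referenced by any snapshot takes space.
        # Simplification: files are identified by name only.
        referenced = {}
        for files in self.snapshots.values():
            referenced.update(files)
        referenced.update(self.live)
        return self.replication * sum(referenced.values())

    def phantom_bytes(self):
        # Space held only because a snapshot still references deleted files.
        live_bytes = self.replication * sum(self.live.values())
        return self.space_consumed() - live_bytes


def run_scenario():
    """Return (space_consumed, phantom_bytes) after each step."""
    d = SnapshottableDir()
    d.put("file1.txt", 12)          # hypothetical size: 12 * 3 = 36
    d.create_snapshot("v1")
    d.create_snapshot("v2")         # v2 also references file1.txt
    d.delete("file1.txt")
    d.put("file2.txt", 17)          # hypothetical sizes: (17 + 20) * 3 = 111
    d.put("file3.txt", 20)
    d.create_snapshot("v3")
    steps = [(d.space_consumed(), d.phantom_bytes())]
    d.delete_snapshot("v1")         # v2 still holds file1.txt
    steps.append((d.space_consumed(), d.phantom_bytes()))
    d.delete_snapshot("v2")         # last reference gone
    steps.append((d.space_consumed(), d.phantom_bytes()))
    return steps


if __name__ == "__main__":
    print(run_scenario())  # [(147, 36), (147, 36), (111, 0)]
```

In this model, deleting v1 alone changes nothing because v2 still references file1.txt; the 36 phantom bytes are released only when v2 is deleted as well.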
Conclusion
HDFS snapshots can be a very powerful tool, but one must exercise caution when using them. Protection from data deletion is valuable, but it comes at a cost. If you implement HDFS snapshotting, you must create a management framework for keeping track of snapshots to ensure proper HDFS space utilization.
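One possible starting point for such a framework is an age-based retention check. The sketch below is only an assumption about what that could look like: the snapshot inventory, its creation dates, and the 7-day limit are all hypothetical, and in practice the inventory would have to be collected from the cluster (for example by listing each directory's `.snapshot` folder).

```python
from datetime import datetime, timedelta


def expired_snapshots(snapshots, now, max_age_days):
    """Return names of snapshots older than max_age_days, oldest first.

    snapshots maps snapshot name -> creation time (datetime).
    """
    cutoff = now - timedelta(days=max_age_days)
    old = [(created, name) for name, created in snapshots.items() if created < cutoff]
    return [name for _, name in sorted(old)]


if __name__ == "__main__":
    # Hypothetical inventory for a single snapshottable directory.
    inventory = {
        "v1": datetime(2019, 8, 1),
        "v2": datetime(2019, 8, 10),
        "v3": datetime(2019, 8, 18),
    }
    now = datetime(2019, 8, 19)
    print(expired_snapshots(inventory, now, max_age_days=7))  # ['v1', 'v2']
```

Each name returned would be a candidate for `hdfs dfs -deleteSnapshot <path> <name>`. Because space is only reclaimed once the last referencing snapshot is gone, it may be worth reviewing candidates together rather than one at a time.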