
How do HDFS Snapshots work internally?

Contributor

Hi All,

I recently learned about HDFS Snapshots. I understand that an HDFS Snapshot is a read-only copy of the NameNode metadata, and that an accidentally deleted file can be recovered from a snapshot. Can someone please explain how HDFS Snapshots work internally, and whether there is any limit on how long after deletion a file can still be recovered from a snapshot? Assuming the snapshot was taken before the file was deleted, can I recover a file that was deleted a few weeks ago? If yes, what happens if the blocks of the deleted file have already been reused to store data for a new file before the recovery?

Please let me know if something is not clear.

1 ACCEPTED SOLUTION

Guru

Hi @Vinay R,

HDFS Snapshots are point-in-time copies of the filesystem, taken either on a directory or on the entire filesystem, depending on the administrator's preferences/policies. When you take a snapshot of a directory with the -createSnapshot command, the snapshot appears under a hidden ".snapshot" directory inside it (named with a timestamp by default, but you can supply your own name). The blocks of data referenced by the snapshot are then protected: the snapshot itself is read-only, so any subsequent delete commands alter only the metadata stored in the NameNode. Since the blocks are preserved, you can also use the snapshot to restore the data. There is no time limit on snapshots, so in your example you can recover files from a few weeks back, provided someone took a snapshot before any delete commands ran. There is, however, an upper limit on the number of simultaneous snapshots per snapshottable directory (though it is large at 65,536). When snapshots are being used, care should also be taken to clean up old ones, because the blocks they reference cannot be reclaimed and will keep consuming disk space.
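
To make that concrete, here is a minimal command-line sketch of the workflow; the path /data/important, the file report.csv, and the snapshot name s-before-cleanup are hypothetical, so adjust them for your cluster:

    # One-time: the admin marks the directory as snapshottable
    hdfs dfsadmin -allowSnapshot /data/important

    # Take a named snapshot (omit the name to get a timestamp-based default)
    hdfs dfs -createSnapshot /data/important s-before-cleanup

    # An accidental delete changes NameNode metadata only
    hdfs dfs -rm /data/important/report.csv

    # The file is still visible under the hidden .snapshot directory...
    hdfs dfs -ls /data/important/.snapshot/s-before-cleanup

    # ...and can be restored by copying it back out
    hdfs dfs -cp /data/important/.snapshot/s-before-cleanup/report.csv /data/important/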

Here are a couple of useful links on Snapshots that you may want to review:

http://hortonworks.com/hadoop-tutorial/using-hdfs-snapshots-protect-important-enterprise-datasets/

https://hadoop.apache.org/docs/r2.7.2/hadoop-project-dist/hadoop-hdfs/HdfsSnapshots.html

As always, if you find this post useful, don't forget to upvote and/or accept the answer.


7 REPLIES

Contributor

Hi @Sonu Sahi,

Thanks for the reply. Assume that:

1. My HDFS cluster is 95% full.

2. I have a requirement to ingest new data whose volume is 10% of the cluster's storage capacity.

3. I have taken an HDFS snapshot of the entire cluster and then deleted about 50% of the cluster data so that I can ingest the data mentioned in step 2.

I am not sure what exactly will happen when I ingest the new data (10% of the cluster's capacity) into HDFS after step 3.

Please let me know if something is not clear.

Contributor

Hi @Sonu Sahi,

Did you get a chance to see my comments above?

Regards...

Guru

Hi @Vinay R, thanks for the follow-up comment; I didn't see the update here on Mar 21.

Remember what I said above (and what is perhaps described more clearly in the tutorial I linked) about snapshots. When you take an HDFS snapshot, the blocks become protected (think read-only). The snapshot records what the NameNode state was at that point in time, but the blocks themselves remain in HDFS in a read-only state. Future deletes affect the NameNode metadata only, because the blocks are immutable until the snapshot is manually removed by the admin. Therefore, unless I'm still misunderstanding your scenario, step #3 in your description will not free up space for the 10% addition. Deleting 50% of the cluster data after taking a snapshot of the entire cluster results in NameNode transactions only, because the blocks remain on disk in a read-only state, so the ingest in step 2 would simply run out of space. If adding more disks/DataNodes is not an option, you may want to focus on investigating what is occupying the space. Hortonworks has made this a bit easier in the latest versions of HDP 2.6 / Ambari 2.5.x by adding an HDFS Top N feature to help cluster admins focus on areas of pressure on the NameNode.
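
To illustrate your scenario, here is a hedged sketch of what you would observe on the command line; /data, old_logs, and the snapshot name s20170321 are made-up examples:

    # Snapshot the directory, then delete half of the data under it
    hdfs dfs -createSnapshot /data s20170321
    hdfs dfs -rm -r -skipTrash /data/old_logs

    # "DFS Used" barely moves: the snapshot still holds the blocks
    hdfs dfsadmin -report | grep "DFS Used"

    # Only deleting the snapshot itself makes the blocks reclaimable
    hdfs dfs -deleteSnapshot /data s20170321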

Guru

As always, if you find this post useful, don't forget to accept and/or upvote the answer.

Contributor

@Sonu Sahi, thank you very much, this information is very helpful.

New Contributor

We need to be careful when taking snapshots of files. If the files are updated very frequently, keeping the snapshots puts pressure on the system; moreover, a snapshottable directory can accommodate at most 65,536 simultaneous snapshots. It makes good sense to snapshot only files/directories that seldom change.
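
One practical way to check whether a directory is "seldom changed" before snapshotting it regularly is to compare two snapshots with the snapshotDiff tool; the path and snapshot names below are hypothetical:

    # List directories that are already snapshottable
    hdfs lsSnapshottableDir

    # Compare two snapshots: lines marked +, -, R, M show files
    # created, deleted, renamed, and modified between them
    hdfs snapshotDiff /data/archive s-monday s-friday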