Is an HDFS snapshot applicable to a very large file, such as one of several TB?

1 ACCEPTED SOLUTION

Master Mentor

@Hamilton Castro

The simple and clear answer is "YES"!

HDFS snapshots are read-only point-in-time copies of the file system. They can be taken on a subtree of the file system or on the entire file system. Snapshots are valuable as a backup, or as a disaster recovery option in a business continuity plan.


A snapshot is often thought of as a full point-in-time [PIT] backup, but that is misleading: if you had a 5 TB file, the snapshot would not be 5 TB as well. An HDFS snapshot is not a full copy of the data; it is a copy of the metadata at that point in time. Blocks on the DataNodes are not copied: the snapshot files record only the block list and the file size. There is no data copying (more accurately, a new record is added to the inode). Only modifications made after the snapshot (appends and truncates, in HDFS) cause any additional data to be recorded.
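To make the "block list and file size" idea concrete, here is a minimal toy model in Python. It is only a sketch under stated assumptions: the class `ToyHdfs` and its methods are hypothetical names invented for illustration, every append creates a new block (real HDFS may fill the last block first), and nothing here is the actual NameNode implementation.

```python
class ToyHdfs:
    """Hypothetical in-memory stand-in for HDFS, for illustration only."""

    def __init__(self):
        self.blocks = {}      # block id -> size in bytes (plays the role of DataNode storage)
        self.files = {}       # path -> list of block ids (plays the role of the inode)
        self.snapshots = {}   # snapshot name -> {path: (block ids, file length)}
        self._next_id = 0

    def append(self, path, nbytes):
        # Simplification: each append allocates one new block.
        block_id = self._next_id
        self._next_id += 1
        self.blocks[block_id] = nbytes
        self.files.setdefault(path, []).append(block_id)

    def length(self, path):
        return sum(self.blocks[b] for b in self.files[path])

    def take_snapshot(self, name):
        # Only metadata is recorded: the block list and the length of each file.
        self.snapshots[name] = {
            path: (tuple(ids), self.length(path))
            for path, ids in self.files.items()
        }

    def stored_bytes(self):
        return sum(self.blocks.values())


TB = 10**12
fs = ToyHdfs()
for _ in range(5):
    fs.append("/data/big.bin", TB)    # a 5 TB file

before = fs.stored_bytes()
fs.take_snapshot("s1")
after = fs.stored_bytes()             # unchanged: no block was copied

fs.append("/data/big.bin", TB)        # a later append adds new data only
snap_blocks, snap_length = fs.snapshots["s1"]["/data/big.bin"]
print(before == after, snap_length == 5 * TB, fs.length("/data/big.bin") == 6 * TB)
```

In this model the snapshot still sees the 5 TB version after the append, while total storage grows only by the appended block.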


The snapshot data is computed by subtracting the modifications from the current data. The modifications are recorded in reverse chronological order, so that the current data can be accessed directly. Before snapshots can be taken, the HDFS directory has to be set as snapshottable. If there are snapshots in a snapshottable directory, the directory can be neither deleted nor renamed until all of its snapshots have been deleted.
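The "subtract the modifications from the current data" idea can be sketched as follows. This is a toy model, not HDFS code: `SnapshottableDir` and its methods are hypothetical names, and a file is reduced to a name and a length. The live state is always stored directly, and each snapshot only keeps the old values needed to undo later changes.

```python
_MISSING = object()  # marks "this name did not exist before the change"


class SnapshottableDir:
    """Hypothetical toy directory that keeps per-snapshot undo records."""

    def __init__(self):
        self.current = {}   # file name -> length; the live state, read directly
        self.diffs = []     # [(snapshot name, undo map)], oldest first

    def create_snapshot(self, name):
        # Nothing is copied up front in this toy; an empty diff is opened.
        self.diffs.append((name, {}))

    def _record_undo(self, fname):
        # Remember the pre-change value once per snapshot interval.
        if self.diffs:
            undo = self.diffs[-1][1]
            undo.setdefault(fname, self.current.get(fname, _MISSING))

    def write(self, fname, length):
        self._record_undo(fname)
        self.current[fname] = length

    def delete(self, fname):
        self._record_undo(fname)
        self.current.pop(fname, None)

    def view(self, snapshot_name):
        # Start from the current data and undo the modifications,
        # newest diff first, back to the requested snapshot.
        names = [n for n, _ in self.diffs]
        start = names.index(snapshot_name)
        state = dict(self.current)
        for _, undo in reversed(self.diffs[start:]):
            for fname, old in undo.items():
                if old is _MISSING:
                    state.pop(fname, None)
                else:
                    state[fname] = old
        return state


d = SnapshottableDir()
d.write("a", 10)
d.create_snapshot("s1")
d.write("a", 20)
d.write("b", 5)
d.create_snapshot("s2")
d.delete("a")
print(d.current, d.view("s2"), d.view("s1"))
```

Reading `d.current` never touches the diffs, which is the point of recording modifications newest-first: the common case (current data) stays cheap, and only snapshot reads pay for the subtraction.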


So when you first take a snapshot, your HDFS storage usage stays the same. Extra space is consumed only once you modify or delete the data, because the blocks still referenced by the snapshot are kept alongside whatever is newly written. When copying data between clusters or storage systems, copying a snapshotted file is no different from copying a regular file: both are copied the same way, bytes and metadata. There is no "copy only metadata" operation.
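One way to check the storage claim on a real cluster is to compare disk usage before and after taking a snapshot. The sketch below only builds the command lines; the directory `/data/big` and the snapshot name `s1` are hypothetical placeholders, and `hdfs dfsadmin -allowSnapshot` normally requires HDFS superuser privileges.

```python
import subprocess


def snapshot_check_commands(directory, snapshot_name):
    """Return the hdfs CLI invocations, as argv lists, in the order to run them."""
    return [
        ["hdfs", "dfsadmin", "-allowSnapshot", directory],            # mark directory snapshottable
        ["hdfs", "dfs", "-du", "-s", directory],                      # usage before the snapshot
        ["hdfs", "dfs", "-createSnapshot", directory, snapshot_name], # take the snapshot
        ["hdfs", "dfs", "-du", "-s", directory],                      # usage after the snapshot
    ]


def run_all(commands):
    # Only call this on a host with a configured Hadoop client.
    for argv in commands:
        result = subprocess.run(argv, check=True, capture_output=True, text=True)
        print(" ".join(argv))
        print(result.stdout)


# Hypothetical path and snapshot name, printed rather than executed.
for argv in snapshot_check_commands("/data/big", "s1"):
    print(" ".join(argv))
```

After the snapshot is created, its read-only contents are reachable under the `.snapshot` subdirectory of the snapshottable directory, for example `/data/big/.snapshot/s1` with the placeholder names above.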


3 REPLIES

Explorer

@Hamilton Castro

Are these snapshots from HBase stored in HDFS?


Thanks

Krishna


The above question and the entire reply thread below were originally posted in the Community Help track. On Sun Jun 30 17:30 UTC 2019, a member of the HCC moderation staff moved it to the Hadoop Core track. The Community Help track is intended for questions about using the HCC site itself, not technical questions about HDFS.

Bill Brooks, Community Moderator