
Is HDFS Snapshot applicable to a very large file, such as TBs?

Solved

New Contributor
 
1 ACCEPTED SOLUTION


Re: Is HDFS Snapshot applicable to a very large file, such as TBs?

Mentor

@Hamilton Castro

The simple and clear answer is "Yes"!

HDFS snapshots are read-only point-in-time copies of the file system. They can be taken at any level of the file system. Snapshots are valuable for backups and as a disaster-recovery option in business-continuity plans.


A snapshot can be thought of as a point-in-time (PIT) backup, but it is not a full copy of the data: if you snapshot 5 TB, the snapshot will not be another 5 TB. An HDFS snapshot is a copy of the metadata at that point in time. Blocks on the DataNodes are not copied; the snapshot files simply record the block list and the file size, so creating a snapshot adds only a new record in the inode. Data is retained for the snapshot only when the live files are later modified (appends and truncates, in HDFS).


The snapshot data is computed by subtracting the modifications from the current data. The modifications are recorded in reverse chronological order, so the current data can still be accessed directly. To take snapshots, the HDFS directory must first be made snapshottable. While a snapshottable directory contains snapshots, it can be neither deleted nor renamed.
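As a quick sketch of the workflow (the path /data/warehouse and the snapshot name are only examples; these commands require a running HDFS cluster):

```shell
# Allow snapshots on the directory (administrator command)
hdfs dfsadmin -allowSnapshot /data/warehouse

# Create a named snapshot; only metadata is recorded, so this is
# near-instant even for TB-scale data
hdfs dfs -createSnapshot /data/warehouse s20190630

# Snapshots appear under the hidden .snapshot directory
hdfs dfs -ls /data/warehouse/.snapshot

# Delete a snapshot when it is no longer needed
hdfs dfs -deleteSnapshot /data/warehouse s20190630
```

You can also compare two snapshots with `hdfs snapshotDiff` to see what changed between them.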


So when you first take a snapshot, your HDFS storage usage stays the same; extra space is consumed only when the snapshotted data is later modified. Note that when copying data between clusters or storage systems, copying a file from a snapshot is no different from copying a regular file: both transfer the full bytes along with the metadata. There is no "copy only metadata" operation.
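For example, copying from a snapshot to another cluster with DistCp (the cluster addresses and paths below are hypothetical) still transfers every byte, because the snapshot only froze the metadata:

```shell
# Copying from a snapshot path moves all the data, exactly as
# copying the live directory would; snapshots give a consistent
# source, not a cheaper copy.
hadoop distcp \
  hdfs://src-cluster:8020/data/warehouse/.snapshot/s20190630 \
  hdfs://dst-cluster:8020/backup/warehouse
```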

3 REPLIES

Re: Is HDFS Snapshot applicable to a very large file, such as TBs?

New Contributor

@Hamilton Castro

Are these snapshots from HBase stored in HDFS?


Thanks

Krishna


Re: Is HDFS Snapshot applicable to a very large file, such as TBs?

Community Manager

The question above and the entire reply thread below were originally posted in the Community Help track. On Sun Jun 30 17:30 UTC 2019, a member of the HCC moderation staff moved it to the Hadoop Core track. The Community Help track is intended for questions about using the HCC site itself, not technical questions about HDFS.