Created 06-28-2019 08:03 PM
Created 06-29-2019 06:48 PM
The simple and clear answer is "YES"!
HDFS snapshots are read-only point-in-time copies of the file system. They can be taken at any level of the file system. Snapshots are valuable as a backup, or as a disaster recovery option in business continuity plans.
A snapshot is often thought of as a full Point-in-Time [PIT] backup, which is misleading: if you had 5 TB of data, the snapshot would not be 5 TB in size. An HDFS snapshot is not a full copy of the data, but rather a copy of the metadata at that point in time. Blocks on the DataNodes are not copied: the snapshot files record the block list and the file size. There is no data copying (more accurately, a new record is added to the inode). Only later modifications (appends and truncates, for HDFS) cause any additional data to be recorded.
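To make the "metadata only" point concrete, here is a minimal toy model in Python. It is not how the NameNode is implemented; the class and method names (`ToyNamespace`, `create_snapshot`, `stored_bytes`) are made up for illustration, and every append is simplified to create a new block. It only shows that taking a snapshot records a block list and a length, without duplicating any block data.

```python
from dataclasses import dataclass, field


@dataclass
class ToyFile:
    """Hypothetical stand-in for a file inode: block IDs plus a length."""
    block_ids: list = field(default_factory=list)
    length: int = 0


class ToyNamespace:
    """Toy model only: block payloads live in one shared store."""

    def __init__(self):
        self.block_store = {}   # block_id -> bytes, never duplicated by snapshots
        self.files = {}         # path -> ToyFile
        self.snapshots = {}     # snapshot name -> {path: (block_ids, length)}
        self._next_block = 0

    def append(self, path, data):
        # Simplification: each append becomes one new block.
        f = self.files.setdefault(path, ToyFile())
        block_id = self._next_block
        self._next_block += 1
        self.block_store[block_id] = data
        f.block_ids.append(block_id)
        f.length += len(data)

    def create_snapshot(self, name):
        # Only metadata is recorded: the block list and file size per path.
        self.snapshots[name] = {
            path: (tuple(f.block_ids), f.length)
            for path, f in self.files.items()
        }

    def read_snapshot(self, name, path):
        block_ids, length = self.snapshots[name][path]
        return b"".join(self.block_store[b] for b in block_ids)[:length]

    def stored_bytes(self):
        return sum(len(b) for b in self.block_store.values())


ns = ToyNamespace()
ns.append("/data/a.log", b"hello ")
before = ns.stored_bytes()
ns.create_snapshot("s1")
after_snapshot = ns.stored_bytes()   # same as before: no block was copied
ns.append("/data/a.log", b"world")   # only the new data adds storage
snap_view = ns.read_snapshot("s1", "/data/a.log")
```

In this sketch `before` and `after_snapshot` are equal, and `snap_view` still holds the content as it was when `s1` was taken, even though the live file has grown.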
The snapshot data is computed by subtracting the modifications from the current data. The modifications are recorded in reverse chronological order, so that the current data can be accessed directly. To take snapshots, the HDFS directory has to be set as snapshottable. If there are snapshots in a snapshottable directory, the directory can be neither deleted nor renamed until all of its snapshots are deleted.
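The "subtract the modifications from the current data" idea can be sketched with another small Python toy. Again, this is an assumption-laden illustration, not HDFS code: `ToySnapshottableDir` and its methods are hypothetical names, files are reduced to name/content pairs, and the diffs are stored as simple undo maps that are walked newest first.

```python
_ABSENT = object()  # marker for "this name did not exist at snapshot time"


class ToySnapshottableDir:
    """Toy model: current data is stored directly, snapshots store undo records."""

    def __init__(self):
        self.current = {}          # name -> content, always directly accessible
        self.snapshottable = False
        self.diffs = []            # list of (snapshot_name, undo_map), oldest first

    def allow_snapshot(self):
        self.snapshottable = True

    def create_snapshot(self, name):
        if not self.snapshottable:
            raise PermissionError("directory is not snapshottable")
        self.diffs.append((name, {}))

    def _record(self, key):
        # Remember the pre-modification value once, in the newest snapshot's diff.
        if self.diffs:
            undo = self.diffs[-1][1]
            if key not in undo:
                undo[key] = self.current.get(key, _ABSENT)

    def write(self, key, value):
        self._record(key)
        self.current[key] = value

    def remove(self, key):
        self._record(key)
        self.current.pop(key, None)

    def view(self, name):
        # Start from the current data and undo modifications, newest diff first.
        state = dict(self.current)
        for snap_name, undo in reversed(self.diffs):
            for key, old in undo.items():
                if old is _ABSENT:
                    state.pop(key, None)
                else:
                    state[key] = old
            if snap_name == name:
                return state
        raise KeyError(name)

    def delete_dir(self):
        # Mirrors the rule that a directory with snapshots cannot be deleted.
        if self.diffs:
            raise PermissionError("directory has snapshots")
        self.current.clear()


d = ToySnapshottableDir()
d.allow_snapshot()
d.write("a.txt", "v1")
d.create_snapshot("s1")
d.write("a.txt", "v2")
d.write("b.txt", "new")
d.create_snapshot("s2")
d.remove("a.txt")
view_s1 = d.view("s1")   # {"a.txt": "v1"}
view_s2 = d.view("s2")   # {"a.txt": "v2", "b.txt": "new"}
```

Reading the live directory never touches the diffs, which is the point of keeping the current data directly accessible; only a snapshot read pays the cost of undoing modifications.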
So when you first take a snapshot, your HDFS storage usage stays the same. Additional storage is consumed only when you modify the data. When copying data between clusters or storage systems, copying a snapshotted file is no different from copying a regular file: both are copied the same way, with bytes and with metadata. There is no "copy only metadata" operation.
Created 06-28-2019 08:36 PM
Created 06-30-2019 05:31 PM
The above question and the entire reply thread below were originally posted in the Community Help track. On Sun Jun 30 17:30 UTC 2019, a member of the HCC moderation staff moved it to the Hadoop Core track. The Community Help track is intended for questions about using the HCC site itself, not technical questions about HDFS.