Does a snapshot occupy space in HDFS?

avatar
Contributor

Hi,

Can someone confirm whether enabling snapshots on a location takes up space in HDFS?

 

For example: /a/b/c is the only location in HDFS and occupies 9 TB after replication (3x).

It looks like:

 

3.0  9.0T /a/b/c

 

Question: After enabling snapshots on this location, will total HDFS utilization at the cluster level increase to 18.0 TB?

 

Regards


avatar

The snapshot will not occupy any storage space on disk or NameNode heap immediately after it is created.

 

However, any subsequent changes inside the snapshottable directory need to be tracked as deltas, and that can result in both higher disk space and higher NameNode heap usage. For example, if a file is deleted after a snapshot is taken, its blocks cannot be reclaimed because the file is still accessible through the snapshot path.
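To make the reply above concrete, here is a minimal toy model in Python. It is only a sketch of the accounting, not how the NameNode is implemented: the class `ToyNamespace`, its method names, and the 3 TB logical size (9 TB at 3x replication, as in the question) are all assumptions made for illustration.

```python
from dataclasses import dataclass, field

REPLICATION = 3  # replication factor taken from the example in the question


@dataclass
class ToyNamespace:
    """Toy model: files point at blocks, snapshots pin the blocks they saw."""
    block_sizes: dict = field(default_factory=dict)  # block id -> logical size (TB)
    live: dict = field(default_factory=dict)         # file name -> list of block ids
    snapshots: dict = field(default_factory=dict)    # snapshot name -> {file: block ids}
    next_block: int = 0

    def create(self, name, size_tb):
        # Every new file gets a fresh block in this model.
        block_id = self.next_block
        self.next_block += 1
        self.block_sizes[block_id] = size_tb
        self.live[name] = [block_id]

    def snapshot(self, snap_name):
        # Only references are recorded; no block data is copied.
        self.snapshots[snap_name] = {f: list(b) for f, b in self.live.items()}

    def delete(self, name):
        del self.live[name]

    def delete_snapshot(self, snap_name):
        del self.snapshots[snap_name]

    def disk_usage_tb(self, include_snapshots=True):
        # A block counts once, no matter how many paths reference it.
        referenced = {b for blocks in self.live.values() for b in blocks}
        if include_snapshots:
            for files in self.snapshots.values():
                for blocks in files.values():
                    referenced.update(blocks)
        return REPLICATION * sum(self.block_sizes[b] for b in referenced)


ns = ToyNamespace()
ns.create("c", 3)                       # 3 TB logical, 9 TB raw
usage_before_snapshot = ns.disk_usage_tb()
ns.snapshot("s1")
usage_after_snapshot = ns.disk_usage_tb()
ns.delete("c")                          # file removed from the live namespace
usage_after_delete = ns.disk_usage_tb()
live_only_after_delete = ns.disk_usage_tb(include_snapshots=False)
ns.delete_snapshot("s1")                # last reference gone, blocks reclaimable
usage_after_snapshot_delete = ns.disk_usage_tb()

print(usage_before_snapshot, usage_after_snapshot,
      usage_after_delete, live_only_after_delete,
      usage_after_snapshot_delete)
```

In this model, taking the snapshot leaves usage at 9 TB rather than 18 TB, and the space is only released once neither the live namespace nor any snapshot references the blocks.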

 

The hadoop fs -du shell command supports a -x option that calculates directory space usage excluding snapshots. The difference between the output with and without the -x option tells you how much disk space is being consumed by snapshots.
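The comparison described above could be scripted roughly as follows. This is a hedged sketch: it assumes the three-column summary output of `hadoop fs -du -s` (logical size, raw space consumed across replicas, path) printed in plain bytes, which may differ on older Hadoop releases. The helper names and the sample lines are hypothetical.

```python
import subprocess

TIB = 1024 ** 4


def parse_du_summary(line):
    """Parse one `hadoop fs -du -s` line into (size, consumed, path).

    Assumes three whitespace-separated columns with sizes in bytes (no -h).
    """
    size, consumed, path = line.split(None, 2)
    return int(size), int(consumed), path.strip()


def du_summary(path, exclude_snapshots=False):
    """Run `hadoop fs -du -s [-x] <path>` and parse its last output line."""
    cmd = ["hadoop", "fs", "-du", "-s"]
    if exclude_snapshots:
        cmd.append("-x")
    cmd.append(path)
    out = subprocess.run(cmd, check=True, capture_output=True, text=True).stdout
    return parse_du_summary(out.strip().splitlines()[-1])


def snapshot_overhead(with_snapshots, without_snapshots):
    """Return (logical, raw) bytes that are held only by snapshots."""
    return (with_snapshots[0] - without_snapshots[0],
            with_snapshots[1] - without_snapshots[1])


# Hypothetical sample lines, standing in for real command output.
sample_with = f"{4 * TIB}  {12 * TIB}  /a/b/c"     # du -s
sample_without = f"{3 * TIB}  {9 * TIB}  /a/b/c"   # du -s -x

overhead = snapshot_overhead(parse_du_summary(sample_with),
                             parse_du_summary(sample_without))
print(overhead)
```

On a real cluster, `du_summary("/a/b/c")` and `du_summary("/a/b/c", exclude_snapshots=True)` would replace the two sample lines.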

avatar
Contributor

Thanks Arpit,

 

To clarify, what magnitude of size are we talking about when you say "directory will need to be tracked as deltas and that can result in both higher disk space and NameNode heap usage"?

I'm assuming you mean just to store the metadata of the changed snapshot, which isn't significant given the actual size of the data held (in reference to my example above). If not, please clarify.

 

Regards

avatar
New Contributor

Hi Prav,

 

Deletes are typically what affects space the most in a folder with snapshots enabled. So if you replaced a file or performed a CTAS to refresh a table, the old files will still be there until the snapshot is deleted.

Cheers
Lovan

avatar

I'm assuming you mean just to store the metadata of the changed snapshot and which isn't significant given the actual size of data held(in reference to my example above)

Correct. However, the metadata is tracked in NameNode memory, which is a precious resource. The overhead can be significant in a large cluster with many files and millions of deltas.

avatar
Contributor

Thanks. To make sure we are on the same page, consider the scenario below:

The snapshottable HDFS location /a/b/ has a file c that has been captured in a snapshot. Suppose c is deleted from HDFS with hdfs dfs -rm -r -skipTrash (the NameNode transaction completes and the HDFS CLI no longer lists the file), and then a new file is created with the same name, content, and size.

 

- What gets stored in HDFS? What delta does the snapshot add in HDFS in this case?

   --> Is it just that the snapshot still holds the blocks of c in HDFS, in addition to the new file that was created?

   --> Are NameNode resources used to keep the metadata of both in heap?

Is this all, or is there more to it?

 

Regards
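The delete-and-recreate scenario above can be traced with a short Python sketch. It is a toy accounting model, not a description of NameNode internals. It relies on two facts: HDFS does not deduplicate identical content, and a newly written file gets new blocks. The block IDs, the 10 GB file size, and the function name are made up for illustration.

```python
REPLICATION = 3
FILE_SIZE_GB = 10  # hypothetical size of file c (one block, for simplicity)


def raw_usage_gb(live_blocks, snapshot_blocks):
    # Each distinct block is stored once per replica, however many paths see it.
    return REPLICATION * FILE_SIZE_GB * len(live_blocks | snapshot_blocks)


# c exists and has been captured by a snapshot: both views share one block.
live_blocks = {"blk_1"}
snapshot_blocks = {"blk_1"}
before_delete = raw_usage_gb(live_blocks, snapshot_blocks)

# c is removed with -skipTrash: the snapshot still references blk_1.
live_blocks = set()
after_delete = raw_usage_gb(live_blocks, snapshot_blocks)

# A new c with identical content is written: it gets a new block.
live_blocks = {"blk_2"}
after_recreate = raw_usage_gb(live_blocks, snapshot_blocks)

# Deleting the snapshot drops the last reference to blk_1.
snapshot_blocks = set()
after_snapshot_delete = raw_usage_gb(live_blocks, snapshot_blocks)

print(before_delete, after_delete, after_recreate, after_snapshot_delete)
```

Under these assumptions the old and new copies of c both occupy disk space until the snapshot is deleted, and the NameNode keeps metadata for both the snapshotted file and the new one during that time.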