Can someone confirm whether enabling snapshots on a location occupies space in HDFS?
For example: /a/b/c is the only location in HDFS and occupies 9 TB after replication (3x).
It looks like:
3.0 T  9.0 T  /a/b/c
Question: after enabling snapshots on this location, will total HDFS utilization at the cluster level increase to 18.0 TB?
The snapshot will not occupy any storage space on disk or NameNode heap immediately after it is created.
However, any subsequent changes inside the snapshottable directory need to be tracked as deltas, and that can result in both higher disk space and NameNode heap usage. For example, if a file is deleted after taking a snapshot, its blocks cannot be reclaimed because the file is still accessible through the snapshot path.
The hadoop fs -du shell command supports a -x option that allows calculating directory space usage excluding snapshots. The delta between the output with and without the -x option will tell you how much disk space is being consumed by the snapshot.
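As a rough sketch of that comparison, the Python below runs `hadoop fs -du -s` with and without `-x` and subtracts the two. The helper names (`parse_du_summary`, `du_bytes`, `snapshot_overhead_bytes`) are made up for illustration, and it assumes the first column of the `-du -s` output is the size in bytes before replication, so please check that against the output of your Hadoop version before relying on it.

```python
import subprocess


def parse_du_summary(output):
    """Return the first numeric column of `hadoop fs -du -s` output.

    Assumption: the first column is the logical size in bytes
    (i.e. -h was NOT passed, so there are no unit suffixes).
    """
    first_line = output.strip().splitlines()[0]
    return int(first_line.split()[0])


def du_bytes(path, exclude_snapshots=False):
    """Run `hadoop fs -du -s [-x] <path>` and return the reported size."""
    cmd = ["hadoop", "fs", "-du", "-s"]
    if exclude_snapshots:
        cmd.append("-x")  # -x excludes snapshots from the calculation
    cmd.append(path)
    result = subprocess.run(cmd, capture_output=True, text=True, check=True)
    return parse_du_summary(result.stdout)


def snapshot_overhead_bytes(path):
    """Space attributable to snapshots: usage with them minus usage without."""
    return du_bytes(path) - du_bytes(path, exclude_snapshots=True)
```

On a machine with the Hadoop client configured, `snapshot_overhead_bytes("/a/b/c")` would give the difference for the directory from the question. Right after the snapshot is created it should be zero or close to it.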
To clarify: what magnitude of size are we talking about when you mention "directory will need to be tracked as deltas and that can result in both higher disk space and NameNode heap usage"?
I'm assuming you mean just to store the metadata of the changed snapshot, which isn't significant given the actual size of data held (in reference to my example above). If not, please clarify.
Deletes are typically what affects space the most in a folder with snapshots enabled. For example, if you replace a file or perform a CTAS to refresh a table, the old files will still be there until the snapshot is deleted.
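To make the delete/replace case concrete, here is a toy Python model of the space accounting. It is not how HDFS is implemented: the `ToyHdfs` class, the 100-byte file size, and the snapshot name `s1` are all invented for illustration, and the model copies the whole path-to-block map per snapshot where the real NameNode records deltas. It only shows why blocks referenced by a snapshot keep counting toward disk usage.

```python
REPLICATION = 3  # assumed replication factor, as in the question


class ToyHdfs:
    """Toy model: a block is kept while the live tree OR a snapshot refers to it."""

    def __init__(self):
        self.live = {}       # path -> (block_id, size_bytes)
        self.snapshots = {}  # snapshot name -> copy of the live mapping
        self._next_block = 0

    def write(self, path, size_bytes):
        self._next_block += 1  # every write allocates a fresh block
        self.live[path] = (self._next_block, size_bytes)

    def delete(self, path):
        del self.live[path]  # removed from the live tree only

    def create_snapshot(self, name):
        self.snapshots[name] = dict(self.live)  # no new blocks are allocated

    def delete_snapshot(self, name):
        del self.snapshots[name]

    def disk_usage(self, exclude_snapshots=False):
        blocks = dict(self.live.values())  # block_id -> size_bytes
        if not exclude_snapshots:
            for snap in self.snapshots.values():
                blocks.update(dict(snap.values()))
        return sum(blocks.values()) * REPLICATION


fs = ToyHdfs()
fs.write("/a/b/c", 100)
before_snapshot = fs.disk_usage()

fs.create_snapshot("s1")
after_snapshot = fs.disk_usage()       # unchanged: snapshot shares the block

fs.delete("/a/b/c")
after_delete = fs.disk_usage()         # unchanged: block is held by s1

fs.write("/a/b/c", 100)                # replacement file gets a new block
after_rewrite = fs.disk_usage()        # doubled: old and new blocks both kept
live_only = fs.disk_usage(exclude_snapshots=True)

fs.delete_snapshot("s1")
after_snapshot_delete = fs.disk_usage()  # old block is reclaimable again

print(before_snapshot, after_snapshot, after_delete,
      after_rewrite, live_only, after_snapshot_delete)
```

In this model, usage only doubles once the file has been replaced while the snapshot still exists. Creating the snapshot by itself changes nothing.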
"I'm assuming you mean just to store the metadata of the changed snapshot, which isn't significant given the actual size of data held (in reference to my example above)"
Correct. However, the metadata is tracked in NameNode memory, which is a precious resource. The overhead can be significant in a large cluster with many files and millions of deltas.
Thanks. To make sure we are on the same page, consider the scenario below:
The snapshottable HDFS location /a/b/ contains a file c, which is captured in a snapshot. Suppose c is deleted from HDFS with the CLI command hdfs dfs -rm -r -skipTrash (the NameNode transaction completes and the HDFS CLI no longer lists the file), and then a new file is created with the same content, size, and name.
- What gets stored in HDFS? What delta does the snapshot add to HDFS in this case?
--> Is it just that the snapshot still holds the blocks of the original c in HDFS, in addition to the new file that was created?
--> Are NameNode resources used to maintain the metadata of both in heap?
Is this all, or is there more to it?