HBase snapshot disk usage




I use HBase snapshots for backups in my cluster, taking one every week so that I can recover from an HBase failure. However, something concerns me. I was under the impression that HBase snapshots stored only metadata without replicating any data, making them ideal for low-footprint backups. Yet after a short time (3+ weeks), a snapshot will often be exactly the same size as the source table while sharing 0% of its data with that table. This is a problem because keeping even a few weeks of backups can consume 25+ TB of space.

Can anybody explain to me why this happens, and whether there is any way to avoid it?


Re: HBase snapshot disk usage

"I was under the impression that HBase snapshots stored only metadata without replicating any data"

Your impression is only partly correct. Creating a snapshot is fast because it does not copy any data: a snapshot is just a reference to the list of files in HDFS that HBase is using at that moment. As you continue to write more data into HBase, compactions occur, which read the old files and write new ones.

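Normally, these old files are deleted after a compaction. However, because your snapshot still refers to them, HBase has to keep them for as long as the snapshot exists. Once compactions have rewritten every file the snapshot references, the snapshot shares no files with the live table and effectively holds a full copy of the table as it was when the snapshot was taken, which is what you are seeing after a few weeks. Deleting a snapshot you no longer need lets HBase remove the files that only that snapshot references. You can't get backups at no cost; eventually you have to pay for storing that data. Please make sure to (re)read the section on snapshots in the HBase book; it is very thorough and covers this topic in much more detail than I have here.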
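To make the mechanism above concrete, here is a small toy model in Python. It is only a sketch and does not use any HBase API: the class `ToyTable`, its methods, the file names, and the sizes are all hypothetical, and it assumes a major compaction rewrites every live file into one new file of the same total size (real compactions can also drop deleted or expired cells).

```python
from dataclasses import dataclass, field


@dataclass
class ToyTable:
    """Toy model of one table's store files. Hypothetical, not the HBase API."""
    live: dict = field(default_factory=dict)       # file name -> size in GB
    archive: dict = field(default_factory=dict)    # retired files kept for snapshots
    snapshots: dict = field(default_factory=dict)  # snapshot name -> set of file names
    counter: int = 0

    def _referenced(self):
        # Every file name that at least one snapshot still points at.
        return set().union(*self.snapshots.values())

    def flush(self, size_gb):
        # New writes land in a brand-new file.
        self.counter += 1
        self.live[f"hfile-{self.counter}"] = size_gb

    def snapshot(self, name):
        # Only references are recorded; no data is copied.
        self.snapshots[name] = frozenset(self.live)

    def major_compact(self):
        # Rewrite all live files into one new file, then retire the old ones.
        old = dict(self.live)
        self.live.clear()
        self.flush(sum(old.values()))
        referenced = self._referenced()
        for fname, size in old.items():
            if fname in referenced:      # a snapshot still needs it: keep it
                self.archive[fname] = size

    def delete_snapshot(self, name):
        # Dropping the snapshot releases files that nothing else references.
        del self.snapshots[name]
        referenced = self._referenced()
        self.archive = {f: s for f, s in self.archive.items() if f in referenced}

    def shared_fraction(self, name):
        # Fraction of the snapshot's bytes that the live table still uses.
        sizes = {**self.archive, **self.live}
        files = self.snapshots[name]
        total = sum(sizes[f] for f in files)
        shared = sum(sizes[f] for f in files if f in self.live)
        return shared / total if total else 1.0

    def disk_usage_gb(self):
        return sum(self.live.values()) + sum(self.archive.values())


table = ToyTable()
table.flush(100)
table.flush(50)
table.snapshot("week1")
shared_at_snapshot = table.shared_fraction("week1")  # all files still shared
usage_at_snapshot = table.disk_usage_gb()            # snapshot added no data

table.flush(10)
table.major_compact()
shared_after = table.shared_fraction("week1")        # nothing shared any more
usage_after = table.disk_usage_gb()                  # live table + retained files

table.delete_snapshot("week1")
usage_final = table.disk_usage_gb()                  # retained files released

print(shared_at_snapshot, usage_at_snapshot)
print(shared_after, usage_after)
print(usage_final)
```

In this model the snapshot costs nothing when it is taken, but after the compaction it pins 150 GB of old files next to the 160 GB live table, and that space only comes back once the snapshot is deleted. With several weekly snapshots that have each fully diverged, the retained files add up in the same way.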

Long-term, you can consider the incremental backup-and-restore work, which is ongoing, as an alternative to snapshots.
