I am using HBase snapshots for backups in my cluster. I take weekly snapshots to facilitate recovery from HBase failure. However, something concerns me. I was under the impression that HBase snapshots stored only metadata without replicating any data, making them ideal for low-footprint backups. Yet after a few weeks (3+), a snapshot will often be exactly the same size as the source table, sharing 0% of its data with the source table. This is a problem, since it means that keeping even a few weeks of backups can consume 25+ TB of space.
Can anybody explain to me why this happens, and if there is any way to avoid it?
"I was under the impression that HBase snapshots stored only metadata without replicating any data"
Your impression is incorrect. Snapshot creation is a fast operation because it does not copy any data: a snapshot is just a list of references to the files in HDFS that HBase is using at that moment. As you continue to write data into HBase, compactions read those old files and write new files.
Normally, those old files would then be deleted, but your snapshot still refers to them, so HBase keeps them around. Over time the live table diverges from the snapshot until they share no files at all, and the snapshot "costs" a full copy of the table. You can't get backups at no cost; eventually you have to pay to store that data. Please make sure to (re)read the section on snapshots in the HBase book https://hbase.apache.org/book.html#ops.snapshots -- it is very thorough and covers this topic in much more detail than I have here.
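You can see this cost on your own cluster: HFiles that only a snapshot still references get moved into HBase's archive directory. A rough sketch of how to inspect and reclaim that space (the table and snapshot names are examples, and the paths assume the default `hbase.rootdir` of `/hbase` -- adjust for your deployment):

```shell
# List existing snapshots (non-interactive HBase shell):
echo "list_snapshots" | hbase shell -n

# HFiles kept alive only by snapshots live under the archive directory;
# this shows how much space old snapshots are pinning for one table:
hdfs dfs -du -s -h /hbase/archive/data/default/my_table

# Deleting a snapshot you no longer need lets the cleaner chore
# eventually remove the archived files it was referencing:
echo "delete_snapshot 'my_table-week01'" | hbase shell -n
```

In practice this means your retention policy, not snapshot creation, determines the storage bill: keeping N weeks of snapshots on a heavily-compacted table can approach N full copies of the data.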
Longer term, you can consider the ongoing incremental backup-and-restore work as an alternative to snapshots: https://hortonworks.com/blog/coming-hdp-2-5-incremental-backup-restore-apache-hbase-apache-phoenix/