We are having an issue with running out of disk space on HDFS. A little investigation has shown that the largest directory, by far, is /apps/hbase/data/archive. As I understand it, this directory keeps HFiles that still need to be retained, typically because of snapshots. I know that the most common cause of a large archive directory is having too many snapshots.
However, snapshots do not seem to be the issue here: /apps/hbase/data/archive is a little larger than 110 TB, while the sum of all of our snapshots is under 50 TB.
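For reference, here is how those sizes can be measured. The archive path is the one from this cluster; the SnapshotInfo tool is part of HBase, but exact flags can vary by version, and the snapshot name below is a placeholder:

```shell
# Total size of the archive directory (path from this cluster)
hdfs dfs -du -s -h /apps/hbase/data/archive

# List all snapshots known to HBase
hbase org.apache.hadoop.hbase.snapshot.SnapshotInfo -list-snapshots

# Per-snapshot stats: how much data it references, shared vs. archived
# ('my_snapshot' is a placeholder name)
hbase org.apache.hadoop.hbase.snapshot.SnapshotInfo -snapshot my_snapshot -stats
```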
We have not set hbase.master.hfilecleaner.ttl, but I have read that the default is 5 minutes, so the TTL is definitely not the culprit for many of the HFiles we have, which frequently date back many months.
What steps can I follow to try to reduce this usage?
Depending on the version of HDP you're running, the backup and restore work may be running by default. You can try to set hbase.backup.enable to false in hbase-site.xml and see if that will automatically clean up the files.
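If you do decide to disable it, the property would look like this in hbase-site.xml (a sketch; the setting is typically picked up after a Master restart):

```xml
<property>
  <name>hbase.backup.enable</name>
  <value>false</value>
</property>
```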
To debug this further, you can enable TRACE logging on the package "org.apache.hadoop.hbase.master.cleaner" in the active HBase Master. Hopefully, this will tell you which Cleaner implementation (there are multiple running in the Master) is requiring that the files be kept.
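For example, in the Master's log4j.properties (the logger name is simply the package mentioned above):

```properties
# Enable TRACE for the cleaner chores in the active HBase Master
log4j.logger.org.apache.hadoop.hbase.master.cleaner=TRACE
```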
Thanks for the reply Josh. hbase.backup.enable is not defined on our cluster, so it defaults to true. I'll turn this to false and then see if things get to a more reasonable level. If that doesn't work I will turn on the TRACE logging, and update with extra information.
If it changes anything, we're running HDP 220.127.116.11
@Josh Elser Disabling hbase backups did not improve the situation. After sifting through the logs for the cleaner, I have identified the following series of warnings:
2017-11-13 06:29:11,808 WARN [server01,16000,1510545850564_ChoreService_1] master.ReplicationHFileCleaner: ReplicationHFileCleaner received abort, ignoring. Reason: Failed to get stat of replication hfile references node.
2017-11-13 06:29:11,808 WARN [server01,16000,1510545850564_ChoreService_1] master.ReplicationHFileCleaner: Failed to read hfile references from zookeeper, skipping checking deletable files
2017-11-13 06:29:11,808 WARN [server01,16000,1510545850564_ChoreService_1] zookeeper.ZKUtil: replicationHFileCleaner-0x15fb38de0a0007a, quorum=server01:2181,server02:2181,server03:2181, baseZNode=/hbase-unsecure Unable to get data of znode /hbase-unsecure/replication/hfile-refs
org.apache.zookeeper.KeeperException$SessionExpiredException: KeeperErrorCode = Session expired for /hbase-unsecure/replication/hfile-refs
These warnings repeat multiple times, so it appears that the replication HFile cleaner is failing due to a ZooKeeper issue. We recently had some fairly severe ZooKeeper problems, but apart from this, things have returned to a completely stable state.
Do you have any advice for how I can move forward, either with forcing the HFile cleaner to run or with repairing the state of zookeeper?
Session expiration is often hard to track down. It can be caused by JVM pauses (due to garbage collection) on either the client (HBase Master) or the server (ZooKeeper server), or it can be the result of a znode with an inordinately large number of children.
The brute-force approach would be to disable your replication process, (potentially) drop the replication znode, re-enable replication, and then sync up the tables with ExportSnapshot or CopyTable. This would rule out the data in ZooKeeper as the problem.
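A hedged sketch of that brute-force sequence. The peer id and snapshot/cluster names are placeholders, the znode path comes from the log lines above, and the zkCli.sh location assumes a standard HDP layout:

```shell
# In the hbase shell: stop replication to the peer before touching ZooKeeper
# (peer id '1' is a placeholder)
echo "disable_peer '1'" | hbase shell

# Optionally drop the replication znode (path from the logs above).
# Only do this once replication is definitely stopped.
/usr/hdp/current/zookeeper-client/bin/zkCli.sh -server server01:2181 \
  rmr /hbase-unsecure/replication

# Re-enable replication, then re-sync a table on the peer cluster
# ('my_snapshot' and 'peer-cluster' are placeholders)
echo "enable_peer '1'" | hbase shell
hbase org.apache.hadoop.hbase.snapshot.ExportSnapshot -snapshot my_snapshot \
  -copy-to hdfs://peer-cluster:8020/apps/hbase/data
```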
The other course of action would be to look more closely at the Master log and the ZooKeeper server log to understand why the ZK session is expiring (see https://zookeeper.apache.org/doc/trunk/images/state_dia.jpg for more details on the session lifecycle). A good first step would be checking the number of znodes under /hbase-unsecure/replication.
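A quick way to do that first check, with the quorum host and base znode taken from the log lines above (the zkCli.sh path assumes a standard HDP layout):

```shell
# List the replication znodes; an enormous child list under hfile-refs
# would point at the "too many children" cause of session expiration
/usr/hdp/current/zookeeper-client/bin/zkCli.sh -server server01:2181 \
  ls /hbase-unsecure/replication
/usr/hdp/current/zookeeper-client/bin/zkCli.sh -server server01:2181 \
  ls /hbase-unsecure/replication/hfile-refs
```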