
oldWALs not getting cleared even with no replication

Expert Contributor

Last week I was resizing the HDP cluster, and for that I decommissioned a DataNode: stopped the DataNode and RegionServer, formatted and resized the volumes, then recommissioned the node and started the RegionServer.

Everything went well and the cluster is in good shape. But since that day the /apps/hbase/data/oldWALs folder has been filling up, and it isn't stopping.
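
For anyone watching the same thing, the growth is easy to track with a plain HDFS du (just a monitoring command, nothing HBase-specific):

# total size of the archived WALs; in my case it only ever went up
hdfs dfs -du -s -h /apps/hbase/data/oldWALs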

This is what I have tried so far, in order (a quick way to verify the values actually took effect is shown after the list):

  • add hbase.replication=false => restart (this worked for most people)
  • add hbase.master.logcleaner.ttl=10min => restart
  • add hbase.master.logcleaner.plugins=org.apache.hadoop.hbase.master.cleaner.TimeToLiveLogCleaner => restart
  • full cluster restart (HBase, HDFS, ZooKeeper, Ambari Metrics, everything)
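
To double-check that the restarts actually picked the values up, grepping the rendered config on the Master host should show them (the path below is the usual HDP location and is an assumption on my part, so adjust if yours differs; note that hbase.master.logcleaner.ttl is expressed in milliseconds):

# show the cleaner/replication related properties the Master is running with
grep -A1 -E "hbase.replication|logcleaner" /etc/hbase/conf/hbase-site.xml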

I tried running the following, but there are no log entries for any of the classes (LogCleaner, TimeToLiveLogCleaner, ReplicationLogCleaner):

cat /var/log/hbase/hbase-<hostname>.log.* | grep LogClean 

Replication is disabled, and I confirmed it by executing 'list_peers'; it said replication is disabled.
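
For reference, that check can also be run non-interactively; something along these lines should report that replication is disabled (or print an empty peer list) when it is off:

# pipe list_peers through the hbase shell without opening an interactive session
echo "list_peers" | hbase shell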

I also checked the RegionServer logs, and they have always been moving WALs to the oldWALs folder (since the beginning), but until now those files were apparently getting cleared from oldWALs. There is no trace of any Cleaner class in the HBase Master logs.

Can anyone please help me debug this further? I appreciate the help 🙂

Thanks!

EDIT:

I went further and enabled replication, and I see this in the logs:

2017-04-18 12:52:41,908 INFO  [hdpm01:16000.activeMasterManager] zookeeper.RecoverableZooKeeper: Process identifier=replicationLogCleaner connecting to ZooKeeper ensemble=<zk-address>:2181
2017-04-18 12:52:41,908 INFO  [hdpm01:16000.activeMasterManager] zookeeper.ZooKeeper: Initiating client connection, connectString=<zk>:2181 sessionTimeout=1800000 watcher=org.apache.hadoop.hbase.zookeeper.PendingWatcher@546df67f
2017-04-18 12:52:41,918 INFO  [hdpm01:16000.activeMasterManager-SendThread(hdps03.labs.ops.use1d.i.riva.co:2181)] zookeeper.ClientCnxn: Opening socket connection to server <zk>/10.10.220.138:2181. Will not attempt to authenticate using SASL (unknown error)
2017-04-18 12:52:41,920 INFO  [hdpm01:16000.activeMasterManager-SendThread(<zk>2181)] zookeeper.ClientCnxn: Socket connection established to <zk>/10.10.220.138:2181, initiating session
2017-04-18 12:52:41,924 INFO  [hdpm01:16000.activeMasterManager-SendThread(<zk>:2181)] zookeeper.ClientCnxn: Session establishment complete on server <zk>/10.10.220.138:2181, sessionid = 0x35b808847460065, negotiated timeout = 40000
2017-04-18 12:52:41,955 INFO  [hdpm01:16000.activeMasterManager] master.ServerManager: Waiting for region servers count to settle; currently checked in 0, slept for 0 ms, expecting minimum of 1, maximum of 2147483647, timeout of 4500 ms, interval of 1500 ms.

I was able to narrow it down further by enabling DEBUG logging. It says:

2017-04-18 13:22:42,046 DEBUG [hdpm01.labs.ops.use1b.i.riva.co,16000,1492519955260_ChoreService_1] master.BackupLogCleaner: Didn't find this log in hbase:backup, keeping: hdfs://<master>:8020/apps/hbase/data/oldWALs/<rs-address>%2C16020%2C1492001909933..meta.1492232550969.meta
...

2017-04-18 13:22:42,166 DEBUG [hdpm01.labs.ops.use1b.i.riva.co,16000,1492519955260_ChoreService_1] impl.BackupSystemTable: Check if WAL file has been already backed up in hbase:backup hdfs://<master>:8020/apps/hbase/data/oldWALs/<rs-address>%2C16020%2C1492434572100.default.1492501877892
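
For reference, the DEBUG lines above can be turned on by bumping the cleaner-related loggers in the Master's log4j.properties and restarting it; the package names below are my assumption based on the class names in the log output:

# log4j.properties on the HBase Master host (restart the Master afterwards)
log4j.logger.org.apache.hadoop.hbase.master.cleaner=DEBUG
log4j.logger.org.apache.hadoop.hbase.backup=DEBUG
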
1 ACCEPTED SOLUTION

Expert Contributor

The last DEBUG lines helped me identify the cause: it was the HBase backup utility that was preventing the oldWALs from being removed.

The command below failed:

hbase backup full <s3-url> -t <table>

and that was verified using

hbase backup history

So to remove the failed backups

hbase backup delete <backup-id>

and the next moment, it all cleared 😄
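
Putting it together, the cleanup was roughly this sequence (the backup ID is a placeholder taken from the history output; the last command is just to watch the folder shrink once the cleaner chore runs again):

# list backups and note the IDs of the failed ones
hbase backup history
# delete each failed backup by its ID
hbase backup delete <backup-id>
# oldWALs should start shrinking on the next cleaner chore run
hdfs dfs -du -s -h /apps/hbase/data/oldWALs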

This was a pretty edge case and it was mentioned nowhere on the internet. Hope this helps someone.


3 REPLIES

Rising Star

@sanket patel intermittent ZooKeeper issues can lead to cleaner chores failing.

https://issues.apache.org/jira/browse/HBASE-15234

Expert Contributor

Thanks @ssingla, I found the issue. And thanks for pointing out something related; it might help in the future.
