oldWALs not getting cleared even with no replication
Labels: Apache Hadoop, Apache HBase
Created ‎04-18-2017 05:56 PM
Last week I was resizing the HDP cluster, and for that I decommissioned a datanode: stopped the DataNode and RegionServer, formatted and resized the volumes, then recommissioned the node and started the RegionServer again.
Everything went well and the cluster is in good shape. But since that day the /apps/hbase/data/oldWALs folder has been filling up and it's not stopping.
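(For anyone sizing up the same problem: the growth can be measured directly from HDFS. The path below is the default HDP layout; adjust it if your hbase.rootdir differs.)
# File count and total size of the oldWALs directory
hdfs dfs -count /apps/hbase/data/oldWALs
hdfs dfs -du -s -h /apps/hbase/data/oldWALs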
This is what I have tried so far, in order (a quick way to verify the values actually took effect is sketched right after this list):
- add hbase.replication=false => restart (this worked for most people)
- add hbase.master.logcleaner.ttl=10min => restart
- add hbase.master.logcleaner.plugins=org.apache.hadoop.hbase.master.cleaner.TimeToLiveLogCleaner => restart
- full cluster restart (HBase, HDFS, ZooKeeper, Ambari Metrics, everything)
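In case it helps anyone following the same steps, this is a quick sanity check of those values; note that hbase.master.logcleaner.ttl expects milliseconds, so 10 minutes should be set as 600000. The path assumes the HDP default config directory.
# Check the cleaner-related properties in the deployed config (adjust the conf dir if yours differs)
grep -A1 -E 'hbase.replication|logcleaner' /etc/hbase/conf/hbase-site.xml
# Read a single property the way the HBase bin scripts do
hbase org.apache.hadoop.hbase.util.HBaseConfTool hbase.master.logcleaner.ttl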
I ran the following, but there are no log entries for any of the cleaner classes (LogCleaner, TimeToLiveLogCleaner, ReplicationLogCleaner):
cat /var/log/hbase/hbase-<hostname>.log.* | grep LogClean
Replication is disabled; I confirmed this by running 'list_peers' in the HBase shell, and it reported that replication is disabled.
I also checked the RegionServer logs: it has always been moving WALs to the oldWALs folder (since the beginning), but it seems they used to get cleared from oldWALs. There is no trace of any Cleaner class in the HBase master logs.
Can anyone please help me debug this further? I appreciate the help 🙂
Thanks!
EDIT:
As a further test I enabled replication, and I see this in the logs:
2017-04-18 12:52:41,908 INFO [hdpm01:16000.activeMasterManager] zookeeper.RecoverableZooKeeper: Process identifier=replicationLogCleaner connecting to ZooKeeper ensemble=<zk-address>:2181
2017-04-18 12:52:41,908 INFO [hdpm01:16000.activeMasterManager] zookeeper.ZooKeeper: Initiating client connection, connectString=<zk>:2181 sessionTimeout=1800000 watcher=org.apache.hadoop.hbase.zookeeper.PendingWatcher@546df67f
2017-04-18 12:52:41,918 INFO [hdpm01:16000.activeMasterManager-SendThread(hdps03.labs.ops.use1d.i.riva.co:2181)] zookeeper.ClientCnxn: Opening socket connection to server <zk>/10.10.220.138:2181. Will not attempt to authenticate using SASL (unknown error)
2017-04-18 12:52:41,920 INFO [hdpm01:16000.activeMasterManager-SendThread(<zk>2181)] zookeeper.ClientCnxn: Socket connection established to <zk>/10.10.220.138:2181, initiating session
2017-04-18 12:52:41,924 INFO [hdpm01:16000.activeMasterManager-SendThread(<zk>:2181)] zookeeper.ClientCnxn: Session establishment complete on server <zk>/10.10.220.138:2181, sessionid = 0x35b808847460065, negotiated timeout = 40000
2017-04-18 12:52:41,955 INFO [hdpm01:16000.activeMasterManager] master.ServerManager: Waiting for region servers count to settle; currently checked in 0, slept for 0 ms, expecting minimum of 1, maximum of 2147483647, timeout of 4500 ms, interval of 1500 ms.
I was able to narrow it down further by enabling DEBUG logs. It says
2017-04-18 13:22:42,046 DEBUG [hdpm01.labs.ops.use1b.i.riva.co,16000,1492519955260_ChoreService_1] master.BackupLogCleaner: Didn't find this log in hbase:backup, keeping: hdfs://<master>:8020/apps/hbase/data/oldWALs/<rs-address>%2C16020%2C1492001909933..meta.1492232550969.meta
...
2017-04-18 13:22:42,166 DEBUG [hdpm01.labs.ops.use1b.i.riva.co,16000,1492519955260_ChoreService_1] impl.BackupSystemTable: Check if WAL file has been already backed up in hbase:backup hdfs://<master>:8020/apps/hbase/data/oldWALs/<rs-address>%2C16020%2C1492434572100.default.1492501877892
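Since BackupLogCleaner decides whether to keep a WAL by checking the backup system table, scanning that table from the HBase shell shows what it is still tracking (table name taken from the DEBUG output above; this is just a quick sketch):
# Scan the backup system table that BackupLogCleaner consults
echo "scan 'hbase:backup'" | hbase shell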
Created ‎04-19-2017 06:32 AM
The last debug lines helped me identify the cause: it was the hbase backup utility that was preventing the oldWALs from being removed.
The command below had failed:
hbase backup full <s3-url> -t <table>
which I verified using
hbase backup history
So, to remove the failed backups, I ran:
hbase backup delete <backup-id>
and the next moment it all cleared 😄
This was a pretty edge case and it wasn't mentioned anywhere on the internet. Hope this helps someone.
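One way to confirm the cleanup actually kicks in is to re-check the directory size after the next LogCleaner chore run (same default path as in the question):
# oldWALs should start shrinking once the failed backup session is gone
hdfs dfs -du -s -h /apps/hbase/data/oldWALs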
Created ‎04-18-2017 06:38 PM
@sanket patel Intermittent ZooKeeper (zk) issues can lead to the cleaner chores failing.
Created ‎04-19-2017 06:43 AM
Thanks @ssingla, I found the issue. And thanks for pointing out something related; it might help in the future.
