Created on 05-13-2016 08:05 AM - edited 08-18-2019 06:25 AM
Hello,
The HBase master won't start and keeps spitting out the error below.
2016-05-13 03:56:53,431 FATAL [ip-172-31-21-148:16000.activeMasterManager] master.HMaster: Failed to become active master java.io.IOException: Timedout 300000ms waiting for namespace table to be assigned
It appears to start okay in the Ambari view, but it shows as stopped after about 3 minutes.
FYI, it's a vanilla HDP-2.4.0.0-169 install plus some configuration changes for disks and HDFS, but I haven't created anything in HBase.
Attached is a screenshot of the HBase view in Ambari.
Below is the log from the time of failure:
/var/log/hbase/hbase-hbase-master-serverhost.log
Could you please advise where to look to get it started?
Thanks,
2016-05-13 03:56:41,651 INFO [ip-172-31-21-148.ap-southeast-2.compute.internal,16000,1463125847935_splitLogManager__ChoreService_1] master.SplitLogManager: total tasks = 1 unassigned = 0 tasks={/hbase-unsecure/splitWAL/WALs%2Fip-172-31-25-148.ap-southeast-2.compute.internal%2C16020%2C1463025987708-splitting%2Fip-172-31-25-148.ap-southeast-2.compute.internal%252C16020%252C1463025987708.default.1463030679002=last_update = 1463126191938 last_version = 399 cur_worker_name = ip-172-31-28-128.ap-southeast-2.compute.internal,16020,1463125906252 status = in_progress incarnation = 1 resubmits = 1 batch = installed = 1 done = 0 error = 0}
2016-05-13 03:56:47,651 INFO [ip-172-31-21-148.ap-southeast-2.compute.internal,16000,1463125847935_splitLogManager__ChoreService_1] master.SplitLogManager: total tasks = 1 unassigned = 0 tasks={/hbase-unsecure/splitWAL/WALs%2Fip-172-31-25-148.ap-southeast-2.compute.internal%2C16020%2C1463025987708-splitting%2Fip-172-31-25-148.ap-southeast-2.compute.internal%252C16020%252C1463025987708.default.1463030679002=last_update = 1463126191938 last_version = 399 cur_worker_name = ip-172-31-28-128.ap-southeast-2.compute.internal,16020,1463125906252 status = in_progress incarnation = 1 resubmits = 1 batch = installed = 1 done = 0 error = 0}
2016-05-13 03:56:53,431 FATAL [ip-172-31-21-148:16000.activeMasterManager] master.HMaster: Failed to become active master
java.io.IOException: Timedout 300000ms waiting for namespace table to be assigned
    at org.apache.hadoop.hbase.master.TableNamespaceManager.start(TableNamespaceManager.java:104)
    at org.apache.hadoop.hbase.master.HMaster.initNamespace(HMaster.java:1005)
    at org.apache.hadoop.hbase.master.HMaster.finishActiveMasterInitialization(HMaster.java:799)
    at org.apache.hadoop.hbase.master.HMaster.access$500(HMaster.java:191)
    at org.apache.hadoop.hbase.master.HMaster$1.run(HMaster.java:1783)
    at java.lang.Thread.run(Thread.java:745)
2016-05-13 03:56:53,433 FATAL [ip-172-31-21-148:16000.activeMasterManager] master.HMaster: Master server abort: loaded coprocessors are: []
2016-05-13 03:56:53,433 FATAL [ip-172-31-21-148:16000.activeMasterManager] master.HMaster: Unhandled exception. Starting shutdown.
java.io.IOException: Timedout 300000ms waiting for namespace table to be assigned
    at org.apache.hadoop.hbase.master.TableNamespaceManager.start(TableNamespaceManager.java:104)
    at org.apache.hadoop.hbase.master.HMaster.initNamespace(HMaster.java:1005)
    at org.apache.hadoop.hbase.master.HMaster.finishActiveMasterInitialization(HMaster.java:799)
    at org.apache.hadoop.hbase.master.HMaster.access$500(HMaster.java:191)
    at org.apache.hadoop.hbase.master.HMaster$1.run(HMaster.java:1783)
    at java.lang.Thread.run(Thread.java:745)
2016-05-13 03:56:53,433 INFO [ip-172-31-21-148:16000.activeMasterManager] regionserver.HRegionServer: STOPPED: Unhandled exception. Starting shutdown.
2016-05-13 03:56:53,433 INFO [master/ip-172-31-21-148.ap-southeast-2.compute.internal/172.31.21.148:16000] regionserver.HRegionServer: Stopping infoServer
2016-05-13 03:56:53,456 INFO [master/ip-172-31-21-148.ap-southeast-2.compute.internal/172.31.21.148:16000] mortbay.log: Stopped SelectChannelConnector@0.0.0.0:16010
2016-05-13 03:56:53,500 WARN [MASTER_SERVER_OPERATIONS-ip-172-31-21-148:16000-1] master.SplitLogManager: Stopped while waiting for log splits to be completed
2016-05-13 03:56:53,500 WARN [MASTER_SERVER_OPERATIONS-ip-172-31-21-148:16000-1] master.SplitLogManager: error while splitting logs in [hdfs://ip-172-31-23-165.ap-southeast-2.compute.internal:8020/apps/hbase/data/WALs/ip-172-31-25-148.ap-southeast-2.compute.internal,16020,1463025987708-splitting] installed = 1 but only 0 done
2016-05-13 03:56:53,501 ERROR [MASTER_SERVER_OPERATIONS-ip-172-31-21-148:16000-1] executor.EventHandler: Caught throwable while processing event M_SERVER_SHUTDOWN
java.io.IOException: failed log splitting for ip-172-31-25-148.ap-southeast-2.compute.internal,16020,1463025987708, will retry
    at org.apache.hadoop.hbase.master.handler.ServerShutdownHandler.resubmit(ServerShutdownHandler.java:357)
    at org.apache.hadoop.hbase.master.handler.ServerShutdownHandler.process(ServerShutdownHandler.java:220)
    at org.apache.hadoop.hbase.executor.EventHandler.run(EventHandler.java:128)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
    at java.lang.Thread.run(Thread.java:745)
Caused by: java.io.IOException: error or interrupted while splitting logs in [hdfs://ip-172-31-23-165.ap-southeast-2.compute.internal:8020/apps/hbase/data/WALs/ip-172-31-25-148.ap-southeast-2.compute.internal,16020,1463025987708-splitting] Task = installed = 1 done = 0 error = 0
    at org.apache.hadoop.hbase.master.SplitLogManager.splitLogDistributed(SplitLogManager.java:290)
    at org.apache.hadoop.hbase.master.MasterFileSystem.splitLog(MasterFileSystem.java:391)
    at org.apache.hadoop.hbase.master.MasterFileSystem.splitLog(MasterFileSystem.java:364)
    at org.apache.hadoop.hbase.master.MasterFileSystem.splitLog(MasterFileSystem.java:286)
    at org.apache.hadoop.hbase.master.handler.ServerShutdownHandler.process(ServerShutdownHandler.java:213)
    ... 4 more
2016-05-13 03:56:53,502 ERROR [MASTER_SERVER_OPERATIONS-ip-172-31-21-148:16000-3] executor.EventHandler: Caught throwable while processing event M_SERVER_SHUTDOWN
java.io.IOException: Server is stopped
    at org.apache.hadoop.hbase.master.handler.ServerShutdownHandler.process(ServerShutdownHandler.java:194)
    at org.apache.hadoop.hbase.executor.EventHandler.run(EventHandler.java:128)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
    at java.lang.Thread.run(Thread.java:745)
2016-05-13 03:56:53,556 INFO [master/ip-172-31-21-148.ap-southeast-2.compute.internal/172.31.21.148:16000] procedure2.ProcedureExecutor: Stopping the procedure executor
2016-05-13 03:56:53,557 INFO [master/ip-172-31-21-148.ap-southeast-2.compute.internal/172.31.21.148:16000] wal.WALProcedureStore: Stopping the WAL Procedure Store
2016-05-13 03:56:53,656 INFO [ip-172-31-21-148.ap-southeast-2.compute.internal,16000,1463125847935_splitLogManager__ChoreService_1] master.SplitLogManager$TimeoutMonitor: Chore: SplitLogManager Timeout Monitor was stopped
2016-05-13 03:56:53,691 INFO [master/ip-172-31-21-148.ap-southeast-2.compute.internal/172.31.21.148:16000] regionserver.HRegionServer: stopping server ip-172-31-21-148.ap-southeast-2.compute.internal,16000,1463125847935
2016-05-13 03:56:53,691 INFO [master/ip-172-31-21-148.ap-southeast-2.compute.internal/172.31.21.148:16000] client.ConnectionManager$HConnectionImplementation: Closing zookeeper sessionid=0x154a916c5040001
2016-05-13 03:56:53,692 INFO [master/ip-172-31-21-148.ap-southeast-2.compute.internal/172.31.21.148:16000] zookeeper.ZooKeeper: Session: 0x154a916c5040001 closed
2016-05-13 03:56:53,693 INFO [master/ip-172-31-21-148.ap-southeast-2.compute.internal/172.31.21.148:16000-EventThread] zookeeper.ClientCnxn: EventThread shut down
2016-05-13 03:56:53,703 INFO [master/ip-172-31-21-148.ap-southeast-2.compute.internal/172.31.21.148:16000] regionserver.HRegionServer: stopping server ip-172-31-21-148.ap-southeast-2.compute.internal,16000,1463125847935; all regions closed.
2016-05-13 03:56:53,703 INFO [master/ip-172-31-21-148.ap-southeast-2.compute.internal/172.31.21.148:16000] hbase.ChoreService: Chore service for: ip-172-31-21-148.ap-southeast-2.compute.internal,16000,1463125847935 had [[ScheduledChore: Name: ip-172-31-21-148.ap-southeast-2.compute.internal,16000,1463125847935-BalancerChore Period: 300000 Unit: MILLISECONDS], [ScheduledChore: Name: ip-172-31-21-148.ap-southeast-2.compute.internal,16000,1463125847935-ClusterStatusChore Period: 60000 Unit: MILLISECONDS], [ScheduledChore: Name: ip-172-31-21-148.ap-southeast-2.compute.internal,16000,1463125847935-RegionNormalizerChore Period: 1800000 Unit: MILLISECONDS], [ScheduledChore: Name: HFileCleaner Period: 60000 Unit: MILLISECONDS], [ScheduledChore: Name: LogsCleaner Period: 60000 Unit: MILLISECONDS], [ScheduledChore: Name: CatalogJanitor-ip-172-31-21-148:16000 Period: 300000 Unit: MILLISECONDS]] on shutdown
2016-05-13 03:56:53,706 INFO [master/ip-172-31-21-148.ap-southeast-2.compute.internal/172.31.21.148:16000] client.ConnectionManager$HConnectionImplementation: Closing zookeeper sessionid=0x154a916c5040002
2016-05-13 03:56:53,707 INFO [master/ip-172-31-21-148.ap-southeast-2.compute.internal/172.31.21.148:16000] zookeeper.ZooKeeper: Session: 0x154a916c5040002 closed
2016-05-13 03:56:53,707 INFO [master/ip-172-31-21-148.ap-southeast-2.compute.internal/172.31.21.148:16000] hbase.ChoreService: Chore service for: ip-172-31-21-148.ap-southeast-2.compute.internal,16000,1463125847935_splitLogManager_ had [] on shutdown
2016-05-13 03:56:53,707 INFO [master/ip-172-31-21-148.ap-southeast-2.compute.internal/172.31.21.148:16000] flush.MasterFlushTableProcedureManager: stop: server shutting down.
2016-05-13 03:56:53,707 INFO [master/ip-172-31-21-148.ap-southeast-2.compute.internal/172.31.21.148:16000] ipc.RpcServer: Stopping server on 16000
2016-05-13 03:56:53,707 INFO [ip-172-31-21-148:16000.activeMasterManager-EventThread] zookeeper.ClientCnxn: EventThread shut down
2016-05-13 03:56:53,716 INFO [master/ip-172-31-21-148.ap-southeast-2.compute.internal/172.31.21.148:16000] zookeeper.RecoverableZooKeeper: Node /hbase-unsecure/rs/ip-172-31-21-148.ap-southeast-2.compute.internal,16000,1463125847935 already deleted, retry=false
2016-05-13 03:56:53,718 INFO [master/ip-172-31-21-148.ap-southeast-2.compute.internal/172.31.21.148:16000] zookeeper.ZooKeeper: Session: 0x154a916c5040000 closed
2016-05-13 03:56:53,718 INFO [master/ip-172-31-21-148.ap-southeast-2.compute.internal/172.31.21.148:16000] regionserver.HRegionServer: stopping server ip-172-31-21-148.ap-southeast-2.compute.internal,16000,1463125847935; zookeeper connection closed.
2016-05-13 03:56:53,718 INFO [master/ip-172-31-21-148.ap-southeast-2.compute.internal/172.31.21.148:16000] regionserver.HRegionServer: master/ip-172-31-21-148.ap-southeast-2.compute.internal/172.31.21.148:16000 exiting
2016-05-13 03:56:53,718 INFO [main-EventThread] zookeeper.ClientCnxn: EventThread shut down
Created 05-13-2016 08:45 AM
This seems to be an issue with stale WAL splits. Try removing the WALs from the HDFS location below, then restart the HBase master and RegionServer services.
/apps/hbase/data/WALs/
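For reference, a rough sketch of how that could be done from the command line, assuming a default HDP layout (the hbase service user and the WALs.bak name are just my assumptions, not something from this thread). Moving the directory aside instead of deleting it keeps a copy around in case anything needs to be recovered later:

# list the WAL directories, including any leftover *-splitting directories
sudo -u hbase hdfs dfs -ls /apps/hbase/data/WALs
# move the whole WAL directory aside rather than deleting it outright
sudo -u hbase hdfs dfs -mv /apps/hbase/data/WALs /apps/hbase/data/WALs.bak

After that, restart the HBase master and RegionServers from Ambari; HBase should recreate the WAL directory as needed.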
Created 05-13-2016 11:56 AM
Hi, please accept this as an answer if that resolved the issue.
Created 05-13-2016 09:44 PM
This did the trick.
After removing the WAL files and restarting the HBase components, the HBase master doesn't go down anymore.
I took some time to double-check everything before accepting the answer.
Thank you so much.
Created 05-16-2016 01:18 PM
Please be aware that removing the write-ahead logs causes data loss. Removing them is almost never an ideal solution (unless there is no data in HBase).
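If the WAL files were moved aside rather than deleted, it should in principle be possible to replay the edits they contain later with HBase's WALPlayer MapReduce job. A rough, illustrative sketch only; the backup path and table name below are placeholders, not something taken from this thread:

# replay edits from a saved WAL directory back into an existing table (placeholder path and table)
hbase org.apache.hadoop.hbase.mapreduce.WALPlayer /apps/hbase/data/WALs.bak my_table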
Created 05-17-2016 03:15 AM
Thanks Josh,
In the above case there was no data to preserve, but I will keep in mind that removing WAL files causes data loss.
I still don't understand why this happened in the first place, though. Could unmounting the HDFS data directory and remounting it at a new location, or a forceful shutdown of the cluster hosts, cause this?
I believe that this could be investigated separately.
Much appreciated.
Created 10-21-2016 05:27 PM
We got the same issue. I removed the WAL files and restarted the cluster, but the issue still persists. What might be the reason?
Created 11-25-2016 11:14 AM
It solved my problem. Thanks a lot, Jitendra Yadav.
Created 12-16-2016 10:03 AM
I am concerned that the accepted answer includes advice to delete data. In our case, the master needed a few restarts before it settled which WALEdits were still in conflict and stopped getting TimeoutExceptions back from the DataNodes. Eventually it resolved all the outstanding edits and went back to its normal mode of operation.
Created 12-16-2016 03:44 PM
Your concern is very valid, Steen. Oftentimes, the root problem is latency in the cluster recovering from a partially failed state. The system can often recover from this on its own, as you have noticed. Deleting HBase data should *only* ever be considered when the cluster is in an empty state. It is not an acceptable answer for "real" systems.
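If you want to keep an eye on that recovery rather than delete anything, one rough way (assuming the unsecured znode root /hbase-unsecure that appears in the log above) is to watch the outstanding distributed-log-split tasks and the leftover -splitting directories:

# outstanding split-WAL tasks tracked in ZooKeeper
hbase zkcli ls /hbase-unsecure/splitWAL
# server WAL directories still waiting to be split
hdfs dfs -ls /apps/hbase/data/WALs | grep splitting

Once both are empty, the master should be able to finish initialization on its own.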