Support Questions

Find answers, ask questions, and share your expertise

HBase master fails to start

avatar
Contributor

Hello,

HBase master wouldn't start spitting the below error.

2016-05-13 03:56:53,431 FATAL [ip-172-31-21-148:16000.activeMasterManager] master.HMaster: Failed to become active master java.io.IOException: Timedout 300000ms waiting for namespace table to be assigned

It appears to be started okay in Ambari view, but the it shows stopped in about 3 minutes.

FYI. it's a Vanilla HDP-2.4.0.0-169 install plus some configuration changes on disks, HDFS, but I haven't created anything on Hbase.

The attached is the screenshot of the HBase view of Ambari.

4224-screen-shot-2016-05-13-at-61503-pm.png

Below are the log at the time of failure:

/var/log/hbase/hbase-hbase-master-serverhost.log

Could you please advise on where to look at to make it started?

Thanks,

2016-05-13 03:56:41,651 INFO  [ip-172-31-21-148.ap-southeast-2.compute.internal,16000,1463125847935_splitLogManager__ChoreService_1] master.SplitLogManager: total tasks = 1 unassigned = 0 tasks={/hbase-unsecure/splitWAL/WALs%2Fip-172-31-25-148.ap-southeast-2.compute.internal%2C16020%2C1463025987708-splitting%2Fip-172-31-25-148.ap-southeast-2.compute.internal%252C16020%252C1463025987708.default.1463030679002=last_update = 1463126191938 last_version = 399 cur_worker_name = ip-172-31-28-128.ap-southeast-2.compute.internal,16020,1463125906252 status = in_progress incarnation = 1 resubmits = 1 batch = installed = 1 done = 0 error = 0}
2016-05-13 03:56:47,651 INFO  [ip-172-31-21-148.ap-southeast-2.compute.internal,16000,1463125847935_splitLogManager__ChoreService_1] master.SplitLogManager: total tasks = 1 unassigned = 0 tasks={/hbase-unsecure/splitWAL/WALs%2Fip-172-31-25-148.ap-southeast-2.compute.internal%2C16020%2C1463025987708-splitting%2Fip-172-31-25-148.ap-southeast-2.compute.internal%252C16020%252C1463025987708.default.1463030679002=last_update = 1463126191938 last_version = 399 cur_worker_name = ip-172-31-28-128.ap-southeast-2.compute.internal,16020,1463125906252 status = in_progress incarnation = 1 resubmits = 1 batch = installed = 1 done = 0 error = 0}
2016-05-13 03:56:53,431 FATAL [ip-172-31-21-148:16000.activeMasterManager] master.HMaster: Failed to become active master
java.io.IOException: Timedout 300000ms waiting for namespace table to be assigned
    at org.apache.hadoop.hbase.master.TableNamespaceManager.start(TableNamespaceManager.java:104)
    at org.apache.hadoop.hbase.master.HMaster.initNamespace(HMaster.java:1005)
    at org.apache.hadoop.hbase.master.HMaster.finishActiveMasterInitialization(HMaster.java:799)
    at org.apache.hadoop.hbase.master.HMaster.access$500(HMaster.java:191)
    at org.apache.hadoop.hbase.master.HMaster$1.run(HMaster.java:1783)
    at java.lang.Thread.run(Thread.java:745)
2016-05-13 03:56:53,433 FATAL [ip-172-31-21-148:16000.activeMasterManager] master.HMaster: Master server abort: loaded coprocessors are: []
2016-05-13 03:56:53,433 FATAL [ip-172-31-21-148:16000.activeMasterManager] master.HMaster: Unhandled exception. Starting shutdown.
java.io.IOException: Timedout 300000ms waiting for namespace table to be assigned
    at org.apache.hadoop.hbase.master.TableNamespaceManager.start(TableNamespaceManager.java:104)
    at org.apache.hadoop.hbase.master.HMaster.initNamespace(HMaster.java:1005)
    at org.apache.hadoop.hbase.master.HMaster.finishActiveMasterInitialization(HMaster.java:799)
    at org.apache.hadoop.hbase.master.HMaster.access$500(HMaster.java:191)
    at org.apache.hadoop.hbase.master.HMaster$1.run(HMaster.java:1783)
    at java.lang.Thread.run(Thread.java:745)
2016-05-13 03:56:53,433 INFO  [ip-172-31-21-148:16000.activeMasterManager] regionserver.HRegionServer: STOPPED: Unhandled exception. Starting shutdown.
2016-05-13 03:56:53,433 INFO  [master/ip-172-31-21-148.ap-southeast-2.compute.internal/172.31.21.148:16000] regionserver.HRegionServer: Stopping infoServer
2016-05-13 03:56:53,456 INFO  [master/ip-172-31-21-148.ap-southeast-2.compute.internal/172.31.21.148:16000] mortbay.log: Stopped SelectChannelConnector@0.0.0.0:16010
2016-05-13 03:56:53,500 WARN  [MASTER_SERVER_OPERATIONS-ip-172-31-21-148:16000-1] master.SplitLogManager: Stopped while waiting for log splits to be completed
2016-05-13 03:56:53,500 WARN  [MASTER_SERVER_OPERATIONS-ip-172-31-21-148:16000-1] master.SplitLogManager: error while splitting logs in [hdfs://ip-172-31-23-165.ap-southeast-2.compute.internal:8020/apps/hbase/data/WALs/ip-172-31-25-148.ap-southeast-2.compute.internal,16020,1463025987708-splitting] installed = 1 but only 0 done
2016-05-13 03:56:53,501 ERROR [MASTER_SERVER_OPERATIONS-ip-172-31-21-148:16000-1] executor.EventHandler: Caught throwable while processing event M_SERVER_SHUTDOWN
java.io.IOException: failed log splitting for ip-172-31-25-148.ap-southeast-2.compute.internal,16020,1463025987708, will retry
    at org.apache.hadoop.hbase.master.handler.ServerShutdownHandler.resubmit(ServerShutdownHandler.java:357)
    at org.apache.hadoop.hbase.master.handler.ServerShutdownHandler.process(ServerShutdownHandler.java:220)
    at org.apache.hadoop.hbase.executor.EventHandler.run(EventHandler.java:128)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
    at java.lang.Thread.run(Thread.java:745)
Caused by: java.io.IOException: error or interrupted while splitting logs in [hdfs://ip-172-31-23-165.ap-southeast-2.compute.internal:8020/apps/hbase/data/WALs/ip-172-31-25-148.ap-southeast-2.compute.internal,16020,1463025987708-splitting] Task = installed = 1 done = 0 error = 0
    at org.apache.hadoop.hbase.master.SplitLogManager.splitLogDistributed(SplitLogManager.java:290)
    at org.apache.hadoop.hbase.master.MasterFileSystem.splitLog(MasterFileSystem.java:391)
    at org.apache.hadoop.hbase.master.MasterFileSystem.splitLog(MasterFileSystem.java:364)
    at org.apache.hadoop.hbase.master.MasterFileSystem.splitLog(MasterFileSystem.java:286)
    at org.apache.hadoop.hbase.master.handler.ServerShutdownHandler.process(ServerShutdownHandler.java:213)
    ... 4 more
2016-05-13 03:56:53,502 ERROR [MASTER_SERVER_OPERATIONS-ip-172-31-21-148:16000-3] executor.EventHandler: Caught throwable while processing event M_SERVER_SHUTDOWN
java.io.IOException: Server is stopped
    at org.apache.hadoop.hbase.master.handler.ServerShutdownHandler.process(ServerShutdownHandler.java:194)
    at org.apache.hadoop.hbase.executor.EventHandler.run(EventHandler.java:128)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
    at java.lang.Thread.run(Thread.java:745)
2016-05-13 03:56:53,556 INFO  [master/ip-172-31-21-148.ap-southeast-2.compute.internal/172.31.21.148:16000] procedure2.ProcedureExecutor: Stopping the procedure executor
2016-05-13 03:56:53,557 INFO  [master/ip-172-31-21-148.ap-southeast-2.compute.internal/172.31.21.148:16000] wal.WALProcedureStore: Stopping the WAL Procedure Store
2016-05-13 03:56:53,656 INFO  [ip-172-31-21-148.ap-southeast-2.compute.internal,16000,1463125847935_splitLogManager__ChoreService_1] master.SplitLogManager$TimeoutMonitor: Chore: SplitLogManager Timeout Monitor was stopped
2016-05-13 03:56:53,691 INFO  [master/ip-172-31-21-148.ap-southeast-2.compute.internal/172.31.21.148:16000] regionserver.HRegionServer: stopping server ip-172-31-21-148.ap-southeast-2.compute.internal,16000,1463125847935
2016-05-13 03:56:53,691 INFO  [master/ip-172-31-21-148.ap-southeast-2.compute.internal/172.31.21.148:16000] client.ConnectionManager$HConnectionImplementation: Closing zookeeper sessionid=0x154a916c5040001
2016-05-13 03:56:53,692 INFO  [master/ip-172-31-21-148.ap-southeast-2.compute.internal/172.31.21.148:16000] zookeeper.ZooKeeper: Session: 0x154a916c5040001 closed
2016-05-13 03:56:53,693 INFO  [master/ip-172-31-21-148.ap-southeast-2.compute.internal/172.31.21.148:16000-EventThread] zookeeper.ClientCnxn: EventThread shut down
2016-05-13 03:56:53,703 INFO  [master/ip-172-31-21-148.ap-southeast-2.compute.internal/172.31.21.148:16000] regionserver.HRegionServer: stopping server ip-172-31-21-148.ap-southeast-2.compute.internal,16000,1463125847935; all regions closed.
2016-05-13 03:56:53,703 INFO  [master/ip-172-31-21-148.ap-southeast-2.compute.internal/172.31.21.148:16000] hbase.ChoreService: Chore service for: ip-172-31-21-148.ap-southeast-2.compute.internal,16000,1463125847935 had [[ScheduledChore: Name: ip-172-31-21-148.ap-southeast-2.compute.internal,16000,1463125847935-BalancerChore Period: 300000 Unit: MILLISECONDS], [ScheduledChore: Name: ip-172-31-21-148.ap-southeast-2.compute.internal,16000,1463125847935-ClusterStatusChore Period: 60000 Unit: MILLISECONDS], [ScheduledChore: Name: ip-172-31-21-148.ap-southeast-2.compute.internal,16000,1463125847935-RegionNormalizerChore Period: 1800000 Unit: MILLISECONDS], [ScheduledChore: Name: HFileCleaner Period: 60000 Unit: MILLISECONDS], [ScheduledChore: Name: LogsCleaner Period: 60000 Unit: MILLISECONDS], [ScheduledChore: Name: CatalogJanitor-ip-172-31-21-148:16000 Period: 300000 Unit: MILLISECONDS]] on shutdown
2016-05-13 03:56:53,706 INFO  [master/ip-172-31-21-148.ap-southeast-2.compute.internal/172.31.21.148:16000] client.ConnectionManager$HConnectionImplementation: Closing zookeeper sessionid=0x154a916c5040002
2016-05-13 03:56:53,707 INFO  [master/ip-172-31-21-148.ap-southeast-2.compute.internal/172.31.21.148:16000] zookeeper.ZooKeeper: Session: 0x154a916c5040002 closed
2016-05-13 03:56:53,707 INFO  [master/ip-172-31-21-148.ap-southeast-2.compute.internal/172.31.21.148:16000] hbase.ChoreService: Chore service for: ip-172-31-21-148.ap-southeast-2.compute.internal,16000,1463125847935_splitLogManager_ had [] on shutdown
2016-05-13 03:56:53,707 INFO  [master/ip-172-31-21-148.ap-southeast-2.compute.internal/172.31.21.148:16000] flush.MasterFlushTableProcedureManager: stop: server shutting down.
2016-05-13 03:56:53,707 INFO  [master/ip-172-31-21-148.ap-southeast-2.compute.internal/172.31.21.148:16000] ipc.RpcServer: Stopping server on 16000
2016-05-13 03:56:53,707 INFO  [ip-172-31-21-148:16000.activeMasterManager-EventThread] zookeeper.ClientCnxn: EventThread shut down
2016-05-13 03:56:53,716 INFO  [master/ip-172-31-21-148.ap-southeast-2.compute.internal/172.31.21.148:16000] zookeeper.RecoverableZooKeeper: Node /hbase-unsecure/rs/ip-172-31-21-148.ap-southeast-2.compute.internal,16000,1463125847935 already deleted, retry=false
2016-05-13 03:56:53,718 INFO  [master/ip-172-31-21-148.ap-southeast-2.compute.internal/172.31.21.148:16000] zookeeper.ZooKeeper: Session: 0x154a916c5040000 closed
2016-05-13 03:56:53,718 INFO  [master/ip-172-31-21-148.ap-southeast-2.compute.internal/172.31.21.148:16000] regionserver.HRegionServer: stopping server ip-172-31-21-148.ap-southeast-2.compute.internal,16000,1463125847935; zookeeper connection closed.
2016-05-13 03:56:53,718 INFO  [master/ip-172-31-21-148.ap-southeast-2.compute.internal/172.31.21.148:16000] regionserver.HRegionServer: master/ip-172-31-21-148.ap-southeast-2.compute.internal/172.31.21.148:16000 exiting
2016-05-13 03:56:53,718 INFO  [main-EventThread] zookeeper.ClientCnxn: EventThread shut down
1 ACCEPTED SOLUTION

avatar
Super Guru

Seems to be an issue with stale WAL splits, try removing WAL's from below hdfs location and restart the Hbase master and region services.

/apps/hbase/data/WALs/

View solution in original post

12 REPLIES 12

avatar
Explorer

In our experience, this seems to be related to https://issues.apache.org/jira/browse/HBASE-14729 which is solved in HBase 2.0.0 (we are running 1.1.2 with hdp 2.5)

avatar
Super Guru

Please re-read the description on HBASE-14729. There no code-changes made by that JIRA issue -- it was closed as a duplicate of https://issues.apache.org/jira/browse/HBASE-14223 which is still outstanding.

avatar
New Contributor

so what's the best to resolve this issue for production