HBase master fails to start

Contributor

Hello,

The HBase master won't start and keeps throwing the error below.

2016-05-13 03:56:53,431 FATAL [ip-172-31-21-148:16000.activeMasterManager] master.HMaster: Failed to become active master java.io.IOException: Timedout 300000ms waiting for namespace table to be assigned

It appears to start okay in the Ambari view, but it shows as stopped after about 3 minutes.

FYI, it's a vanilla HDP-2.4.0.0-169 install with some configuration changes to disks and HDFS, but I haven't created anything in HBase yet.

Attached is a screenshot of the HBase view in Ambari.

4224-screen-shot-2016-05-13-at-61503-pm.png

Below is the log at the time of the failure:

/var/log/hbase/hbase-hbase-master-serverhost.log

Could you please advise on where to look to get it started?

Thanks,

2016-05-13 03:56:41,651 INFO  [ip-172-31-21-148.ap-southeast-2.compute.internal,16000,1463125847935_splitLogManager__ChoreService_1] master.SplitLogManager: total tasks = 1 unassigned = 0 tasks={/hbase-unsecure/splitWAL/WALs%2Fip-172-31-25-148.ap-southeast-2.compute.internal%2C16020%2C1463025987708-splitting%2Fip-172-31-25-148.ap-southeast-2.compute.internal%252C16020%252C1463025987708.default.1463030679002=last_update = 1463126191938 last_version = 399 cur_worker_name = ip-172-31-28-128.ap-southeast-2.compute.internal,16020,1463125906252 status = in_progress incarnation = 1 resubmits = 1 batch = installed = 1 done = 0 error = 0}
2016-05-13 03:56:47,651 INFO  [ip-172-31-21-148.ap-southeast-2.compute.internal,16000,1463125847935_splitLogManager__ChoreService_1] master.SplitLogManager: total tasks = 1 unassigned = 0 tasks={/hbase-unsecure/splitWAL/WALs%2Fip-172-31-25-148.ap-southeast-2.compute.internal%2C16020%2C1463025987708-splitting%2Fip-172-31-25-148.ap-southeast-2.compute.internal%252C16020%252C1463025987708.default.1463030679002=last_update = 1463126191938 last_version = 399 cur_worker_name = ip-172-31-28-128.ap-southeast-2.compute.internal,16020,1463125906252 status = in_progress incarnation = 1 resubmits = 1 batch = installed = 1 done = 0 error = 0}
2016-05-13 03:56:53,431 FATAL [ip-172-31-21-148:16000.activeMasterManager] master.HMaster: Failed to become active master
java.io.IOException: Timedout 300000ms waiting for namespace table to be assigned
    at org.apache.hadoop.hbase.master.TableNamespaceManager.start(TableNamespaceManager.java:104)
    at org.apache.hadoop.hbase.master.HMaster.initNamespace(HMaster.java:1005)
    at org.apache.hadoop.hbase.master.HMaster.finishActiveMasterInitialization(HMaster.java:799)
    at org.apache.hadoop.hbase.master.HMaster.access$500(HMaster.java:191)
    at org.apache.hadoop.hbase.master.HMaster$1.run(HMaster.java:1783)
    at java.lang.Thread.run(Thread.java:745)
2016-05-13 03:56:53,433 FATAL [ip-172-31-21-148:16000.activeMasterManager] master.HMaster: Master server abort: loaded coprocessors are: []
2016-05-13 03:56:53,433 FATAL [ip-172-31-21-148:16000.activeMasterManager] master.HMaster: Unhandled exception. Starting shutdown.
java.io.IOException: Timedout 300000ms waiting for namespace table to be assigned
    at org.apache.hadoop.hbase.master.TableNamespaceManager.start(TableNamespaceManager.java:104)
    at org.apache.hadoop.hbase.master.HMaster.initNamespace(HMaster.java:1005)
    at org.apache.hadoop.hbase.master.HMaster.finishActiveMasterInitialization(HMaster.java:799)
    at org.apache.hadoop.hbase.master.HMaster.access$500(HMaster.java:191)
    at org.apache.hadoop.hbase.master.HMaster$1.run(HMaster.java:1783)
    at java.lang.Thread.run(Thread.java:745)
2016-05-13 03:56:53,433 INFO  [ip-172-31-21-148:16000.activeMasterManager] regionserver.HRegionServer: STOPPED: Unhandled exception. Starting shutdown.
2016-05-13 03:56:53,433 INFO  [master/ip-172-31-21-148.ap-southeast-2.compute.internal/172.31.21.148:16000] regionserver.HRegionServer: Stopping infoServer
2016-05-13 03:56:53,456 INFO  [master/ip-172-31-21-148.ap-southeast-2.compute.internal/172.31.21.148:16000] mortbay.log: Stopped SelectChannelConnector@0.0.0.0:16010
2016-05-13 03:56:53,500 WARN  [MASTER_SERVER_OPERATIONS-ip-172-31-21-148:16000-1] master.SplitLogManager: Stopped while waiting for log splits to be completed
2016-05-13 03:56:53,500 WARN  [MASTER_SERVER_OPERATIONS-ip-172-31-21-148:16000-1] master.SplitLogManager: error while splitting logs in [hdfs://ip-172-31-23-165.ap-southeast-2.compute.internal:8020/apps/hbase/data/WALs/ip-172-31-25-148.ap-southeast-2.compute.internal,16020,1463025987708-splitting] installed = 1 but only 0 done
2016-05-13 03:56:53,501 ERROR [MASTER_SERVER_OPERATIONS-ip-172-31-21-148:16000-1] executor.EventHandler: Caught throwable while processing event M_SERVER_SHUTDOWN
java.io.IOException: failed log splitting for ip-172-31-25-148.ap-southeast-2.compute.internal,16020,1463025987708, will retry
    at org.apache.hadoop.hbase.master.handler.ServerShutdownHandler.resubmit(ServerShutdownHandler.java:357)
    at org.apache.hadoop.hbase.master.handler.ServerShutdownHandler.process(ServerShutdownHandler.java:220)
    at org.apache.hadoop.hbase.executor.EventHandler.run(EventHandler.java:128)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
    at java.lang.Thread.run(Thread.java:745)
Caused by: java.io.IOException: error or interrupted while splitting logs in [hdfs://ip-172-31-23-165.ap-southeast-2.compute.internal:8020/apps/hbase/data/WALs/ip-172-31-25-148.ap-southeast-2.compute.internal,16020,1463025987708-splitting] Task = installed = 1 done = 0 error = 0
    at org.apache.hadoop.hbase.master.SplitLogManager.splitLogDistributed(SplitLogManager.java:290)
    at org.apache.hadoop.hbase.master.MasterFileSystem.splitLog(MasterFileSystem.java:391)
    at org.apache.hadoop.hbase.master.MasterFileSystem.splitLog(MasterFileSystem.java:364)
    at org.apache.hadoop.hbase.master.MasterFileSystem.splitLog(MasterFileSystem.java:286)
    at org.apache.hadoop.hbase.master.handler.ServerShutdownHandler.process(ServerShutdownHandler.java:213)
    ... 4 more
2016-05-13 03:56:53,502 ERROR [MASTER_SERVER_OPERATIONS-ip-172-31-21-148:16000-3] executor.EventHandler: Caught throwable while processing event M_SERVER_SHUTDOWN
java.io.IOException: Server is stopped
    at org.apache.hadoop.hbase.master.handler.ServerShutdownHandler.process(ServerShutdownHandler.java:194)
    at org.apache.hadoop.hbase.executor.EventHandler.run(EventHandler.java:128)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
    at java.lang.Thread.run(Thread.java:745)
2016-05-13 03:56:53,556 INFO  [master/ip-172-31-21-148.ap-southeast-2.compute.internal/172.31.21.148:16000] procedure2.ProcedureExecutor: Stopping the procedure executor
2016-05-13 03:56:53,557 INFO  [master/ip-172-31-21-148.ap-southeast-2.compute.internal/172.31.21.148:16000] wal.WALProcedureStore: Stopping the WAL Procedure Store
2016-05-13 03:56:53,656 INFO  [ip-172-31-21-148.ap-southeast-2.compute.internal,16000,1463125847935_splitLogManager__ChoreService_1] master.SplitLogManager$TimeoutMonitor: Chore: SplitLogManager Timeout Monitor was stopped
2016-05-13 03:56:53,691 INFO  [master/ip-172-31-21-148.ap-southeast-2.compute.internal/172.31.21.148:16000] regionserver.HRegionServer: stopping server ip-172-31-21-148.ap-southeast-2.compute.internal,16000,1463125847935
2016-05-13 03:56:53,691 INFO  [master/ip-172-31-21-148.ap-southeast-2.compute.internal/172.31.21.148:16000] client.ConnectionManager$HConnectionImplementation: Closing zookeeper sessionid=0x154a916c5040001
2016-05-13 03:56:53,692 INFO  [master/ip-172-31-21-148.ap-southeast-2.compute.internal/172.31.21.148:16000] zookeeper.ZooKeeper: Session: 0x154a916c5040001 closed
2016-05-13 03:56:53,693 INFO  [master/ip-172-31-21-148.ap-southeast-2.compute.internal/172.31.21.148:16000-EventThread] zookeeper.ClientCnxn: EventThread shut down
2016-05-13 03:56:53,703 INFO  [master/ip-172-31-21-148.ap-southeast-2.compute.internal/172.31.21.148:16000] regionserver.HRegionServer: stopping server ip-172-31-21-148.ap-southeast-2.compute.internal,16000,1463125847935; all regions closed.
2016-05-13 03:56:53,703 INFO  [master/ip-172-31-21-148.ap-southeast-2.compute.internal/172.31.21.148:16000] hbase.ChoreService: Chore service for: ip-172-31-21-148.ap-southeast-2.compute.internal,16000,1463125847935 had [[ScheduledChore: Name: ip-172-31-21-148.ap-southeast-2.compute.internal,16000,1463125847935-BalancerChore Period: 300000 Unit: MILLISECONDS], [ScheduledChore: Name: ip-172-31-21-148.ap-southeast-2.compute.internal,16000,1463125847935-ClusterStatusChore Period: 60000 Unit: MILLISECONDS], [ScheduledChore: Name: ip-172-31-21-148.ap-southeast-2.compute.internal,16000,1463125847935-RegionNormalizerChore Period: 1800000 Unit: MILLISECONDS], [ScheduledChore: Name: HFileCleaner Period: 60000 Unit: MILLISECONDS], [ScheduledChore: Name: LogsCleaner Period: 60000 Unit: MILLISECONDS], [ScheduledChore: Name: CatalogJanitor-ip-172-31-21-148:16000 Period: 300000 Unit: MILLISECONDS]] on shutdown
2016-05-13 03:56:53,706 INFO  [master/ip-172-31-21-148.ap-southeast-2.compute.internal/172.31.21.148:16000] client.ConnectionManager$HConnectionImplementation: Closing zookeeper sessionid=0x154a916c5040002
2016-05-13 03:56:53,707 INFO  [master/ip-172-31-21-148.ap-southeast-2.compute.internal/172.31.21.148:16000] zookeeper.ZooKeeper: Session: 0x154a916c5040002 closed
2016-05-13 03:56:53,707 INFO  [master/ip-172-31-21-148.ap-southeast-2.compute.internal/172.31.21.148:16000] hbase.ChoreService: Chore service for: ip-172-31-21-148.ap-southeast-2.compute.internal,16000,1463125847935_splitLogManager_ had [] on shutdown
2016-05-13 03:56:53,707 INFO  [master/ip-172-31-21-148.ap-southeast-2.compute.internal/172.31.21.148:16000] flush.MasterFlushTableProcedureManager: stop: server shutting down.
2016-05-13 03:56:53,707 INFO  [master/ip-172-31-21-148.ap-southeast-2.compute.internal/172.31.21.148:16000] ipc.RpcServer: Stopping server on 16000
2016-05-13 03:56:53,707 INFO  [ip-172-31-21-148:16000.activeMasterManager-EventThread] zookeeper.ClientCnxn: EventThread shut down
2016-05-13 03:56:53,716 INFO  [master/ip-172-31-21-148.ap-southeast-2.compute.internal/172.31.21.148:16000] zookeeper.RecoverableZooKeeper: Node /hbase-unsecure/rs/ip-172-31-21-148.ap-southeast-2.compute.internal,16000,1463125847935 already deleted, retry=false
2016-05-13 03:56:53,718 INFO  [master/ip-172-31-21-148.ap-southeast-2.compute.internal/172.31.21.148:16000] zookeeper.ZooKeeper: Session: 0x154a916c5040000 closed
2016-05-13 03:56:53,718 INFO  [master/ip-172-31-21-148.ap-southeast-2.compute.internal/172.31.21.148:16000] regionserver.HRegionServer: stopping server ip-172-31-21-148.ap-southeast-2.compute.internal,16000,1463125847935; zookeeper connection closed.
2016-05-13 03:56:53,718 INFO  [master/ip-172-31-21-148.ap-southeast-2.compute.internal/172.31.21.148:16000] regionserver.HRegionServer: master/ip-172-31-21-148.ap-southeast-2.compute.internal/172.31.21.148:16000 exiting
2016-05-13 03:56:53,718 INFO  [main-EventThread] zookeeper.ClientCnxn: EventThread shut down
1 ACCEPTED SOLUTION

Super Guru

This seems to be an issue with stale WAL splits. Try removing the WALs from the HDFS location below, then restart the HBase master and RegionServer services.

/apps/hbase/data/WALs/
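
For reference, a minimal sketch of how that cleanup could be done from the command line, assuming the default HBase root dir /apps/hbase/data and that the commands are run as the hbase user. The backup directory /tmp/hbase-wal-backup is just an example; moving the stale -splitting directories aside is safer than deleting them outright:

# List the WAL split directories (the -splitting one from the log above should show up here)
sudo -u hbase hdfs dfs -ls /apps/hbase/data/WALs
# Move the stale -splitting directories to a backup location instead of deleting them
sudo -u hbase hdfs dfs -mkdir -p /tmp/hbase-wal-backup
sudo -u hbase hdfs dfs -mv '/apps/hbase/data/WALs/*-splitting' /tmp/hbase-wal-backup/

Once the stale directories are out of the way, restart the HBase Master and RegionServers from Ambari and check whether the namespace table gets assigned.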


12 REPLIES

Explorer

In our experience, this seems to be related to https://issues.apache.org/jira/browse/HBASE-14729, which is reported as solved in HBase 2.0.0 (we are running 1.1.2 with HDP 2.5).

Super Guru

Please re-read the description on HBASE-14729. There were no code changes made by that JIRA issue; it was closed as a duplicate of https://issues.apache.org/jira/browse/HBASE-14223, which is still outstanding.

New Contributor

So what's the best way to resolve this issue in production?