
HBase master fails to start

Contributor

Hello,

The HBase master won't start and keeps spitting out the error below.

2016-05-13 03:56:53,431 FATAL [ip-172-31-21-148:16000.activeMasterManager] master.HMaster: Failed to become active master java.io.IOException: Timedout 300000ms waiting for namespace table to be assigned

It appears to start okay in the Ambari view, but then it shows as stopped after about 3 minutes.

FYI, it's a vanilla HDP-2.4.0.0-169 install plus some configuration changes to disks and HDFS, but I haven't created anything in HBase.

Attached is a screenshot of the HBase view in Ambari.

4224-screen-shot-2016-05-13-at-61503-pm.png

Below is the log from around the time of the failure:

/var/log/hbase/hbase-hbase-master-serverhost.log

Could you please advise where I should look to get it started?

Thanks,

2016-05-13 03:56:41,651 INFO  [ip-172-31-21-148.ap-southeast-2.compute.internal,16000,1463125847935_splitLogManager__ChoreService_1] master.SplitLogManager: total tasks = 1 unassigned = 0 tasks={/hbase-unsecure/splitWAL/WALs%2Fip-172-31-25-148.ap-southeast-2.compute.internal%2C16020%2C1463025987708-splitting%2Fip-172-31-25-148.ap-southeast-2.compute.internal%252C16020%252C1463025987708.default.1463030679002=last_update = 1463126191938 last_version = 399 cur_worker_name = ip-172-31-28-128.ap-southeast-2.compute.internal,16020,1463125906252 status = in_progress incarnation = 1 resubmits = 1 batch = installed = 1 done = 0 error = 0}
2016-05-13 03:56:47,651 INFO  [ip-172-31-21-148.ap-southeast-2.compute.internal,16000,1463125847935_splitLogManager__ChoreService_1] master.SplitLogManager: total tasks = 1 unassigned = 0 tasks={/hbase-unsecure/splitWAL/WALs%2Fip-172-31-25-148.ap-southeast-2.compute.internal%2C16020%2C1463025987708-splitting%2Fip-172-31-25-148.ap-southeast-2.compute.internal%252C16020%252C1463025987708.default.1463030679002=last_update = 1463126191938 last_version = 399 cur_worker_name = ip-172-31-28-128.ap-southeast-2.compute.internal,16020,1463125906252 status = in_progress incarnation = 1 resubmits = 1 batch = installed = 1 done = 0 error = 0}
2016-05-13 03:56:53,431 FATAL [ip-172-31-21-148:16000.activeMasterManager] master.HMaster: Failed to become active master
java.io.IOException: Timedout 300000ms waiting for namespace table to be assigned
    at org.apache.hadoop.hbase.master.TableNamespaceManager.start(TableNamespaceManager.java:104)
    at org.apache.hadoop.hbase.master.HMaster.initNamespace(HMaster.java:1005)
    at org.apache.hadoop.hbase.master.HMaster.finishActiveMasterInitialization(HMaster.java:799)
    at org.apache.hadoop.hbase.master.HMaster.access$500(HMaster.java:191)
    at org.apache.hadoop.hbase.master.HMaster$1.run(HMaster.java:1783)
    at java.lang.Thread.run(Thread.java:745)
2016-05-13 03:56:53,433 FATAL [ip-172-31-21-148:16000.activeMasterManager] master.HMaster: Master server abort: loaded coprocessors are: []
2016-05-13 03:56:53,433 FATAL [ip-172-31-21-148:16000.activeMasterManager] master.HMaster: Unhandled exception. Starting shutdown.
java.io.IOException: Timedout 300000ms waiting for namespace table to be assigned
    at org.apache.hadoop.hbase.master.TableNamespaceManager.start(TableNamespaceManager.java:104)
    at org.apache.hadoop.hbase.master.HMaster.initNamespace(HMaster.java:1005)
    at org.apache.hadoop.hbase.master.HMaster.finishActiveMasterInitialization(HMaster.java:799)
    at org.apache.hadoop.hbase.master.HMaster.access$500(HMaster.java:191)
    at org.apache.hadoop.hbase.master.HMaster$1.run(HMaster.java:1783)
    at java.lang.Thread.run(Thread.java:745)
2016-05-13 03:56:53,433 INFO  [ip-172-31-21-148:16000.activeMasterManager] regionserver.HRegionServer: STOPPED: Unhandled exception. Starting shutdown.
2016-05-13 03:56:53,433 INFO  [master/ip-172-31-21-148.ap-southeast-2.compute.internal/172.31.21.148:16000] regionserver.HRegionServer: Stopping infoServer
2016-05-13 03:56:53,456 INFO  [master/ip-172-31-21-148.ap-southeast-2.compute.internal/172.31.21.148:16000] mortbay.log: Stopped SelectChannelConnector@0.0.0.0:16010
2016-05-13 03:56:53,500 WARN  [MASTER_SERVER_OPERATIONS-ip-172-31-21-148:16000-1] master.SplitLogManager: Stopped while waiting for log splits to be completed
2016-05-13 03:56:53,500 WARN  [MASTER_SERVER_OPERATIONS-ip-172-31-21-148:16000-1] master.SplitLogManager: error while splitting logs in [hdfs://ip-172-31-23-165.ap-southeast-2.compute.internal:8020/apps/hbase/data/WALs/ip-172-31-25-148.ap-southeast-2.compute.internal,16020,1463025987708-splitting] installed = 1 but only 0 done
2016-05-13 03:56:53,501 ERROR [MASTER_SERVER_OPERATIONS-ip-172-31-21-148:16000-1] executor.EventHandler: Caught throwable while processing event M_SERVER_SHUTDOWN
java.io.IOException: failed log splitting for ip-172-31-25-148.ap-southeast-2.compute.internal,16020,1463025987708, will retry
    at org.apache.hadoop.hbase.master.handler.ServerShutdownHandler.resubmit(ServerShutdownHandler.java:357)
    at org.apache.hadoop.hbase.master.handler.ServerShutdownHandler.process(ServerShutdownHandler.java:220)
    at org.apache.hadoop.hbase.executor.EventHandler.run(EventHandler.java:128)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
    at java.lang.Thread.run(Thread.java:745)
Caused by: java.io.IOException: error or interrupted while splitting logs in [hdfs://ip-172-31-23-165.ap-southeast-2.compute.internal:8020/apps/hbase/data/WALs/ip-172-31-25-148.ap-southeast-2.compute.internal,16020,1463025987708-splitting] Task = installed = 1 done = 0 error = 0
    at org.apache.hadoop.hbase.master.SplitLogManager.splitLogDistributed(SplitLogManager.java:290)
    at org.apache.hadoop.hbase.master.MasterFileSystem.splitLog(MasterFileSystem.java:391)
    at org.apache.hadoop.hbase.master.MasterFileSystem.splitLog(MasterFileSystem.java:364)
    at org.apache.hadoop.hbase.master.MasterFileSystem.splitLog(MasterFileSystem.java:286)
    at org.apache.hadoop.hbase.master.handler.ServerShutdownHandler.process(ServerShutdownHandler.java:213)
    ... 4 more
2016-05-13 03:56:53,502 ERROR [MASTER_SERVER_OPERATIONS-ip-172-31-21-148:16000-3] executor.EventHandler: Caught throwable while processing event M_SERVER_SHUTDOWN
java.io.IOException: Server is stopped
    at org.apache.hadoop.hbase.master.handler.ServerShutdownHandler.process(ServerShutdownHandler.java:194)
    at org.apache.hadoop.hbase.executor.EventHandler.run(EventHandler.java:128)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
    at java.lang.Thread.run(Thread.java:745)
2016-05-13 03:56:53,556 INFO  [master/ip-172-31-21-148.ap-southeast-2.compute.internal/172.31.21.148:16000] procedure2.ProcedureExecutor: Stopping the procedure executor
2016-05-13 03:56:53,557 INFO  [master/ip-172-31-21-148.ap-southeast-2.compute.internal/172.31.21.148:16000] wal.WALProcedureStore: Stopping the WAL Procedure Store
2016-05-13 03:56:53,656 INFO  [ip-172-31-21-148.ap-southeast-2.compute.internal,16000,1463125847935_splitLogManager__ChoreService_1] master.SplitLogManager$TimeoutMonitor: Chore: SplitLogManager Timeout Monitor was stopped
2016-05-13 03:56:53,691 INFO  [master/ip-172-31-21-148.ap-southeast-2.compute.internal/172.31.21.148:16000] regionserver.HRegionServer: stopping server ip-172-31-21-148.ap-southeast-2.compute.internal,16000,1463125847935
2016-05-13 03:56:53,691 INFO  [master/ip-172-31-21-148.ap-southeast-2.compute.internal/172.31.21.148:16000] client.ConnectionManager$HConnectionImplementation: Closing zookeeper sessionid=0x154a916c5040001
2016-05-13 03:56:53,692 INFO  [master/ip-172-31-21-148.ap-southeast-2.compute.internal/172.31.21.148:16000] zookeeper.ZooKeeper: Session: 0x154a916c5040001 closed
2016-05-13 03:56:53,693 INFO  [master/ip-172-31-21-148.ap-southeast-2.compute.internal/172.31.21.148:16000-EventThread] zookeeper.ClientCnxn: EventThread shut down
2016-05-13 03:56:53,703 INFO  [master/ip-172-31-21-148.ap-southeast-2.compute.internal/172.31.21.148:16000] regionserver.HRegionServer: stopping server ip-172-31-21-148.ap-southeast-2.compute.internal,16000,1463125847935; all regions closed.
2016-05-13 03:56:53,703 INFO  [master/ip-172-31-21-148.ap-southeast-2.compute.internal/172.31.21.148:16000] hbase.ChoreService: Chore service for: ip-172-31-21-148.ap-southeast-2.compute.internal,16000,1463125847935 had [[ScheduledChore: Name: ip-172-31-21-148.ap-southeast-2.compute.internal,16000,1463125847935-BalancerChore Period: 300000 Unit: MILLISECONDS], [ScheduledChore: Name: ip-172-31-21-148.ap-southeast-2.compute.internal,16000,1463125847935-ClusterStatusChore Period: 60000 Unit: MILLISECONDS], [ScheduledChore: Name: ip-172-31-21-148.ap-southeast-2.compute.internal,16000,1463125847935-RegionNormalizerChore Period: 1800000 Unit: MILLISECONDS], [ScheduledChore: Name: HFileCleaner Period: 60000 Unit: MILLISECONDS], [ScheduledChore: Name: LogsCleaner Period: 60000 Unit: MILLISECONDS], [ScheduledChore: Name: CatalogJanitor-ip-172-31-21-148:16000 Period: 300000 Unit: MILLISECONDS]] on shutdown
2016-05-13 03:56:53,706 INFO  [master/ip-172-31-21-148.ap-southeast-2.compute.internal/172.31.21.148:16000] client.ConnectionManager$HConnectionImplementation: Closing zookeeper sessionid=0x154a916c5040002
2016-05-13 03:56:53,707 INFO  [master/ip-172-31-21-148.ap-southeast-2.compute.internal/172.31.21.148:16000] zookeeper.ZooKeeper: Session: 0x154a916c5040002 closed
2016-05-13 03:56:53,707 INFO  [master/ip-172-31-21-148.ap-southeast-2.compute.internal/172.31.21.148:16000] hbase.ChoreService: Chore service for: ip-172-31-21-148.ap-southeast-2.compute.internal,16000,1463125847935_splitLogManager_ had [] on shutdown
2016-05-13 03:56:53,707 INFO  [master/ip-172-31-21-148.ap-southeast-2.compute.internal/172.31.21.148:16000] flush.MasterFlushTableProcedureManager: stop: server shutting down.
2016-05-13 03:56:53,707 INFO  [master/ip-172-31-21-148.ap-southeast-2.compute.internal/172.31.21.148:16000] ipc.RpcServer: Stopping server on 16000
2016-05-13 03:56:53,707 INFO  [ip-172-31-21-148:16000.activeMasterManager-EventThread] zookeeper.ClientCnxn: EventThread shut down
2016-05-13 03:56:53,716 INFO  [master/ip-172-31-21-148.ap-southeast-2.compute.internal/172.31.21.148:16000] zookeeper.RecoverableZooKeeper: Node /hbase-unsecure/rs/ip-172-31-21-148.ap-southeast-2.compute.internal,16000,1463125847935 already deleted, retry=false
2016-05-13 03:56:53,718 INFO  [master/ip-172-31-21-148.ap-southeast-2.compute.internal/172.31.21.148:16000] zookeeper.ZooKeeper: Session: 0x154a916c5040000 closed
2016-05-13 03:56:53,718 INFO  [master/ip-172-31-21-148.ap-southeast-2.compute.internal/172.31.21.148:16000] regionserver.HRegionServer: stopping server ip-172-31-21-148.ap-southeast-2.compute.internal,16000,1463125847935; zookeeper connection closed.
2016-05-13 03:56:53,718 INFO  [master/ip-172-31-21-148.ap-southeast-2.compute.internal/172.31.21.148:16000] regionserver.HRegionServer: master/ip-172-31-21-148.ap-southeast-2.compute.internal/172.31.21.148:16000 exiting
2016-05-13 03:56:53,718 INFO  [main-EventThread] zookeeper.ClientCnxn: EventThread shut down
1 ACCEPTED SOLUTION

Super Guru

This seems to be an issue with stale WAL splits. Try removing the WALs from the HDFS location below, then restart the HBase master and region server services.

/apps/hbase/data/WALs/
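
In case it helps, here is a rough sketch of the steps (assuming the default HDP root directory shown above; the backup location /tmp/hbase-wals-backup is just an example name). Moving the WALs aside instead of deleting them outright keeps them restorable:

# 1. Stop the HBase master and region servers (e.g. from Ambari)
# 2. Move the stale WAL directory aside in HDFS, running as the hbase user
sudo -u hbase hdfs dfs -mv /apps/hbase/data/WALs /tmp/hbase-wals-backup
# 3. Start the HBase master and region servers again; the WALs directory is recreated on startup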


12 REPLIES


Super Guru

@WONHEE CHOI

Hi, please accept this as the answer if it resolved the issue.

Contributor

This did the trick.

After removing the WAL files and restarting the HBase components, the HBase master doesn't go down anymore.

I took some time to double-check everything before accepting the answer.

Thank you so much.

Super Guru

Please be aware that removing the write-ahead logs causes data loss. Removing them is almost never an ideal solution (unless there is no data in HBase).
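
If you are not sure whether there is anything worth preserving, a quick sanity check first (a sketch, assuming the default HDP root directory /apps/hbase/data and the usual HBase 1.x data/<namespace>/<table> layout):

# size of table data per namespace under the HBase root dir
hdfs dfs -du -h /apps/hbase/data/data
# size of the write-ahead logs that would be thrown away
hdfs dfs -du -h /apps/hbase/data/WALs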

Contributor

Thanks, Josh.

In the above case there was no data to preserve, but I will keep in mind the data loss caused by removing WAL files.

I still don't understand why this happened in the first place, though. Could unmounting the HDFS data directory and mounting it to a new directory, or a forceful shutdown of the cluster hosts, cause this?

I believe this could be investigated separately.

Much appreciated.

Rising Star

We got the same issue. I removed the WAL files and restarted the cluster, but the issue persists. What might be the reason?

Explorer

It solved my problem. Thanks a lot, Jitendra Yadav.

Explorer

I am concerned that the accepted answer includes advice to delete data. In our case, the master node needed a few restarts before it settled on which WALEdits were still in conflict and stopped getting TimeoutExceptions back from the data nodes. Eventually, it resolved all the outstanding edits and went back to its normal mode of operation.
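
For anyone who would rather wait the recovery out, a rough way to watch it (assuming the unsecured znode parent /hbase-unsecure that appears in the log above) is to keep an eye on the outstanding distributed split tasks while the master retries:

# list the WAL-split tasks still registered in ZooKeeper
hbase zkcli ls /hbase-unsecure/splitWAL
# and follow the SplitLogManager progress lines in the master log
tail -f /var/log/hbase/hbase-hbase-master-serverhost.log | grep SplitLogManager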

Super Guru

Your concern is very valid, Steen. Oftentimes, the root problem is latency while the cluster recovers from a partially failed state. The system can often recover from this on its own, as you have noticed. Deleting the HBase data should *only* ever be done when the cluster is in an empty state; it is not an acceptable answer for "real" systems.