Hbase master and region servers crashing randomly

Explorer

I am seeing our HBase master and region servers crash, seemingly at random, across several different clusters. Here are the logs from one of the region servers and from the HBase master. I'm not sure whether this is a ZooKeeper issue, a host issue, or a network issue.

Region Servers

2017-01-15 11:32:57,186 ERROR [regionserver60020] zookeeper.RecoverableZooKeeper: ZooKeeper getChildren failed after 4 attempts
2017-01-15 11:33:12,190 ERROR [regionserver60020] zookeeper.RecoverableZooKeeper: ZooKeeper delete failed after 4 attempts
2017-01-15 11:33:12,217 ERROR [main] regionserver.HRegionServerCommandLine: Region server exiting

Hbase Master

2017-01-15 11:32:39,672 ERROR [FifoRpcScheduler.handler1-thread-2] master.HMaster: Region server node045.phibred.com,60020,1483668825525 reported a fatal error:
2017-01-15 11:32:39,715 ERROR [FifoRpcScheduler.handler1-thread-3] master.HMaster: Region server node043.phibred.com,60020,1483256057979 reported a fatal error:
2017-01-15 11:32:39,725 ERROR [FifoRpcScheduler.handler1-thread-6] master.HMaster: Region server node039.phibred.com,60020,1483256059362 reported a fatal error:
2017-01-15 11:32:39,726 ERROR [FifoRpcScheduler.handler1-thread-8] master.HMaster: Region server node038.phibred.com,60020,1483256057918 reported a fatal error:
2017-01-15 11:32:39,728 ERROR [FifoRpcScheduler.handler1-thread-9] master.HMaster: Region server node042.phibred.com,60020,1483668821289 reported a fatal error:
2017-01-15 11:32:39,729 ERROR [FifoRpcScheduler.handler1-thread-7] master.HMaster: Region server node041.phibred.com,60020,1483256065224 reported a fatal error:
2017-01-15 11:32:39,776 ERROR [FifoRpcScheduler.handler1-thread-4] master.HMaster: Region server node044.phibred.com,60020,1483668827139 reported a fatal error:
2017-01-15 11:32:39,782 ERROR [FifoRpcScheduler.handler1-thread-1] master.HMaster: Region server node040.phibred.com,60020,1483405735247 reported a fatal error:
2017-01-15 11:37:40,272 ERROR [main] master.HMasterCommandLine: Master exiting

3 REPLIES

Re: Hbase master and region servers crashing randomly

New Contributor

@Alex Eifler, check your ZooKeeper logs around 2017-01-15 11:33:12.

There may be information there about why the RegionServer could not complete its call to ZooKeeper.

Re: Hbase master and region servers crashing randomly

Explorer

The ZooKeeper logs show a large number of session timeouts at that same time.

ZooKeeper

2017-01-15 11:32:18,000 - INFO [SessionTracker:ZooKeeperServer@325] - Expiring session 0x3595d209c6f0086, timeout of 30000ms exceeded
2017-01-15 11:32:39,652 - INFO [SessionTracker:ZooKeeperServer@325] - Expiring session 0x359711fd88d005f, timeout of 30000ms exceeded
2017-01-15 11:32:39,652 - INFO [SessionTracker:ZooKeeperServer@325] - Expiring session 0x15970e9e7b10001, timeout of 30000ms exceeded
2017-01-15 11:32:39,652 - INFO [SessionTracker:ZooKeeperServer@325] - Expiring session 0x359711fd88d0060, timeout of 30000ms exceeded
2017-01-15 11:32:39,652 - INFO [SessionTracker:ZooKeeperServer@325] - Expiring session 0x1595564dd770057, timeout of 30000ms exceeded
2017-01-15 11:32:39,652 - INFO [SessionTracker:ZooKeeperServer@325] - Expiring session 0x25970e9e7b10001, timeout of 30000ms exceeded
2017-01-15 11:32:39,653 - INFO [SessionTracker:ZooKeeperServer@325] - Expiring session 0x15970e9e7b10002, timeout of 30000ms exceeded
2017-01-15 11:32:39,653 - INFO [SessionTracker:ZooKeeperServer@325] - Expiring session 0x359711fd88d0061, timeout of 30000ms exceeded
2017-01-15 11:32:39,653 - INFO [SessionTracker:ZooKeeperServer@325] - Expiring session 0x259711fd8830097, timeout of 30000ms exceeded
2017-01-15 11:32:39,653 - INFO [SessionTracker:ZooKeeperServer@325] - Expiring session 0x35970e9e7b80000, timeout of 30000ms exceeded
2017-01-15 11:32:39,653 - INFO [SessionTracker:ZooKeeperServer@325] - Expiring session 0x3595d209c6f0087, timeout of 30000ms exceeded
2017-01-15 11:32:39,653 - INFO [SessionTracker:ZooKeeperServer@325] - Expiring session 0x259711fd8830099, timeout of 30000ms exceeded
2017-01-15 11:32:39,653 - INFO [SessionTracker:ZooKeeperServer@325] - Expiring session 0x35956e33ef80050, timeout of 30000ms exceeded
2017-01-15 11:32:39,653 - INFO [SessionTracker:ZooKeeperServer@325] - Expiring session 0x35970e9e7b8000c, timeout of 30000ms exceeded
2017-01-15 11:32:39,653 - INFO [SessionTracker:ZooKeeperServer@325] - Expiring session 0x1595564dd770051, timeout of 30000ms exceeded
2017-01-15 11:32:39,653 - INFO [SessionTracker:ZooKeeperServer@325] - Expiring session 0x259711fd8830098, timeout of 30000ms exceeded
2017-01-15 11:32:39,653 - INFO [SessionTracker:ZooKeeperServer@325] - Expiring session 0x259711fd883009a, timeout of 30000ms exceeded
2017-01-15 11:32:39,653 - INFO [SessionTracker:ZooKeeperServer@325] - Expiring session 0x2595d209c690091, timeout of 30000ms exceeded
2017-01-15 11:32:39,653 - INFO [SessionTracker:ZooKeeperServer@325] - Expiring session 0x359711fd88d2062, timeout of 30000ms exceeded
2017-01-15 11:32:39,653 - INFO [SessionTracker:ZooKeeperServer@325] - Expiring session 0x159711fd8830070, timeout of 30000ms exceeded
2017-01-15 11:32:39,653 - INFO [SessionTracker:ZooKeeperServer@325] - Expiring session 0x159711fd883006f, timeout of 30000ms exceeded
2017-01-15 11:32:39,653 - INFO [SessionTracker:ZooKeeperServer@325] - Expiring session 0x35970e9e7b80001, timeout of 30000ms exceeded
2017-01-15 11:32:39,653 - INFO [SessionTracker:ZooKeeperServer@325] - Expiring session 0x1595564dd770054, timeout of 30000ms exceeded
2017-01-15 11:32:39,654 - INFO [SessionTracker:ZooKeeperServer@325] - Expiring session 0x2595564dd7a0055, timeout of 30000ms exceeded
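To see how tightly these expirations cluster, the log can be tallied per second. What follows is only a rough sketch in Python, assuming entries in the exact SessionTracker format shown above; the function name `expirations_per_second` and the `sample` string (three entries copied from the excerpt) are illustrative, not part of any ZooKeeper tooling. The pattern scans the whole text rather than individual lines, so it also copes with entries that were pasted onto one line.

```python
import re
from collections import Counter

# Matches SessionTracker expiry entries in the format shown in the excerpt above.
# \s+ is used between tokens so that entries wrapped across lines still match.
EXPIRY = re.compile(
    r"(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}),\d{3} - INFO\s+"
    r"\[SessionTracker:ZooKeeperServer@\d+\]\s+-\s+"
    r"Expiring session (0x[0-9a-fA-F]+), timeout of\s+(\d+)ms exceeded"
)


def expirations_per_second(log_text):
    """Return a Counter mapping 'YYYY-MM-DD HH:MM:SS' to expired-session count."""
    return Counter(match.group(1) for match in EXPIRY.finditer(log_text))


# Three entries copied from the excerpt above, joined onto one line.
sample = (
    "2017-01-15 11:32:18,000 - INFO [SessionTracker:ZooKeeperServer@325] - "
    "Expiring session 0x3595d209c6f0086, timeout of 30000ms exceeded "
    "2017-01-15 11:32:39,652 - INFO [SessionTracker:ZooKeeperServer@325] - "
    "Expiring session 0x359711fd88d005f, timeout of 30000ms exceeded "
    "2017-01-15 11:32:39,653 - INFO [SessionTracker:ZooKeeperServer@325] - "
    "Expiring session 0x15970e9e7b10002, timeout of 30000ms exceeded"
)

for second, count in sorted(expirations_per_second(sample).items()):
    print(second, count)
```

A burst of expirations within a single second, as at 11:32:39 here, suggests that many clients lost their sessions together rather than one at a time, which lines up with the master log reporting fatal errors from the region servers in that same second.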

Re: Hbase master and region servers crashing randomly

Look at the JVM GC logs for your HBase services. If you are seeing stop-the-world pauses longer than the ZooKeeper session timeout (30000 ms in the log above), the process cannot send heartbeats during the pause, and its ZooKeeper session expires.

It could also be ZooKeeper's built-in rate-limiting: https://community.hortonworks.com/articles/51191/understanding-apache-zookeeper-connection-rate-lim....
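One way to check the GC-pause theory is to scan the GC log for long safepoint pauses. The sketch below is in Python and rests on an assumption: that the HBase JVMs run Java 7 or 8 with `-XX:+PrintGCApplicationStoppedTime` enabled, which writes "Total time for which application threads were stopped" lines. Other JVM versions and logging flags produce different formats. The helper name `long_pauses` and the two sample lines are made up for illustration and do not come from this cluster.

```python
import re

# Assumed Java 7/8 safepoint line, e.g. (with -XX:+PrintGCDateStamps):
# 2017-01-15T11:32:05.123-0600: 1234.567: Total time for which application
# threads were stopped: 34.5678901 seconds, Stopping threads took: ...
STOPPED = re.compile(
    r"^(?P<stamp>\S+): .*?"
    r"Total time for which application threads were stopped: "
    r"(?P<secs>\d+(?:\.\d+)?) seconds"
)


def long_pauses(lines, threshold_s):
    """Yield (stamp, seconds) for each pause of at least threshold_s seconds."""
    for line in lines:
        match = STOPPED.match(line)
        if match:
            seconds = float(match.group("secs"))
            if seconds >= threshold_s:
                yield match.group("stamp"), seconds


# Hypothetical sample lines, invented purely to exercise the parser.
sample_lines = [
    "2017-01-15T11:31:50.001-0600: 1219.445: Total time for which application "
    "threads were stopped: 0.0004321 seconds, Stopping threads took: 0.0000456 seconds",
    "2017-01-15T11:32:05.123-0600: 1234.567: Total time for which application "
    "threads were stopped: 34.5678901 seconds, Stopping threads took: 0.0001234 seconds",
]

# Flag anything well below the 30 s session timeout seen in the ZooKeeper log.
for stamp, seconds in long_pauses(sample_lines, threshold_s=10.0):
    print(stamp, seconds)
```

In practice, you would read the lines from the actual GC log file for each region server and the master. The 10-second threshold is an arbitrary choice; any pause approaching the 30000 ms session timeout is worth investigating.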
