Support Questions
Find answers, ask questions, and share your expertise

Why region server suddenly shutdown?


New Contributor

Hi everyone,

I am new to HDP and doing research with HDFS and HBase.

Today a region server suddenly shut down. Could you please help me? The log is below:

2016-10-01 19:14:38,975 INFO  [regionserver60020] regionserver.HRegionServer: stopping server rs14.srv.com,60020,1469499256438; all regions closed.
2016-10-01 19:14:38,976 DEBUG [regionserver60020-WAL.AsyncNotifier] wal.FSHLog: regionserver60020-WAL.AsyncNotifier interrupted while waiting for  notification from AsyncSyncer thread
2016-10-01 19:14:38,976 INFO  [regionserver60020-WAL.AsyncNotifier] wal.FSHLog: regionserver60020-WAL.AsyncNotifier exiting
2016-10-01 19:14:38,976 DEBUG [regionserver60020-WAL.AsyncSyncer0] wal.FSHLog: regionserver60020-WAL.AsyncSyncer0 interrupted while waiting for notification from AsyncWriter thread
2016-10-01 19:14:38,976 INFO  [regionserver60020-WAL.AsyncSyncer0] wal.FSHLog: regionserver60020-WAL.AsyncSyncer0 exiting
2016-10-01 19:14:38,976 DEBUG [regionserver60020-WAL.AsyncSyncer1] wal.FSHLog: regionserver60020-WAL.AsyncSyncer1 interrupted while waiting for notification from AsyncWriter thread
2016-10-01 19:14:38,976 INFO  [regionserver60020-WAL.AsyncSyncer1] wal.FSHLog: regionserver60020-WAL.AsyncSyncer1 exiting
2016-10-01 19:14:38,977 DEBUG [regionserver60020-WAL.AsyncSyncer2] wal.FSHLog: regionserver60020-WAL.AsyncSyncer2 interrupted while waiting for notification from AsyncWriter thread
2016-10-01 19:14:38,977 INFO  [regionserver60020-WAL.AsyncSyncer2] wal.FSHLog: regionserver60020-WAL.AsyncSyncer2 exiting
2016-10-01 19:14:38,977 DEBUG [regionserver60020-WAL.AsyncSyncer3] wal.FSHLog: regionserver60020-WAL.AsyncSyncer3 interrupted while waiting for notification from AsyncWriter thread
2016-10-01 19:14:38,977 INFO  [regionserver60020-WAL.AsyncSyncer3] wal.FSHLog: regionserver60020-WAL.AsyncSyncer3 exiting
2016-10-01 19:14:38,977 DEBUG [regionserver60020-WAL.AsyncSyncer4] wal.FSHLog: regionserver60020-WAL.AsyncSyncer4 interrupted while waiting for notification from AsyncWriter thread
2016-10-01 19:14:38,977 INFO  [regionserver60020-WAL.AsyncSyncer4] wal.FSHLog: regionserver60020-WAL.AsyncSyncer4 exiting
2016-10-01 19:14:38,977 DEBUG [regionserver60020-WAL.AsyncWriter] wal.FSHLog: regionserver60020-WAL.AsyncWriter interrupted while waiting for newer writes added to local buffer
2016-10-01 19:14:38,977 INFO  [regionserver60020-WAL.AsyncWriter] wal.FSHLog: regionserver60020-WAL.AsyncWriter exiting
2016-10-01 19:14:38,977 DEBUG [regionserver60020] wal.FSHLog: Closing WAL writer in hdfs://srvcluster/apps/hbase/data/WALs/rs14.srv.com,60020,1469499256438
2016-10-01 19:14:38,977 ERROR [regionserver60020] wal.ProtobufLogWriter: Got IOException while writing trailer
org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.hdfs.server.namenode.LeaseExpiredException): No lease on /apps/hbase/data/oldWALs/rs14.srv.com%2C60020%2C1469499256438.1475321424271 (inode 386054527): File is not open for writing. Holder DFSClient_hb_rs_dn14.srv.com,60020,1469499256438_550076685_33 does not have any open files.
        at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.checkLease(FSNamesystem.java:3778)
        at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getAdditionalDatanode(FSNamesystem.java:3678)
        ............
2016-10-01 19:14:39,008 ERROR [regionserver60020] regionserver.HRegionServer: Close and delete failed
org.apache.hadoop.hdfs.server.namenode.LeaseExpiredException: No lease on /apps/hbase/data/oldWALs/rs14.srv.com%2C60020%2C1469499256438.1475321424271 (inode 386054527): File is not open for writing. Holder DFSClient_hb_rs_rs14.srv.com,60020,1469499256438_550076685_33 does not have any open files.
        at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.checkLease(FSNamesystem.java:3778)
        at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getAdditionalDatanode(FSNamesystem.java:3678)
        at org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.getAdditionalDatanode(NameNodeRpcServer.java:675)
        .....
2016-10-01 19:14:39,009 DEBUG [regionserver60020] ipc.RpcClient: Stopping rpc client
2016-10-01 19:14:39,109 INFO  [regionserver60020] regionserver.Leases: regionserver60020 closing leases
2016-10-01 19:14:39,114 INFO  [regionserver60020] regionserver.Leases: regionserver60020 closed leases
2016-10-01 19:14:39,114 INFO  [regionserver60020] regionserver.CompactSplitThread: Waiting for Split Thread to finish...
2016-10-01 19:14:39,114 INFO  [regionserver60020] regionserver.CompactSplitThread: Waiting for Merge Thread to finish...
2016-10-01 19:14:39,114 INFO  [regionserver60020] regionserver.CompactSplitThread: Waiting for Large Compaction Thread to finish...
2016-10-01 19:14:39,114 INFO  [regionserver60020] regionserver.CompactSplitThread: Waiting for Small Compaction Thread to finish...
2016-10-01 19:14:39,115 WARN  [regionserver60020] zookeeper.RecoverableZooKeeper: Possibly transient ZooKeeper, quorum=zk2.srv.com:2181,zk1.srv.com:2181,zk3.srv.com:2181, exception=org.apache.zookeeper.KeeperException$SessionExpiredException: KeeperErrorCode = Session expired for /hbase-unsecure/replication/rs/rs14.srv.com,60020,1469499256438
2016-10-01 19:14:40,116 WARN  [regionserver60020] zookeeper.RecoverableZooKeeper: Possibly transient ZooKeeper, quorum=zk2.srv.com:2181,zk1.srv.com:2181,zk3.srv.com:2181, exception=org.apache.zookeeper.KeeperException$SessionExpiredException: KeeperErrorCode = Session expired for /hbase-unsecure/replication/rs/rs14.srv.com,60020,1469499256438
2016-10-01 19:14:42,116 WARN  [regionserver60020] zookeeper.RecoverableZooKeeper: Possibly transient ZooKeeper, quorum=zk2.srv.com:2181,zk1.srv.com:2181,zk3.srv.com:2181, exception=org.apache.zookeeper.KeeperException$SessionExpiredException: KeeperErrorCode = Session expired for /hbase-unsecure/replication/rs/rs14.srv.com,60020,1469499256438
2016-10-01 19:14:46,116 WARN  [regionserver60020] zookeeper.RecoverableZooKeeper: Possibly transient ZooKeeper, quorum=zk2.srv.com:2181,zk1.srv.com:2181,zk3.srv.com:2181, exception=org.apache.zookeeper.KeeperException$SessionExpiredException: KeeperErrorCode = Session expired for /hbase-unsecure/replication/rs/rs14.srv.com,60020,1469499256438
2016-10-01 19:14:54,117 WARN  [regionserver60020] zookeeper.RecoverableZooKeeper: Possibly transient ZooKeeper, quorum=zk2.srv.com:2181,zk1.srv.com:2181,zk3.srv.com:2181, exception=org.apache.zookeeper.KeeperException$SessionExpiredException: KeeperErrorCode = Session expired for /hbase-unsecure/replication/rs/rs14.srv.com,60020,1469499256438
2016-10-01 19:14:54,117 ERROR [regionserver60020] zookeeper.RecoverableZooKeeper: ZooKeeper getChildren failed after 4 attempts
2016-10-01 19:14:54,117 DEBUG [regionserver60020] ipc.RpcClient: Stopping rpc client
2016-10-01 19:14:54,117 WARN  [regionserver60020] zookeeper.RecoverableZooKeeper: Possibly transient ZooKeeper, quorum=zk2.srv.com:2181,zk1.srv.com:2181,zk3.srv.com:2181, exception=org.apache.zookeeper.KeeperException$SessionExpiredException: KeeperErrorCode = Session expired for /hbase-unsecure/rs/rs14.srv.com,60020,1469499256438
2016-10-01 19:14:55,118 WARN  [regionserver60020] zookeeper.RecoverableZooKeeper: Possibly transient ZooKeeper, quorum=zk2.srv.com:2181,zk1.srv.com:2181,zk3.srv.com:2181, exception=org.apache.zookeeper.KeeperException$SessionExpiredException: KeeperErrorCode = Session expired for /hbase-unsecure/rs/rs14.srv.com,60020,1469499256438
2016-10-01 19:14:57,118 WARN  [regionserver60020] zookeeper.RecoverableZooKeeper: Possibly transient ZooKeeper, quorum=zk2.srv.com:2181,zk1.srv.com:2181,zk3.srv.com:2181, exception=org.apache.zookeeper.KeeperException$SessionExpiredException: KeeperErrorCode = Session expired for /hbase-unsecure/rs/rs14.srv.com,60020,1469499256438
2016-10-01 19:15:01,118 WARN  [regionserver60020] zookeeper.RecoverableZooKeeper: Possibly transient ZooKeeper, quorum=zk2.srv.com:2181,zk1.srv.com:2181,zk3.srv.com:2181, exception=org.apache.zookeeper.KeeperException$SessionExpiredException: KeeperErrorCode = Session expired for /hbase-unsecure/rs/rs14.srv.com,60020,1469499256438
2016-10-01 19:15:09,119 WARN  [regionserver60020] zookeeper.RecoverableZooKeeper: Possibly transient ZooKeeper, quorum=zk2.srv.com:2181,zk1.srv.com:2181,zk3.srv.com:2181, exception=org.apache.zookeeper.KeeperException$SessionExpiredException: KeeperErrorCode = Session expired for /hbase-unsecure/rs/rs14.srv.com,60020,1469499256438
2016-10-01 19:15:09,119 ERROR [regionserver60020] zookeeper.RecoverableZooKeeper: ZooKeeper delete failed after 4 attempts
2016-10-01 19:15:09,119 WARN  [regionserver60020] regionserver.HRegionServer: Failed deleting my ephemeral node
org.apache.zookeeper.KeeperException$SessionExpiredException: KeeperErrorCode = Session expired for /hbase-unsecure/rs/rs14.srv.com,60020,1469499256438
        at org.apache.zookeeper.KeeperException.create(KeeperException.java:127)
        at org.apache.zookeeper.KeeperException.create(KeeperException.java:51)
        .....
2016-10-01 19:15:09,138 INFO  [regionserver60020] regionserver.HRegionServer: stopping server rs14.srv.com,60020,1469499256438; zookeeper connection closed.
2016-10-01 19:15:09,138 INFO  [regionserver60020] regionserver.HRegionServer: regionserver60020 exiting
2016-10-01 19:15:09,138 ERROR [main] regionserver.HRegionServerCommandLine: Region server exiting
java.lang.RuntimeException: HRegionServer Aborted
        at org.apache.hadoop.hbase.regionserver.HRegionServerCommandLine.start(HRegionServerCommandLine.java:66)
        at org.apache.hadoop.hbase.regionserver.HRegionServerCommandLine.run(HRegionServerCommandLine.java:85)
        .......
2016-10-01 19:15:09,168 INFO  [Thread-11] regionserver.ShutdownHook: Shutdown hook starting; hbase.shutdown.hook=true; fsShutdownHook=org.apache.hadoop.fs.FileSystem$Cache$ClientFinalizer@750f19ee
2016-10-01 19:15:09,168 INFO  [Thread-11] regionserver.ShutdownHook: Starting fs shutdown hook thread.
2016-10-01 19:15:09,169 INFO  [Thread-11] regionserver.ShutdownHook: Shutdown hook finished.

Thanks.


Re: Why region server suddenly shutdown?

@Mr GL

Can you attach the full region server logs so we can find the exact reason for the failure or shutdown?

Re: Why region server suddenly shutdown?

Hello MrGL,

What was your workload pattern when the region server went down? Loading data, like heavy writes? Heavy reads? Is it only one region server or all of them?

The message "KeeperException$SessionExpiredException" points to a ZooKeeper session timeout, so your region server probably did not heartbeat back in a timely fashion. You can start by increasing the session timeout config and then look at the pattern of the issue. My guess would be heavy write load on a region server with data skew and a compaction that takes too long.
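For reference, the session timeout is usually raised in hbase-site.xml; the value below is only illustrative, and note that the effective timeout is also bounded by the ZooKeeper server's tickTime settings:

```xml
<!-- hbase-site.xml: ZooKeeper session timeout in milliseconds.
     90000 is the common HBase default; 120000 here is just an example. -->
<property>
  <name>zookeeper.session.timeout</name>
  <value>120000</value>
</property>
```

Raising the timeout only buys headroom; if the root cause is a long pause or overload, it will eventually recur, so treat this as a mitigation while investigating.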

Re: Why region server suddenly shutdown?

It could also have been a long JVM garbage collection pause, which would cause the ZooKeeper session to expire in the same way.
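One way to check this theory is to enable GC logging for the region server. A sketch for hbase-env.sh, assuming a Java 7/8 era JVM as shipped with HDP (the flags and log path are illustrative):

```shell
# hbase-env.sh: log GC activity so long pauses can be correlated
# with the session-expired timestamps (log path is an example).
export HBASE_REGIONSERVER_OPTS="$HBASE_REGIONSERVER_OPTS \
  -verbose:gc -XX:+PrintGCDetails -XX:+PrintGCDateStamps \
  -Xloggc:/var/log/hbase/gc-regionserver.log"
```

Multi-second collection times in that log around 19:14 on the day of the crash would confirm a GC pause as the trigger.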


Re: Why region server suddenly shutdown?

Super Collaborator

The following error is related to HDFS, so please check HDFS health as well:

No lease on /apps/hbase/data/oldWALs/rs14.srv.com%2C60020%2C1469499256438.1475321424271 (inode 386054527): File is not open for writing.
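A quick HDFS health check might look like the following (run on a cluster node with the hdfs client; the HBase path matches the log above):

```shell
# Overall filesystem health: missing, corrupt, or under-replicated blocks
hdfs fsck /

# Check the HBase data directory specifically, including files still open for write
hdfs fsck /apps/hbase/data -openforwrite

# DataNode and capacity summary, to spot dead or struggling DataNodes
hdfs dfsadmin -report
```

Lease errors on oldWALs often appear when the NameNode or DataNodes were slow or unreachable while the region server was shutting down, so DataNode health around the crash time is worth checking.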

Re: Why region server suddenly shutdown?

Expert Contributor

You may need to add the property "dfs.datanode.max.transfer.threads" to hdfs-site.xml. The default value is 4096, which is usually enough for HBase jobs (refer to https://issues.apache.org/jira/browse/HDFS-1861); try setting it to 8192 or higher:

<property>
  <name>dfs.datanode.max.transfer.threads</name>
  <value>8192</value>
</property>

Re: Why region server suddenly shutdown?

Yes, full logs are required to look into this. The log lines before the "stopping server" line will give details about what actually happened.
