Explorer
Posts: 6
Registered: ‎08-15-2016

hbase crashed with DataStreamer Exception

Hi,

I posted this topic in the wrong group under CM, so I am posting a duplicate here in HBase, where I can expect replies from HBase users.

 

I am running an HBase cluster with CM 5.7 and four region servers. The region servers crash frequently with absolutely no load. Looking at the ZooKeeper logs, all ZooKeeper servers continuously report the WARN "end of stream exception", and the HMaster reports the warning shown below. I have searched for both messages but could not get any further. Any help is appreciated.

 

zookeeper :

Aug 15, 8:48:06.301 AM INFO org.apache.zookeeper.server.NIOServerCnxn Closed socket connection for client /10.11.181.88:38619 which had sessionid 0x25637f59f7699f7

Aug 15, 8:48:56.302 AM INFO org.apache.zookeeper.server.NIOServerCnxnFactory Accepted socket connection from /10.11.181.88:38650

Aug 15, 8:48:56.303 AM INFO org.apache.zookeeper.server.ZooKeeperServer Client attempting to establish new session at /10.11.181.88:38650

Aug 15, 8:48:56.305 AM INFO org.apache.zookeeper.server.ZooKeeperServer Established session 0x25637f59f7699f9 with negotiated timeout 30000 for client /10.11.181.88:38650

Aug 15, 8:49:09.484 AM INFO org.apache.zookeeper.server.NIOServerCnxnFactory Accepted socket connection from /10.11.181.85:34388

Aug 15, 8:49:09.484 AM WARN org.apache.zookeeper.server.NIOServerCnxn caught end of stream exception
EndOfStreamException: Unable to read additional data from client sessionid 0x0, likely client has closed socket
at org.apache.zookeeper.server.NIOServerCnxn.doIO(NIOServerCnxn.java:220)
at org.apache.zookeeper.server.NIOServerCnxnFactory.run(NIOServerCnxnFactory.java:208)
at java.lang.Thread.run(Thread.java:745)

Aug 15, 8:49:09.485 AM INFO org.apache.zookeeper.server.NIOServerCnxn Closed socket connection for client /10.11.181.85:34388 (no session established for client)

Aug 15, 8:49:11.259 AM INFO org.apache.zookeeper.server.NIOServerCnxnFactory Accepted socket connection from /10.11.181.88:38659

Aug 15, 8:49:11.259 AM INFO org.apache.zookeeper.server.ZooKeeperServer Client attempting to establish new session at /10.11.181.88:38659

Aug 15, 8:49:11.260 AM INFO org.apache.zookeeper.server.ZooKeeperServer Established session 0x25637f59f7699fa with negotiated timeout 30000 for client /10.11.181.88:38659

Aug 15, 8:49:11.283 AM INFO org.apache.zookeeper.server.NIOServerCnxn Closed socket connection for client /10.11.181.88:38659 which had sessionid 0x25637f59f7699fa

Aug 15, 8:49:26.337 AM INFO org.apache.zookeeper.server.NIOServerCnxn Closed socket connection for client /10.11.181.88:38650 which had sessionid 0x25637f59f7699f9

 

Region server log:

Aug 6, 3:33:45.187 AM FATAL org.apache.hadoop.hbase.regionserver.HRegionServer ABORTING region server hadoop-d02.s.abccompany.net,60020,1469814949631: regionserver:60020-0x35637cb94ab0000, quorum=zookeeper-d03.s.abccompany.net:2181,zookeeper-d01.s.abccompany.net:2181,zookeeper-d02.s.abccompany.net:2181, baseZNode=/hbase regionserver:60020-0x35637cb94ab0000 received expired from ZooKeeper, aborting
org.apache.zookeeper.KeeperException$SessionExpiredException: KeeperErrorCode = Session expired
at org.apache.hadoop.hbase.zookeeper.ZooKeeperWatcher.connectionEvent(ZooKeeperWatcher.java:700)
at org.apache.hadoop.hbase.zookeeper.ZooKeeperWatcher.process(ZooKeeperWatcher.java:611)
at org.apache.zookeeper.ClientCnxn$EventThread.processEvent(ClientCnxn.java:522)
at org.apache.zookeeper.ClientCnxn$EventThread.run(ClientCnxn.java:498)

http://pastebin.com/raw/h20BTpPf

 

Hmaster:

2016-08-15 08:40:52,101 WARN org.apache.hadoop.hbase.master.procedure.ServerCrashProcedure: Failed serverName=hadoop-d02.s.abccompany.net,60020,1469814949631, state=SERVER_CRASH_GET_REGIONS; retry Waiting on hbase:meta assignment; cycle=7861852, running for 221hrs, 8mins, 46sec
2016-08-15 08:40:52,101 WARN org.apache.hadoop.hbase.master.procedure.ServerCrashProcedure: Failed serverName=hadoop-d03.s.abccompany.net,60020,1469814949885, state=SERVER_CRASH_GET_REGIONS; retry Waiting on hbase:meta assignment; cycle=10383232, running for 292hrs, 2mins, 41sec
2016-08-15 08:40:52,102 INFO org.apache.hadoop.hbase.zookeeper.MetaTableLocator: Failed verification of hbase:meta,,1 at address=hadoop-d01.s.abccompany.net,60020,1469814949889, exception=This server is in the failed servers list: hadoop-d01.s.abccompany.net/10.11.181.89:60020
2016-08-15 08:40:52,102 INFO org.apache.hadoop.hbase.zookeeper.MetaTableLocator: Failed verification of hbase:meta,,1 at address=hadoop-d01.s.abccompany.net,60020,1469814949889, exception=This connection is closing
2016-08-15 08:40:52,202 WARN org.apache.hadoop.hbase.master.procedure.ServerCrashProcedure: Failed serverName=hadoop-d03.s.abccompany.net,60020,1469814949885, state=SERVER_CRASH_GET_REGIONS; retry Waiting on hbase:meta assignment; cycle=10383233, running for 292hrs, 2mins, 41sec
2016-08-15 08:40:52,202 WARN org.apache.hadoop.hbase.master.procedure.ServerCrashProcedure: Failed serverName=hadoop-d02.s.abccompany.net,60020,1469814949631, state=SERVER_CRASH_GET_REGIONS; retry Waiting on hbase:meta assignment; cycle=7861853, running for 221hrs, 8mins, 46sec

 

Thanks in advance

Explorer
Posts: 6
Registered: ‎08-15-2016

Re: hbase crashed with DataStreamer Exception

Any direction is appreciated!!

Expert Contributor
Posts: 101
Registered: ‎01-24-2014

Re: hbase crashed with DataStreamer Exception

Aug 6, 3:33:00.749 AM WARN org.apache.hadoop.hbase.util.Sleeper We slept 110921ms instead of 3000ms, this is likely due to a long garbage collecting pause and it's usually bad, see http://hbase.apache.org/book.html#trouble.rs.runtime.zkexpired

 

This message in your pastebin is the problem. If a region server pauses for longer than the ZooKeeper session timeout (30 s), its znode in ZooKeeper expires and the master reassigns its regions. In this case you GC'd for about 110 seconds. When the region server comes back from GC, it realizes it has been gone too long and kills itself.
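
To make the comparison concrete, here is a minimal Python sketch that pulls the pause length out of a Sleeper warning like the one above and compares it with the session timeout. The 30000 ms value is an assumption taken from the "negotiated timeout 30000" lines in the ZooKeeper log earlier in the thread, and the function names are made up for illustration; they are not part of HBase.

```python
import re

# Assumption: 30000 ms, as in the "negotiated timeout 30000" ZooKeeper log lines.
SESSION_TIMEOUT_MS = 30000

# Matches the text of the Sleeper warning quoted above.
SLEEPER_RE = re.compile(r"We slept (\d+)ms instead of (\d+)ms")

def parse_sleeper_pause(line):
    """Return (actual_ms, expected_ms) from a Sleeper warning, or None if no match."""
    m = SLEEPER_RE.search(line)
    if m is None:
        return None
    return int(m.group(1)), int(m.group(2))

def session_would_expire(line, timeout_ms=SESSION_TIMEOUT_MS):
    """True if the pause reported in the line is longer than the session timeout."""
    parsed = parse_sleeper_pause(line)
    if parsed is None:
        return False
    actual_ms, _ = parsed
    return actual_ms > timeout_ms

line = "WARN org.apache.hadoop.hbase.util.Sleeper We slept 110921ms instead of 3000ms"
print(parse_sleeper_pause(line))   # (110921, 3000)
print(session_would_expire(line))  # True
```

A pause of 110921 ms is well over three times the 30000 ms timeout, so the session is long gone by the time the process resumes.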

 

Since you say you had no load on HBase itself, I'd suspect the JVM was starved of the resources it needs for its garbage collection. Typically that would point to MapReduce consuming too much CPU, which in turn starves the GC for HBase, which then knocks out the region server. The reason this wouldn't knock out DataNodes and NodeManagers is that they both have default expiration times of roughly 10 minutes. That is fine in a batch processing system designed to run on commodity hardware.

If this is not the problem, then my follow-up question would be: what JVM options are you running?

Explorer
Posts: 6
Registered: ‎08-15-2016

Re: hbase crashed with DataStreamer Exception

These are the Java parameters I have set:

 

 

Client Java Heap Size in Bytes : 256 Mb

  

Hmaster:

-Xms4294967296 -Xmx4294967296

 

regionserver:

-Xms5915017216 -Xmx5915017216
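
For readers comparing these settings, the byte counts can be converted to more familiar units. This is only a small Python sketch; the helper name is invented for illustration.

```python
def bytes_to_mib(n_bytes):
    """Convert a byte count to mebibytes (1 MiB = 1024 * 1024 bytes)."""
    return n_bytes / (1024 * 1024)

# Heap sizes as given in the -Xms/-Xmx options above.
heaps = {"HMaster": 4294967296, "regionserver": 5915017216}

for role, size in heaps.items():
    print(f"{role}: {bytes_to_mib(size):.0f} MiB ({size / 1024**3:.2f} GiB)")
# HMaster: 4096 MiB (4.00 GiB)
# regionserver: 5641 MiB (5.51 GiB)
```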

Posts: 1,754
Kudos: 371
Solutions: 279
Registered: ‎07-31-2013

Re: hbase crashed with DataStreamer Exception

Note that in this case the JVM did pause, but it was not due to GC, as the second line of the JvmPauseMonitor message in the pastebin notes:

Aug 6, 3:33:07.979 AM WARN org.apache.hadoop.hbase.util.JvmPauseMonitor Detected pause in JVM or host machine (eg GC): pause of approximately 113786ms
No GCs detected

Such a non-GC pause would be attributed to the process being blocked at a lower level, for example while waiting on an important resource that the scheduler is not making available to it (or being blocked for other reasons). You'll need to look lower than the JVM to debug this, starting with your dmesg output.

There are multiple possible sources of such a problem, and the impact is usually wider too (other roles would face similar arbitrary pauses if it is an ongoing issue).
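
As a starting point for reading dmesg, a rough Python sketch like the one below can filter kernel log lines for a few messages that often accompany long stalls. The pattern list is an assumption chosen for illustration: it is not exhaustive, exact wording varies between kernel versions, and the sample lines are made up rather than taken from this cluster.

```python
import re

# Kernel message fragments that often accompany long process stalls.
# Assumption: illustrative only; wording differs across kernel versions.
SUSPECT_PATTERNS = {
    "hung task": re.compile(r"blocked for more than \d+ seconds"),
    "OOM killer": re.compile(r"Out of memory", re.IGNORECASE),
    "soft lockup": re.compile(r"soft lockup"),
    "allocation failure": re.compile(r"page allocation failure"),
}

def find_suspect_lines(lines):
    """Return (label, line) pairs for every line matching a suspect pattern."""
    hits = []
    for line in lines:
        for label, pattern in SUSPECT_PATTERNS.items():
            if pattern.search(line):
                hits.append((label, line.rstrip("\n")))
                break  # report each line only once
    return hits

# Made-up sample lines for illustration; not from the cluster in this thread.
sample = [
    "[12345.678] INFO: task java:4321 blocked for more than 120 seconds.",
    "[12346.000] eth0: link up",
]
for label, line in find_suspect_lines(sample):
    print(f"{label}: {line}")
```

On a real node, the dmesg output could be saved to a file and the open file object passed as the `lines` argument. Hits whose timestamps fall close to the JvmPauseMonitor warning would be the first thing to look at.
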
New Contributor
Posts: 1
Registered: ‎01-22-2018

Re: hbase crashed with DataStreamer Exception

Hi, I have the same problem, with random HBase region servers crashing one at a time. Can you help me understand whether this is normal behavior?

 

Also, how can I check at a lower level, using dmesg, to confirm what exactly is happening on the node?

 

Thanks,

Naren.
