RegionServers crash while running MR jobs on YARN; logs indicate a ZooKeeper-related issue

Explorer

CDH 5.3.3

HBase version: 0.98.6-cdh5.3.3

 

One by one, all RegionServers crash while running MR jobs on YARN, and the logs indicate a ZooKeeper-related issue. Batch jobs run for 30-45 minutes before the RegionServers start crashing, and eventually all of them go down. The active and standby HBase Masters are unaffected.

 

 

------------RegionServer log before crashing--------------

2015-06-15 20:21:22,498 INFO org.apache.hadoop.hbase.regionserver.wal.FSHLog: Slow sync cost: 1691 ms, current pipeline: [10.168.32.216:50010, 10.168.32.217:50010, 10.168.32.214:50010]
2015-06-15 20:21:22,498 INFO org.apache.hadoop.hbase.regionserver.wal.FSHLog: Slow sync cost: 1691 ms, current pipeline: [10.168.32.216:50010, 10.168.32.217:50010, 10.168.32.214:50010]
2015-06-15 20:21:22,499 INFO org.apache.hadoop.hbase.regionserver.wal.FSHLog: Slow sync cost: 1688 ms, current pipeline: [10.168.32.216:50010, 10.168.32.217:50010, 10.168.32.214:50010]
2015-06-15 20:21:22,499 INFO org.apache.hadoop.hbase.regionserver.wal.FSHLog: Slow sync cost: 1690 ms, current pipeline: [10.168.32.216:50010, 10.168.32.217:50010, 10.168.32.214:50010]
2015-06-15 20:21:27,557 INFO org.apache.hadoop.hbase.util.JvmPauseMonitor: Detected pause in JVM or host machine (eg GC): pause of approximately 1614ms
GC pool 'ConcurrentMarkSweep' had collection(s): count=1 time=1655ms
2015-06-15 20:21:27,573 INFO org.apache.hadoop.hbase.regionserver.wal.FSHLog: Slow sync cost: 1655 ms, current pipeline: [10.168.32.216:50010, 10.168.32.217:50010, 10.168.32.214:50010]
2015-06-15 20:21:29,103 INFO org.apache.hadoop.hbase.util.JvmPauseMonitor: Detected pause in JVM or host machine (eg GC): pause of approximately 1045ms
GC pool 'ConcurrentMarkSweep' had collection(s): count=1 time=1276ms
2015-06-15 20:21:33,820 INFO org.apache.zookeeper.ZooKeeper: Client environment:zookeeper.version=3.4.5-cdh5.3.3--1, built on 04/08/2015 21:51 GMT
2015-06-15 20:21:33,839 INFO org.apache.zookeeper.ZooKeeper: Client environment:host.name=i-9f2ad949.prod-dis11.aws1
2015-06-15 20:21:33,839 INFO org.apache.zookeeper.ZooKeeper: Client environment:java.version=1.7.0_67
2015-06-15 20:21:33,839 INFO org.apache.zookeeper.ZooKeeper: Client environment:java.vendor=Oracle Corporation
2015-06-15 20:21:33,839 INFO org.apache.zookeeper.ZooKeeper: Client environment:java.home=/usr/java/jdk1.7.0_67-cloudera/jre
2015-06-15 20:21:33,840 INFO org.apache.zookeeper.ZooKeeper: Client environment:java.class.path=/var/run/cloudera-scm-agent/process/2052-hbase-REGIONSERVER:/usr/java/jdk1.7.0_67-cloudera/lib/tools.jar:/opt/cloudera/parcels/CDH-5.3.3-1.cdh5.3.3.p0.5/....... (big CLASSPATH omitted for brevity)
2015-06-15 20:21:33,842 INFO org.apache.zookeeper.ZooKeeper: Client environment:java.library.path=/opt/cloudera/parcels/CDH-5.3.3-1.cdh5.3.3.p0.5/lib/hadoop/lib/native
2015-06-15 20:21:33,842 INFO org.apache.zookeeper.ZooKeeper: Client environment:java.io.tmpdir=/tmp
2015-06-15 20:21:33,842 INFO org.apache.zookeeper.ZooKeeper: Client environment:java.compiler=<NA>
2015-06-15 20:21:33,842 INFO org.apache.zookeeper.ZooKeeper: Client environment:os.name=Linux
2015-06-15 20:21:33,842 INFO org.apache.zookeeper.ZooKeeper: Client environment:os.arch=amd64
2015-06-15 20:21:33,842 INFO org.apache.zookeeper.ZooKeeper: Client environment:os.version=2.6.32-358.14.1.el6.x86_64
2015-06-15 20:21:33,842 INFO org.apache.zookeeper.ZooKeeper: Client environment:user.name=hbase
2015-06-15 20:21:33,842 INFO org.apache.zookeeper.ZooKeeper: Client environment:user.home=/var/lib/hbase
2015-06-15 20:21:33,842 INFO org.apache.zookeeper.ZooKeeper: Client environment:user.dir=/var/run/cloudera-scm-agent/process/2052-hbase-REGIONSERVER
2015-06-15 20:21:33,843 INFO org.apache.zookeeper.ZooKeeper: Initiating client connection, connectString=i-802bd856.prod-dis11.aws1:2181,i-0d2bd8db.prod-dis11.aws1:2181,i-942ad942.prod-dis11.aws1:2181 sessionTimeout=30000 watcher=org.apache.zookeeper.ZooKeeperMain$MyWatcher@24ac7456
2015-06-15 20:21:33,864 INFO org.apache.zookeeper.ClientCnxn: Opening socket connection to server i-0d2bd8db.prod-dis11.aws1/10.168.32.11:2181. Will not attempt to authenticate using SASL (unknown error)
2015-06-15 20:21:33,870 INFO org.apache.zookeeper.ClientCnxn: Socket connection established to i-0d2bd8db.prod-dis11.aws1/10.168.32.11:2181, initiating session
2015-06-15 20:21:33,877 INFO org.apache.zookeeper.ClientCnxn: Session establishment complete on server i-0d2bd8db.prod-dis11.aws1/10.168.32.11:2181, sessionid = 0x14dfa17990303cd, negotiated timeout = 30000

------------Corresponding ZooKeeper log--------------

.....
2015-06-15 20:21:17,707 INFO org.apache.zookeeper.server.ZooKeeperServer: Client attempting to establish new session at /10.168.32.218:53815
2015-06-15 20:21:17,708 INFO org.apache.zookeeper.server.ZooKeeperServer: Established session 0x14dfa17990303cb with negotiated timeout 180000 for client /10.168.32.218:53815
2015-06-15 20:21:17,963 INFO org.apache.zookeeper.server.NIOServerCnxnFactory: Accepted socket connection from /10.168.32.218:53825
2015-06-15 20:21:17,967 INFO org.apache.zookeeper.server.ZooKeeperServer: Client attempting to establish new session at /10.168.32.218:53825
2015-06-15 20:21:17,968 INFO org.apache.zookeeper.server.ZooKeeperServer: Established session 0x14dfa17990303cc with negotiated timeout 180000 for client /10.168.32.218:53825
2015-06-15 20:21:32,068 WARN org.apache.zookeeper.server.NIOServerCnxn: caught end of stream exception
EndOfStreamException: Unable to read additional data from client sessionid 0x14dfa1799030007, likely client has closed socket
        at org.apache.zookeeper.server.NIOServerCnxn.doIO(NIOServerCnxn.java:220)
        at org.apache.zookeeper.server.NIOServerCnxnFactory.run(NIOServerCnxnFactory.java:208)
        at java.lang.Thread.run(Thread.java:745)
2015-06-15 20:21:32,068 INFO org.apache.zookeeper.server.NIOServerCnxn: Closed socket connection for client /10.168.32.216:40742 which had sessionid 0x14dfa1799030007
2015-06-15 20:21:33,870 INFO org.apache.zookeeper.server.NIOServerCnxnFactory: Accepted socket connection from /10.168.32.216:59976
2015-06-15 20:21:33,874 INFO org.apache.zookeeper.server.ZooKeeperServer: Client attempting to establish new session at /10.168.32.216:59976
2015-06-15 20:21:33,876 INFO org.apache.zookeeper.server.ZooKeeperServer: Established session 0x14dfa17990303cd with negotiated timeout 30000 for client /10.168.32.216:59976
2015-06-15 20:21:34,208 WARN org.apache.zookeeper.server.NIOServerCnxn: caught end of stream exception
EndOfStreamException: Unable to read additional data from client sessionid 0x14dfa17990303cd, likely client has closed socket
        at org.apache.zookeeper.server.NIOServerCnxn.doIO(NIOServerCnxn.java:220)
        at org.apache.zookeeper.server.NIOServerCnxnFactory.run(NIOServerCnxnFactory.java:208)
        at java.lang.Thread.run(Thread.java:745)
2015-06-15 20:21:34,209 INFO org.apache.zookeeper.server.NIOServerCnxn: Closed socket connection for client /10.168.32.216:59976 which had sessionid 0x14dfa17990303cd
2015-06-15 20:21:36,321 INFO org.apache.zookeeper.server.NIOServerCnxn: Closed socket connection for client /10.168.32.221:33684 which had sessionid 0x14dfa17990303c3
2015-06-15 20:21:38,409 WARN org.apache.zookeeper.server.NIOServerCnxn: caught end of stream exception
EndOfStreamException: Unable to read additional data from client sessionid 0x14dfa179903000b, likely client has closed socket
        at org.apache.zookeeper.server.NIOServerCnxn.doIO(NIOServerCnxn.java:220)
        at org.apache.zookeeper.server.NIOServerCnxnFactory.run(NIOServerCnxnFactory.java:208)
        at java.lang.Thread.run(Thread.java:745)
.....
.....

 

1 ACCEPTED SOLUTION

Master Collaborator

Usually this is due to the MapReduce job running the RegionServer out of memory with a massive number of requests, or many very large requests. The region of interest then moves to the next RegionServer, which runs out of memory in turn, and so on.

 

Generally, to fix this you either have to give the MapReduce job fewer resources or increase RegionServer memory; sometimes you may be able to simply tune down the HBase scan caching setting in your MR job, as sketched below.
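For example, a minimal sketch of lowering scan caching in a TableMapper-based job might look like the following. The table name, mapper, and the caching value of 100 are hypothetical placeholders; the right value depends on your row sizes and RegionServer heap.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
import org.apache.hadoop.hbase.mapreduce.TableMapReduceUtil;
import org.apache.hadoop.hbase.mapreduce.TableMapper;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.output.NullOutputFormat;

public class ScanCachingExample {

    // Hypothetical map-only mapper; your real row processing goes here.
    public static class MyMapper extends TableMapper<ImmutableBytesWritable, NullWritable> {
        @Override
        protected void map(ImmutableBytesWritable key, Result value, Context context) {
            // process one row at a time...
        }
    }

    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        Job job = Job.getInstance(conf, "scan-with-small-cache");
        job.setJarByClass(ScanCachingExample.class);

        Scan scan = new Scan();
        // Fewer rows buffered per RPC means smaller result batches held in
        // RegionServer memory for each scanner call.
        scan.setCaching(100);            // e.g. down from 1000 or more
        // Don't flush the RegionServer block cache with a full-table scan.
        scan.setCacheBlocks(false);

        TableMapReduceUtil.initTableMapperJob(
                "my_table",              // hypothetical table name
                scan,
                MyMapper.class,
                ImmutableBytesWritable.class,
                NullWritable.class,
                job);
        job.setNumReduceTasks(0);
        job.setOutputFormatClass(NullOutputFormat.class);

        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}

Smaller caching values cost more RPC round trips but keep less data pinned on the RegionServer per scanner call; very large values do the opposite, which is what tends to push a loaded RegionServer into long GC pauses.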

 

You can confirm this suspicion by checking your garbage collection logs for out-of-memory errors.


2 REPLIES


Explorer

Thanks Ben.

I did not find OOM errors in the RegionServer, HMaster, ZooKeeper, or YARN container logs, so I don't think it was starved for memory.

However, the Cloudera Manager graphs for the RegionServers reported frequent garbage collection, with a mean GC time of ~23m (ms).

I went ahead with your recommendation to increase RegionServer memory. Since I had the liberty of a node with 122 GB of memory, I bumped the Java Heap Size of HBase RegionServer from its default 4 GB to 16 GB. After the change I no longer have RegionServer crashes and the applications are performing great.

 

Later I checked the RegionServer heap utilization on the Cloudera Manager graphs; it uses 12-14 GB of the allocated heap. Also, my GC graphs now show ~86s (ms), which is about 100 times lower than the earlier ~23m (ms). I wish I could share both of these graphs.