Created on 06-16-2015 12:33 PM - edited 09-16-2022 02:31 AM
CDH 5.3.3
Hbase Version : 0.98.6-cdh5.3.3
One-by-one all RegionServer Crashes while running MR jobs on Yarn, logs indicate Zookeeper related issue. Batch jobs runs for 30-45 mins before regionservers crashes. Eventually crashing all the regionservers. HbaseMaster (Active) & HbaseMaster(Standby) are unaffected.
------------Region Server Log Before crashing--------------
2015-06-15 20:21:22,498 INFO org.apache.hadoop.hbase.regionserver.wal.FSHLog: Slow sync cost: 1691 ms, current pipeline: [10.168.32.216:50010, 10.168.32.217:50010, 10.168.32.214:50010] 2015-06-15 20:21:22,498 INFO org.apache.hadoop.hbase.regionserver.wal.FSHLog: Slow sync cost: 1691 ms, current pipeline: [10.168.32.216:50010, 10.168.32.217:50010, 10.168.32.214:50010] 2015-06-15 20:21:22,499 INFO org.apache.hadoop.hbase.regionserver.wal.FSHLog: Slow sync cost: 1688 ms, current pipeline: [10.168.32.216:50010, 10.168.32.217:50010, 10.168.32.214:50010] 2015-06-15 20:21:22,499 INFO org.apache.hadoop.hbase.regionserver.wal.FSHLog: Slow sync cost: 1690 ms, current pipeline: [10.168.32.216:50010, 10.168.32.217:50010, 10.168.32.214:50010] 2015-06-15 20:21:27,557 INFO org.apache.hadoop.hbase.util.JvmPauseMonitor: Detected pause in JVM or host machine (eg GC): pause of approximately 1614ms GC pool 'ConcurrentMarkSweep' had collection(s): count=1 time=1655ms 2015-06-15 20:21:27,573 INFO org.apache.hadoop.hbase.regionserver.wal.FSHLog: Slow sync cost: 1655 ms, current pipeline: [10.168.32.216:50010, 10.168.32.217:50010, 10.168.32.214:50010] 2015-06-15 20:21:29,103 INFO org.apache.hadoop.hbase.util.JvmPauseMonitor: Detected pause in JVM or host machine (eg GC): pause of approximately 1045ms GC pool 'ConcurrentMarkSweep' had collection(s): count=1 time=1276ms 2015-06-15 20:21:33,820 INFO org.apache.zookeeper.ZooKeeper: Client environment:zookeeper.version=3.4.5-cdh5.3.3--1, built on 04/08/2015 21:51 GMT 2015-06-15 20:21:33,839 INFO org.apache.zookeeper.ZooKeeper: Client environment:host.name=i-9f2ad949.prod-dis11.aws1 2015-06-15 20:21:33,839 INFO org.apache.zookeeper.ZooKeeper: Client environment:java.version=1.7.0_67 2015-06-15 20:21:33,839 INFO org.apache.zookeeper.ZooKeeper: Client environment:java.vendor=Oracle Corporation 2015-06-15 20:21:33,839 INFO org.apache.zookeeper.ZooKeeper: Client environment:java.home=/usr/java/jdk1.7.0_67-cloudera/jre 2015-06-15 20:21:33,840 INFO org.apache.zookeeper.ZooKeeper: Client environment:java.class.path=/var/run/cloudera-scm-agent/process/2052-hbase-REGIONSERVER:/usr/java/jdk1.7.0_67-cloudera/lib/tools.jar:/opt/cloudera/parcels/CDH-5.3.3-1.cdh5.3.3.p0.5/....... (big CLASSPATH omitted for brevity) 2015-06-15 20:21:33,842 INFO org.apache.zookeeper.ZooKeeper: Client environment:java.library.path=/opt/cloudera/parcels/CDH-5.3.3-1.cdh5.3.3.p0.5/lib/hadoop/lib/native 2015-06-15 20:21:33,842 INFO org.apache.zookeeper.ZooKeeper: Client environment:java.io.tmpdir=/tmp 2015-06-15 20:21:33,842 INFO org.apache.zookeeper.ZooKeeper: Client environment:java.compiler=<NA> 2015-06-15 20:21:33,842 INFO org.apache.zookeeper.ZooKeeper: Client environment:os.name=Linux 2015-06-15 20:21:33,842 INFO org.apache.zookeeper.ZooKeeper: Client environment:os.arch=amd64 2015-06-15 20:21:33,842 INFO org.apache.zookeeper.ZooKeeper: Client environment:os.version=2.6.32-358.14.1.el6.x86_64 2015-06-15 20:21:33,842 INFO org.apache.zookeeper.ZooKeeper: Client environment:user.name=hbase 2015-06-15 20:21:33,842 INFO org.apache.zookeeper.ZooKeeper: Client environment:user.home=/var/lib/hbase 2015-06-15 20:21:33,842 INFO org.apache.zookeeper.ZooKeeper: Client environment:user.dir=/var/run/cloudera-scm-agent/process/2052-hbase-REGIONSERVER 2015-06-15 20:21:33,843 INFO org.apache.zookeeper.ZooKeeper: Initiating client connection, connectString=i-802bd856.prod-dis11.aws1:2181,i-0d2bd8db.prod-dis11.aws1:2181,i-942ad942.prod-dis11.aws1:2181 sessionTimeout=30000 watcher=org.apache.zookeeper.ZooKeeperMain$MyWatcher@24ac7456 2015-06-15 20:21:33,864 INFO org.apache.zookeeper.ClientCnxn: Opening socket connection to server i-0d2bd8db.prod-dis11.aws1/10.168.32.11:2181. Will not attempt to authenticate using SASL (unknown error) 2015-06-15 20:21:33,870 INFO org.apache.zookeeper.ClientCnxn: Socket connection established to i-0d2bd8db.prod-dis11.aws1/10.168.32.11:2181, initiating session 2015-06-15 20:21:33,877 INFO org.apache.zookeeper.ClientCnxn: Session establishment complete on server i-0d2bd8db.prod-dis11.aws1/10.168.32.11:2181, sessionid = 0x14dfa17990303cd, negotiated timeout = 30000
------------Corrosponding Zookeeper Log--------------
..... 2015-06-15 20:21:17,707 INFO org.apache.zookeeper.server.ZooKeeperServer: Client attempting to establish new session at /10.168.32.218:53815 2015-06-15 20:21:17,708 INFO org.apache.zookeeper.server.ZooKeeperServer: Established session 0x14dfa17990303cb with negotiated timeout 180000 for client /10.168.32.218:53815 2015-06-15 20:21:17,963 INFO org.apache.zookeeper.server.NIOServerCnxnFactory: Accepted socket connection from /10.168.32.218:53825 2015-06-15 20:21:17,967 INFO org.apache.zookeeper.server.ZooKeeperServer: Client attempting to establish new session at /10.168.32.218:53825 2015-06-15 20:21:17,968 INFO org.apache.zookeeper.server.ZooKeeperServer: Established session 0x14dfa17990303cc with negotiated timeout 180000 for client /10.168.32.218:53825 2015-06-15 20:21:32,068 WARN org.apache.zookeeper.server.NIOServerCnxn: caught end of stream exception EndOfStreamException: Unable to read additional data from client sessionid 0x14dfa1799030007, likely client has closed socket at org.apache.zookeeper.server.NIOServerCnxn.doIO(NIOServerCnxn.java:220) at org.apache.zookeeper.server.NIOServerCnxnFactory.run(NIOServerCnxnFactory.java:208) at java.lang.Thread.run(Thread.java:745) 2015-06-15 20:21:32,068 INFO org.apache.zookeeper.server.NIOServerCnxn: Closed socket connection for client /10.168.32.216:40742 which had sessionid 0x14dfa1799030007 2015-06-15 20:21:33,870 INFO org.apache.zookeeper.server.NIOServerCnxnFactory: Accepted socket connection from /10.168.32.216:59976 2015-06-15 20:21:33,874 INFO org.apache.zookeeper.server.ZooKeeperServer: Client attempting to establish new session at /10.168.32.216:59976 2015-06-15 20:21:33,876 INFO org.apache.zookeeper.server.ZooKeeperServer: Established session 0x14dfa17990303cd with negotiated timeout 30000 for client /10.168.32.216:59976 2015-06-15 20:21:34,208 WARN org.apache.zookeeper.server.NIOServerCnxn: caught end of stream exception EndOfStreamException: Unable to read additional data from client sessionid 0x14dfa17990303cd, likely client has closed socket at org.apache.zookeeper.server.NIOServerCnxn.doIO(NIOServerCnxn.java:220) at org.apache.zookeeper.server.NIOServerCnxnFactory.run(NIOServerCnxnFactory.java:208) at java.lang.Thread.run(Thread.java:745) 2015-06-15 20:21:34,209 INFO org.apache.zookeeper.server.NIOServerCnxn: Closed socket connection for client /10.168.32.216:59976 which had sessionid 0x14dfa17990303cd 2015-06-15 20:21:36,321 INFO org.apache.zookeeper.server.NIOServerCnxn: Closed socket connection for client /10.168.32.221:33684 which had sessionid 0x14dfa17990303c3 2015-06-15 20:21:38,409 WARN org.apache.zookeeper.server.NIOServerCnxn: caught end of stream exception EndOfStreamException: Unable to read additional data from client sessionid 0x14dfa179903000b, likely client has closed socket at org.apache.zookeeper.server.NIOServerCnxn.doIO(NIOServerCnxn.java:220) at org.apache.zookeeper.server.NIOServerCnxnFactory.run(NIOServerCnxnFactory.java:208) at java.lang.Thread.run(Thread.java:745) ..... .....
Created 06-16-2015 03:04 PM
usually this is due to the map/reduce job running the regionserver OOM from a massive amount of requests or many very large requests. Then, the region of interest moves to the next regionserver, OOM again, etc.
Generally to fix this you either have to give the map/reduce job less resources or you have to increase regionserver memory, sometimes you may be able to just tune down your hbase scan cache setting in your MR job.
you can confirm my suspicions by checking your garbage collection log for Out of Memory errors.
Created 06-16-2015 03:04 PM
usually this is due to the map/reduce job running the regionserver OOM from a massive amount of requests or many very large requests. Then, the region of interest moves to the next regionserver, OOM again, etc.
Generally to fix this you either have to give the map/reduce job less resources or you have to increase regionserver memory, sometimes you may be able to just tune down your hbase scan cache setting in your MR job.
you can confirm my suspicions by checking your garbage collection log for Out of Memory errors.
Created on 06-19-2015 10:15 AM - edited 06-19-2015 10:18 AM
Thanks Ben.
I did not get OOM in the logs of (RegionServers), (HMaster), (Zookeeper), Yarn Container logs. So I don't think it was starving from memory.
However clouderaManager graphs for Regionserver reported frequent Garbage Collection, mean time for GC ~(23m(ms))
I went adhead with your recommendations on increasing regionserver memory. Since I had the liberty of a node with 122GB of memory, I bumped the (Java Heap Size of HBase RegionServer) from it default 4GB to 16GB. After the switch I no loger have ReagionServer crashes and applications are performing great.
Later I checked the heap utilization of RegionServer on ClouderaManager Graphs, it uses (12-14 GB) of the acclocated heap. Also my GC graphs show (86s(ms)) which is 100 times lower that ~(23m(ms)). I wish I can share both these graphs.