Created on 08-30-2016 08:15 AM - edited 08-19-2019 03:08 AM
We have an HDP 2.4 cluster running HDFS, YARN, and HBase on 3 master nodes and 4 data nodes.
Each data node hosts an HBase RegionServer (8 GB heap), an HDFS DataNode, and a YARN NodeManager. Each data node is an Amazon EC2 d2.xlarge instance.
All master nodes run ZooKeeper. The other master processes are the HDFS (HA), HBase, and YARN (HA) masters. Each master node is an Amazon EC2 r3.xlarge instance.
We see the following problems on two of our data nodes, while the other nodes function properly. Please note that no MapReduce or YARN jobs are running when this happens:
1. The RegionServer occasionally dies with ZooKeeper session timeout exceptions:
2016-08-29 07:08:50,713 INFO [regionserver/ip-172-31-103-48.us-west-2.compute.internal/172.31.103.48:16020-SendThread(ip-172-31-103-112.us-west-2.compute.internal:2181)] zookeeper.ClientCnxn: Client session timed out, have not heard from server in 600097ms for sessionid 0x156d486e2120012, closing socket connection and attempting reconnect
2016-08-29 07:09:00,955 INFO [regionserver/ip-172-31-103-48.us-west-2.compute.internal/172.31.103.48:16020-SendThread(172.31.103.252:2181)] zookeeper.ClientCnxn: Opening socket connection to server 172.31.103.252/172.31.103.252:2181. Will not attempt to authenticate using SASL (unknown error)
2016-08-29 07:09:01,824 WARN [timeline] timeline.HadoopTimelineMetricsSink: Unable to send metrics to collector by address:http://ip-172-31-103-252.us-west-2.compute.internal:6188/ws/v1/timeline/metrics
2016-08-29 07:09:01,824 WARN [timeline] timeline.HadoopTimelineMetricsSink: Unable to send metrics to collector by address:http://ip-172-31-103-252.us-west-2.compute.internal:6188/ws/v1/timeline/metrics
2016-08-29 07:09:01,825 WARN [timeline] timeline.HadoopTimelineMetricsSink: Unable to send metrics to collector by address:http://ip-172-31-103-252.us-west-2.compute.internal:6188/ws/v1/timeline/metrics
2016-08-29 07:09:01,825 WARN [timeline] timeline.HadoopTimelineMetricsSink: Unable to send metrics to collector by address:http://ip-172-31-103-252.us-west-2.compute.internal:6188/ws/v1/timeline/metrics
2016-08-29 07:09:01,825 WARN [timeline] timeline.HadoopTimelineMetricsSink: Unable to send metrics to collector by address:http://ip-172-31-103-252.us-west-2.compute.internal:6188/ws/v1/timeline/metrics
2016-08-29 07:09:01,825 WARN [timeline] timeline.HadoopTimelineMetricsSink: Unable to send metrics to collector by address:http://ip-172-31-103-252.us-west-2.compute.internal:6188/ws/v1/timeline/metrics
2016-08-29 07:09:01,826 WARN [timeline] timeline.HadoopTimelineMetricsSink: Unable to send metrics to collector by address:http://ip-172-31-103-252.us-west-2.compute.internal:6188/ws/v1/timeline/metrics
2016-08-29 07:09:03,952 WARN [regionserver/ip-172-31-103-48.us-west-2.compute.internal/172.31.103.48:16020-SendThread(172.31.103.252:2181)] zookeeper.ClientCnxn: Session 0x156d486e2120012 for server null, unexpected error, closing socket connection and attempting reconnect
java.net.NoRouteToHostException: No route to host
    at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
    at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:717)
    at org.apache.zookeeper.ClientCnxnSocketNIO.doTransport(ClientCnxnSocketNIO.java:361)
    at org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:1125)
2016-08-29 07:09:14,960 INFO [regionserver/ip-172-31-103-48.us-west-2.compute.internal/172.31.103.48:16020-SendThread(172.31.103.171:2181)] zookeeper.ClientCnxn: Opening socket connection to server 172.31.103.171/172.31.103.171:2181. Will not attempt to authenticate using SASL (unknown error)
2016-08-29 07:09:16,808 WARN [regionserver/ip-172-31-103-48.us-west-2.compute.internal/172.31.103.48:16020-SendThread(172.31.103.171:2181)] zookeeper.ClientCnxn: Session 0x156d486e2120012 for server null, unexpected error, closing socket connection and attempting reconnect
java.net.NoRouteToHostException: No route to host
    at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
    at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:717)
    at org.apache.zookeeper.ClientCnxnSocketNIO.doTransport(ClientCnxnSocketNIO.java:361)
    at org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:1125)
2016-08-29 07:09:18,061 INFO [regionserver/ip-172-31-103-48.us-west-2.compute.internal/172.31.103.48:16020-SendThread(ip-172-31-103-112.us-west-2.compute.internal:2181)] zookeeper.ClientCnxn: Opening socket connection to server ip-172-31-103-112.us-west-2.compute.internal/172.31.103.112:2181.
Will not attempt to authenticate using SASL (unknown error) 2016-08-29 07:09:21,060 WARN [regionserver/ip-172-31-103-48.us-west-2.compute.internal/172.31.103.48:16020-SendThread(ip-172-31-103-112.us-west-2.compute.internal:2181)] zookeeper.ClientCnxn: Session 0x156d486e2120012 for server null, unexpected error, closing socket connection and attempting reconnect java.net.NoRouteToHostException: No route to host at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method) at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:717) at org.apache.zookeeper.ClientCnxnSocketNIO.doTransport(ClientCnxnSocketNIO.java:361) at org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:1125) 2016-08-29 07:09:21,399 INFO [regionserver/ip-172-31-103-48.us-west-2.compute.internal/172.31.103.48:16020-SendThread(172.31.103.252:2181)] zookeeper.ClientCnxn: Opening socket connection to server 172.31.103.252/172.31.103.252:2181. Will not attempt to authenticate using SASL (unknown error) 2016-08-29 07:09:23,640 WARN [regionserver/ip-172-31-103-48.us-west-2.compute.internal/172.31.103.48:16020-SendThread(172.31.103.252:2181)] zookeeper.ClientCnxn: Session 0x156d486e2120012 for server null, unexpected error, closing socket connection and attempting reconnect java.net.NoRouteToHostException: No route to host at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method) at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:717) at org.apache.zookeeper.ClientCnxnSocketNIO.doTransport(ClientCnxnSocketNIO.java:361) at org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:1125) 2016-08-29 07:09:24,182 INFO [regionserver/ip-172-31-103-48.us-west-2.compute.internal/172.31.103.48:16020-SendThread(172.31.103.171:2181)] zookeeper.ClientCnxn: Opening socket connection to server 172.31.103.171/172.31.103.171:2181. 
Will not attempt to authenticate using SASL (unknown error) 2016-08-29 07:09:27,180 WARN [regionserver/ip-172-31-103-48.us-west-2.compute.internal/172.31.103.48:16020-SendThread(172.31.103.171:2181)] zookeeper.ClientCnxn: Session 0x156d486e2120012 for server null, unexpected error, closing socket connection and attempting reconnect java.net.NoRouteToHostException: No route to host at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method) at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:717) at org.apache.zookeeper.ClientCnxnSocketNIO.doTransport(ClientCnxnSocketNIO.java:361) at org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:1125) 2016-08-29 07:09:28,949 INFO [regionserver/ip-172-31-103-48.us-west-2.compute.internal/172.31.103.48:16020-SendThread(ip-172-31-103-112.us-west-2.compute.internal:2181)] zookeeper.ClientCnxn: Opening socket connection to server ip-172-31-103-112.us-west-2.compute.internal/172.31.103.112:2181. Will not attempt to authenticate using SASL (unknown error) 2016-08-29 07:09:31,948 WARN [regionserver/ip-172-31-103-48.us-west-2.compute.internal/172.31.103.48:16020-SendThread(ip-172-31-103-112.us-west-2.compute.internal:2181)] zookeeper.ClientCnxn: Session 0x156d486e2120012 for server null, unexpected error, closing socket connection and attempting reconnect java.net.NoRouteToHostException: No route to host at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method) at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:717) at org.apache.zookeeper.ClientCnxnSocketNIO.doTransport(ClientCnxnSocketNIO.java:361) at org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:1125) 2016-08-29 07:09:32,446 INFO [regionserver/ip-172-31-103-48.us-west-2.compute.internal/172.31.103.48:16020-SendThread(172.31.103.252:2181)] zookeeper.ClientCnxn: Opening socket connection to server 172.31.103.252/172.31.103.252:2181. 
Will not attempt to authenticate using SASL (unknown error) 2016-08-29 07:09:35,444 WARN [regionserver/ip-172-31-103-48.us-west-2.compute.internal/172.31.103.48:16020-SendThread(172.31.103.252:2181)] zookeeper.ClientCnxn: Session 0x156d486e2120012 for server null, unexpected error, closing socket connection and attempting reconnect java.net.NoRouteToHostException: No route to host at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method) at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:717) at org.apache.zookeeper.ClientCnxnSocketNIO.doTransport(ClientCnxnSocketNIO.java:361) at org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:1125) 2016-08-29 07:09:36,208 INFO [regionserver/ip-172-31-103-48.us-west-2.compute.internal/172.31.103.48:16020-SendThread(172.31.103.171:2181)] zookeeper.ClientCnxn: Opening socket connection to server 172.31.103.171/172.31.103.171:2181. Will not attempt to authenticate using SASL (unknown error) 2016-08-29 07:09:39,024 INFO [main-SendThread(ip-172-31-103-252.us-west-2.compute.internal:2181)] zookeeper.ClientCnxn: Client session timed out, have not heard from server in 600081ms for sessionid 0x356d4878aa0001a, closing socket connection and attempting reconnect 2016-08-29 07:09:39,125 WARN [ReplicationExecutor-0] zookeeper.RecoverableZooKeeper: Possibly transient ZooKeeper, quorum=ip-172-31-103-112.us-west-2.compute.internal:2181,ip-172-31-103-171.us-west-2.compute.internal:2181,ip-172-31-103-252.us-west-2.compute.internal:2181, exception=org.apache.zookeeper.KeeperException$ConnectionLossException: KeeperErrorCode = ConnectionLoss for /hbase-unsecure/replication/rs/ip-172-31-103-124.us-west-2.compute.internal,16020,1472451828166 2016-08-29 07:09:39,125 WARN [PriorityRpcServer.handler=4,queue=0,port=16020] zookeeper.RecoverableZooKeeper: Possibly transient ZooKeeper, 
quorum=ip-172-31-103-112.us-west-2.compute.internal:2181,ip-172-31-103-171.us-west-2.compute.internal:2181,ip-172-31-103-252.us-west-2.compute.internal:2181, exception=org.apache.zookeeper.KeeperException$ConnectionLossException: KeeperErrorCode = ConnectionLoss for /hbase-unsecure/recovering-regions/0e281d12463252983d18abbe9e096fbd 2016-08-29 07:09:39,125 WARN [RS_OPEN_REGION-ip-172-31-103-48:16020-0] zookeeper.RecoverableZooKeeper: Possibly transient ZooKeeper, quorum=ip-172-31-103-112.us-west-2.compute.internal:2181,ip-172-31-103-171.us-west-2.compute.internal:2181,ip-172-31-103-252.us-west-2.compute.internal:2181, exception=org.apache.zookeeper.KeeperException$ConnectionLossException: KeeperErrorCode = ConnectionLoss for /hbase-unsecure/region-in-transition/4ef6634b001b40cd44c40c8406d6d389 2016-08-29 07:09:39,208 WARN [regionserver/ip-172-31-103-48.us-west-2.compute.internal/172.31.103.48:16020-SendThread(172.31.103.171:2181)] zookeeper.ClientCnxn: Session 0x156d486e2120012 for server null, unexpected error, closing socket connection and attempting reconnect java.net.NoRouteToHostException: No route to host at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method) at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:717) at org.apache.zookeeper.ClientCnxnSocketNIO.doTransport(ClientCnxnSocketNIO.java:361) at org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:1125) 2016-08-29 07:09:40,409 INFO [regionserver/ip-172-31-103-48.us-west-2.compute.internal/172.31.103.48:16020-SendThread(ip-172-31-103-112.us-west-2.compute.internal:2181)] zookeeper.ClientCnxn: Opening socket connection to server ip-172-31-103-112.us-west-2.compute.internal/172.31.103.112:2181. 
Will not attempt to authenticate using SASL (unknown error) 2016-08-29 07:09:43,408 WARN [regionserver/ip-172-31-103-48.us-west-2.compute.internal/172.31.103.48:16020-SendThread(ip-172-31-103-112.us-west-2.compute.internal:2181)] zookeeper.ClientCnxn: Session 0x156d486e2120012 for server null, unexpected error, closing socket connection and attempting reconnect java.net.NoRouteToHostException: No route to host at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method) at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:717) at org.apache.zookeeper.ClientCnxnSocketNIO.doTransport(ClientCnxnSocketNIO.java:361) at org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:1125) 2016-08-29 07:09:44,155 INFO [regionserver/ip-172-31-103-48.us-west-2.compute.internal/172.31.103.48:16020-SendThread(172.31.103.252:2181)] zookeeper.ClientCnxn: Opening socket connection to server 172.31.103.252/172.31.103.252:2181. Will not attempt to authenticate using SASL (unknown error) 2016-08-29 07:09:47,156 WARN [regionserver/ip-172-31-103-48.us-west-2.compute.internal/172.31.103.48:16020-SendThread(172.31.103.252:2181)] zookeeper.ClientCnxn: Session 0x156d486e2120012 for server null, unexpected error, closing socket connection and attempting reconnect java.net.NoRouteToHostException: No route to host at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method) at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:717) at org.apache.zookeeper.ClientCnxnSocketNIO.doTransport(ClientCnxnSocketNIO.java:361) at org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:1125) 2016-08-29 07:09:47,974 INFO [regionserver/ip-172-31-103-48.us-west-2.compute.internal/172.31.103.48:16020-SendThread(172.31.103.171:2181)] zookeeper.ClientCnxn: Opening socket connection to server 172.31.103.171/172.31.103.171:2181. 
Will not attempt to authenticate using SASL (unknown error) 2016-08-29 07:09:49,576 WARN [regionserver/ip-172-31-103-48.us-west-2.compute.internal/172.31.103.48:16020-SendThread(172.31.103.171:2181)] zookeeper.ClientCnxn: Session 0x156d486e2120012 for server null, unexpected error, closing socket connection and attempting reconnect java.net.NoRouteToHostException: No route to host at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method) at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:717) at org.apache.zookeeper.ClientCnxnSocketNIO.doTransport(ClientCnxnSocketNIO.java:361) at org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:1125) 2016-08-29 07:09:49,876 INFO [main-SendThread(172.31.103.171:2181)] zookeeper.ClientCnxn: Opening socket connection to server 172.31.103.171/172.31.103.171:2181. Will not attempt to authenticate using SASL (unknown error) 2016-08-29 07:09:51,266 INFO [regionserver/ip-172-31-103-48.us-west-2.compute.internal/172.31.103.48:16020-SendThread(ip-172-31-103-112.us-west-2.compute.internal:2181)] zookeeper.ClientCnxn: Opening socket connection to server ip-172-31-103-112.us-west-2.compute.internal/172.31.103.112:2181. 
Will not attempt to authenticate using SASL (unknown error) 2016-08-29 07:09:52,876 WARN [main-SendThread(172.31.103.171:2181)] zookeeper.ClientCnxn: Session 0x356d4878aa0001a for server null, unexpected error, closing socket connection and attempting reconnect java.net.NoRouteToHostException: No route to host at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method) at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:717) at org.apache.zookeeper.ClientCnxnSocketNIO.doTransport(ClientCnxnSocketNIO.java:361) at org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:1125) 2016-08-29 07:09:52,976 WARN [RS_OPEN_REGION-ip-172-31-103-48:16020-0] zookeeper.RecoverableZooKeeper: Possibly transient ZooKeeper, quorum=ip-172-31-103-112.us-west-2.compute.internal:2181,ip-172-31-103-171.us-west-2.compute.internal:2181,ip-172-31-103-252.us-west-2.compute.internal:2181, exception=org.apache.zookeeper.KeeperException$ConnectionLossException: KeeperErrorCode = ConnectionLoss for /hbase-unsecure/region-in-transition/4ef6634b001b40cd44c40c8406d6d389 2016-08-29 07:09:52,976 WARN [ReplicationExecutor-0] zookeeper.RecoverableZooKeeper: Possibly transient ZooKeeper, quorum=ip-172-31-103-112.us-west-2.compute.internal:2181,ip-172-31-103-171.us-west-2.compute.internal:2181,ip-172-31-103-252.us-west-2.compute.internal:2181, exception=org.apache.zookeeper.KeeperException$ConnectionLossException: KeeperErrorCode = ConnectionLoss for /hbase-unsecure/replication/rs/ip-172-31-103-124.us-west-2.compute.internal,16020,1472451828166 2016-08-29 07:09:52,976 WARN [PriorityRpcServer.handler=4,queue=0,port=16020] zookeeper.RecoverableZooKeeper: Possibly transient ZooKeeper, quorum=ip-172-31-103-112.us-west-2.compute.internal:2181,ip-172-31-103-171.us-west-2.compute.internal:2181,ip-172-31-103-252.us-west-2.compute.internal:2181, exception=org.apache.zookeeper.KeeperException$ConnectionLossException: KeeperErrorCode = ConnectionLoss for 
/hbase-unsecure/recovering-regions/0e281d12463252983d18abbe9e096fbd 2016-08-29 07:09:54,264 WARN [regionserver/ip-172-31-103-48.us-west-2.compute.internal/172.31.103.48:16020-SendThread(ip-172-31-103-112.us-west-2.compute.internal:2181)] zookeeper.ClientCnxn: Session 0x156d486e2120012 for server null, unexpected error, closing socket connection and attempting reconnect java.net.NoRouteToHostException: No route to host at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method) at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:717) at org.apache.zookeeper.ClientCnxnSocketNIO.doTransport(ClientCnxnSocketNIO.java:361) at org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:1125) 2016-08-29 07:09:54,579 INFO [regionserver/ip-172-31-103-48.us-west-2.compute.internal/172.31.103.48:16020-SendThread(172.31.103.252:2181)] zookeeper.ClientCnxn: Opening socket connection to server 172.31.103.252/172.31.103.252:2181. Will not attempt to authenticate using SASL (unknown error) 2016-08-29 07:09:57,580 WARN [regionserver/ip-172-31-103-48.us-west-2.compute.internal/172.31.103.48:16020-SendThread(172.31.103.252:2181)] zookeeper.ClientCnxn: Session 0x156d486e2120012 for server null, unexpected error, closing socket connection and attempting reconnect java.net.NoRouteToHostException: No route to host at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method) at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:717) at org.apache.zookeeper.ClientCnxnSocketNIO.doTransport(ClientCnxnSocketNIO.java:361) at org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:1125) 2016-08-29 07:09:58,048 INFO [regionserver/ip-172-31-103-48.us-west-2.compute.internal/172.31.103.48:16020-SendThread(172.31.103.171:2181)] zookeeper.ClientCnxn: Opening socket connection to server 172.31.103.171/172.31.103.171:2181. 
Will not attempt to authenticate using SASL (unknown error) 2016-08-29 07:10:01,048 WARN [regionserver/ip-172-31-103-48.us-west-2.compute.internal/172.31.103.48:16020-SendThread(172.31.103.171:2181)] zookeeper.ClientCnxn: Session 0x156d486e2120012 for server null, unexpected error, closing socket connection and attempting reconnect java.net.NoRouteToHostException: No route to host at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method) at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:717) at org.apache.zookeeper.ClientCnxnSocketNIO.doTransport(ClientCnxnSocketNIO.java:361) at org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:1125) 2016-08-29 07:10:02,282 INFO [regionserver/ip-172-31-103-48.us-west-2.compute.internal/172.31.103.48:16020-SendThread(ip-172-31-103-112.us-west-2.compute.internal:2181)] zookeeper.ClientCnxn: Opening socket connection to server ip-172-31-103-112.us-west-2.compute.internal/172.31.103.112:2181. Will not attempt to authenticate using SASL (unknown error) 2016-08-29 07:10:03,680 INFO [main-SendThread(172.31.103.112:2181)] zookeeper.ClientCnxn: Opening socket connection to server 172.31.103.112/172.31.103.112:2181. 
Will not attempt to authenticate using SASL (unknown error) 2016-08-29 07:10:05,280 WARN [main-SendThread(172.31.103.112:2181)] zookeeper.ClientCnxn: Session 0x356d4878aa0001a for server null, unexpected error, closing socket connection and attempting reconnect java.net.NoRouteToHostException: No route to host at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method) at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:717) at org.apache.zookeeper.ClientCnxnSocketNIO.doTransport(ClientCnxnSocketNIO.java:361) at org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:1125) 2016-08-29 07:10:05,280 WARN [regionserver/ip-172-31-103-48.us-west-2.compute.internal/172.31.103.48:16020-SendThread(ip-172-31-103-112.us-west-2.compute.internal:2181)] zookeeper.ClientCnxn: Session 0x156d486e2120012 for server null, unexpected error, closing socket connection and attempting reconnect java.net.NoRouteToHostException: No route to host at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method) at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:717) at org.apache.zookeeper.ClientCnxnSocketNIO.doTransport(ClientCnxnSocketNIO.java:361) at org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:1125) 2016-08-29 07:10:05,380 WARN [PriorityRpcServer.handler=4,queue=0,port=16020] zookeeper.RecoverableZooKeeper: Possibly transient ZooKeeper, quorum=ip-172-31-103-112.us-west-2.compute.internal:2181,ip-172-31-103-171.us-west-2.compute.internal:2181,ip-172-31-103-252.us-west-2.compute.internal:2181, exception=org.apache.zookeeper.KeeperException$ConnectionLossException: KeeperErrorCode = ConnectionLoss for /hbase-unsecure/recovering-regions/0e281d12463252983d18abbe9e096fbd 2016-08-29 07:10:05,380 WARN [RS_OPEN_REGION-ip-172-31-103-48:16020-0] zookeeper.RecoverableZooKeeper: Possibly transient ZooKeeper, 
quorum=ip-172-31-103-112.us-west-2.compute.internal:2181,ip-172-31-103-171.us-west-2.compute.internal:2181,ip-172-31-103-252.us-west-2.compute.internal:2181, exception=org.apache.zookeeper.KeeperException$ConnectionLossException: KeeperErrorCode = ConnectionLoss for /hbase-unsecure/region-in-transition/4ef6634b001b40cd44c40c8406d6d389 2016-08-29 07:10:05,380 WARN [ReplicationExecutor-0] zookeeper.RecoverableZooKeeper: Possibly transient ZooKeeper, quorum=ip-172-31-103-112.us-west-2.compute.internal:2181,ip-172-31-103-171.us-west-2.compute.internal:2181,ip-172-31-103-252.us-west-2.compute.internal:2181, exception=org.apache.zookeeper.KeeperException$ConnectionLossException: KeeperErrorCode = ConnectionLoss for /hbase-unsecure/replication/rs/ip-172-31-103-124.us-west-2.compute.internal,16020,1472451828166 2016-08-29 07:10:05,540 INFO [regionserver/ip-172-31-103-48.us-west-2.compute.internal/172.31.103.48:16020-SendThread(172.31.103.252:2181)] zookeeper.ClientCnxn: Opening socket connection to server 172.31.103.252/172.31.103.252:2181. Will not attempt to authenticate using SASL (unknown error) 2016-08-29 07:10:06,459 INFO [main-SendThread(ip-172-31-103-252.us-west-2.compute.internal:2181)] zookeeper.ClientCnxn: Opening socket connection to server ip-172-31-103-252.us-west-2.compute.internal/172.31.103.252:2181. 
Will not attempt to authenticate using SASL (unknown error) 2016-08-29 07:10:08,540 WARN [main-SendThread(ip-172-31-103-252.us-west-2.compute.internal:2181)] zookeeper.ClientCnxn: Session 0x356d4878aa0001a for server null, unexpected error, closing socket connection and attempting reconnect java.net.NoRouteToHostException: No route to host at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method) at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:717) at org.apache.zookeeper.ClientCnxnSocketNIO.doTransport(ClientCnxnSocketNIO.java:361) at org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:1125) 2016-08-29 07:10:08,540 WARN [regionserver/ip-172-31-103-48.us-west-2.compute.internal/172.31.103.48:16020-SendThread(172.31.103.252:2181)] zookeeper.ClientCnxn: Session 0x156d486e2120012 for server null, unexpected error, closing socket connection and attempting reconnect java.net.NoRouteToHostException: No route to host at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method) at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:717) at org.apache.zookeeper.ClientCnxnSocketNIO.doTransport(ClientCnxnSocketNIO.java:361) at org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:1125) 2016-08-29 07:10:09,000 INFO [regionserver/ip-172-31-103-48.us-west-2.compute.internal/172.31.103.48:16020-SendThread(172.31.103.171:2181)] zookeeper.ClientCnxn: Opening socket connection to server 172.31.103.171/172.31.103.171:2181. Will not attempt to authenticate using SASL (unknown error) 2016-08-29 07:10:09,187 INFO [main-SendThread(172.31.103.171:2181)] zookeeper.ClientCnxn: Opening socket connection to server 172.31.103.171/172.31.103.171:2181. 
Will not attempt to authenticate using SASL (unknown error) 2016-08-29 07:10:12,000 WARN [main-SendThread(172.31.103.171:2181)] zookeeper.ClientCnxn: Session 0x356d4878aa0001a for server null, unexpected error, closing socket connection and attempting reconnect java.net.NoRouteToHostException: No route to host at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method) at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:717) at org.apache.zookeeper.ClientCnxnSocketNIO.doTransport(ClientCnxnSocketNIO.java:361) at org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:1125) 2016-08-29 07:10:12,000 WARN [regionserver/ip-172-31-103-48.us-west-2.compute.internal/172.31.103.48:16020-SendThread(172.31.103.171:2181)] zookeeper.ClientCnxn: Session 0x156d486e2120012 for server null, unexpected error, closing socket connection and attempting reconnect java.net.NoRouteToHostException: No route to host at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method) at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:717) at org.apache.zookeeper.ClientCnxnSocketNIO.doTransport(ClientCnxnSocketNIO.java:361) at org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:1125) 2016-08-29 07:10:12,101 WARN [ReplicationExecutor-0] zookeeper.RecoverableZooKeeper: Possibly transient ZooKeeper, quorum=ip-172-31-103-112.us-west-2.compute.internal:2181,ip-172-31-103-171.us-west-2.compute.internal:2181,ip-172-31-103-252.us-west-2.compute.internal:2181, exception=org.apache.zookeeper.KeeperException$ConnectionLossException: KeeperErrorCode = ConnectionLoss for /hbase-unsecure/replication/rs/ip-172-31-103-124.us-west-2.compute.internal,16020,1472451828166 2016-08-29 07:10:12,101 WARN [PriorityRpcServer.handler=4,queue=0,port=16020] zookeeper.RecoverableZooKeeper: Possibly transient ZooKeeper, quorum=ip-172-31-103-112.us-west-2.compute.internal:2181,ip-172-31-103-171.us-west-2.compute.internal:2181,ip-172-31-103-252.us-west-2.compute.internal:2181, 
exception=org.apache.zookeeper.KeeperException$ConnectionLossException: KeeperErrorCode = ConnectionLoss for /hbase-unsecure/recovering-regions/0e281d12463252983d18abbe9e096fbd 2016-08-29 07:10:12,101 WARN [RS_OPEN_REGION-ip-172-31-103-48:16020-0] zookeeper.RecoverableZooKeeper: Possibly transient ZooKeeper, quorum=ip-172-31-103-112.us-west-2.compute.internal:2181,ip-172-31-103-171.us-west-2.compute.internal:2181,ip-172-31-103-252.us-west-2.compute.internal:2181, exception=org.apache.zookeeper.KeeperException$ConnectionLossException: KeeperErrorCode = ConnectionLoss for /hbase-unsecure/region-in-transition/4ef6634b001b40cd44c40c8406d6d389 2016-08-29 07:10:12,295 INFO [main-SendThread(172.31.103.112:2181)] zookeeper.ClientCnxn: Opening socket connection to server 172.31.103.112/172.31.103.112:2181. Will not attempt to authenticate using SASL (unknown error) 2016-08-29 07:10:13,750 INFO [regionserver/ip-172-31-103-48.us-west-2.compute.internal/172.31.103.48:16020-SendThread(ip-172-31-103-112.us-west-2.compute.internal:2181)] zookeeper.ClientCnxn: Opening socket connection to server ip-172-31-103-112.us-west-2.compute.internal/172.31.103.112:2181. 
Will not attempt to authenticate using SASL (unknown error) 2016-08-29 07:10:15,292 WARN [main-SendThread(172.31.103.112:2181)] zookeeper.ClientCnxn: Session 0x356d4878aa0001a for server null, unexpected error, closing socket connection and attempting reconnect java.net.NoRouteToHostException: No route to host at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method) at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:717) at org.apache.zookeeper.ClientCnxnSocketNIO.doTransport(ClientCnxnSocketNIO.java:361) at org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:1125) 2016-08-29 07:10:15,292 WARN [regionserver/ip-172-31-103-48.us-west-2.compute.internal/172.31.103.48:16020-SendThread(ip-172-31-103-112.us-west-2.compute.internal:2181)] zookeeper.ClientCnxn: Session 0x156d486e2120012 for server null, unexpected error, closing socket connection and attempting reconnect java.net.NoRouteToHostException: No route to host at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method) at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:717) at org.apache.zookeeper.ClientCnxnSocketNIO.doTransport(ClientCnxnSocketNIO.java:361) at org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:1125) 2016-08-29 07:10:15,654 INFO [regionserver/ip-172-31-103-48.us-west-2.compute.internal/172.31.103.48:16020-SendThread(172.31.103.252:2181)] zookeeper.ClientCnxn: Opening socket connection to server 172.31.103.252/172.31.103.252:2181. Will not attempt to authenticate using SASL (unknown error) 2016-08-29 07:10:17,322 INFO [main-SendThread(ip-172-31-103-252.us-west-2.compute.internal:2181)] zookeeper.ClientCnxn: Opening socket connection to server ip-172-31-103-252.us-west-2.compute.internal/172.31.103.252:2181. 
Will not attempt to authenticate using SASL (unknown error) 2016-08-29 07:10:18,652 WARN [main-SendThread(ip-172-31-103-252.us-west-2.compute.internal:2181)] zookeeper.ClientCnxn: Session 0x356d4878aa0001a for server null, unexpected error, closing socket connection and attempting reconnect java.net.NoRouteToHostException: No route to host at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method) at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:717) at org.apache.zookeeper.ClientCnxnSocketNIO.doTransport(ClientCnxnSocketNIO.java:361) at org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:1125) 2016-08-29 07:10:18,652 WARN [regionserver/ip-172-31-103-48.us-west-2.compute.internal/172.31.103.48:16020-SendThread(172.31.103.252:2181)] zookeeper.ClientCnxn: Session 0x156d486e2120012 for server null, unexpected error, closing socket connection and attempting reconnect java.net.NoRouteToHostException: No route to host at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method) at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:717) at org.apache.zookeeper.ClientCnxnSocketNIO.doTransport(ClientCnxnSocketNIO.java:361) at org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:1125) 2016-08-29 07:10:19,095 INFO [main-SendThread(172.31.103.171:2181)] zookeeper.ClientCnxn: Opening socket connection to server 172.31.103.171/172.31.103.171:2181. Will not attempt to authenticate using SASL (unknown error) 2016-08-29 07:10:19,620 INFO [regionserver/ip-172-31-103-48.us-west-2.compute.internal/172.31.103.48:16020-SendThread(172.31.103.171:2181)] zookeeper.ClientCnxn: Opening socket connection to server 172.31.103.171/172.31.103.171:2181. 
Will not attempt to authenticate using SASL (unknown error) 2016-08-29 07:10:22,096 WARN [main-SendThread(172.31.103.171:2181)] zookeeper.ClientCnxn: Session 0x356d4878aa0001a for server null, unexpected error, closing socket connection and attempting reconnect java.net.NoRouteToHostException: No route to host at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method) at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:717) at org.apache.zookeeper.ClientCnxnSocketNIO.doTransport(ClientCnxnSocketNIO.java:361) at org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:1125) 2016-08-29 07:10:22,096 WARN [regionserver/ip-172-31-103-48.us-west-2.compute.internal/172.31.103.48:16020-SendThread(172.31.103.171:2181)] zookeeper.ClientCnxn: Session 0x156d486e2120012 for server null, unexpected error, closing socket connection and attempting reconnect java.net.NoRouteToHostException: No route to host at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method) at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:717) at org.apache.zookeeper.ClientCnxnSocketNIO.doTransport(ClientCnxnSocketNIO.java:361) at org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:1125) 2016-08-29 07:10:22,196 WARN [PriorityRpcServer.handler=4,queue=0,port=16020] zookeeper.RecoverableZooKeeper: Possibly transient ZooKeeper, quorum=ip-172-31-103-112.us-west-2.compute.internal:2181,ip-172-31-103-171.us-west-2.compute.internal:2181,ip-172-31-103-252.us-west-2.compute.internal:2181, exception=org.apache.zookeeper.KeeperException$ConnectionLossException: KeeperErrorCode = ConnectionLoss for /hbase-unsecure/recovering-regions/0e281d12463252983d18abbe9e096fbd 2016-08-29 07:10:22,196 WARN [ReplicationExecutor-0] zookeeper.RecoverableZooKeeper: Possibly transient ZooKeeper, quorum=ip-172-31-103-112.us-west-2.compute.internal:2181,ip-172-31-103-171.us-west-2.compute.internal:2181,ip-172-31-103-252.us-west-2.compute.internal:2181, 
exception=org.apache.zookeeper.KeeperException$ConnectionLossException: KeeperErrorCode = ConnectionLoss for /hbase-unsecure/replication/rs/ip-172-31-103-124.us-west-2.compute.internal,16020,1472451828166 2016-08-29 07:10:22,196 WARN [RS_OPEN_REGION-ip-172-31-103-48:16020-0] zookeeper.RecoverableZooKeeper: Possibly transient ZooKeeper, quorum=ip-172-31-103-112.us-west-2.compute.internal:2181,ip-172-31-103-171.us-west-2.compute.internal:2181,ip-172-31-103-252.us-west-2.compute.internal:2181, exception=org.apache.zookeeper.KeeperException$ConnectionLossException: KeeperErrorCode = ConnectionLoss for /hbase-unsecure/region-in-transition/4ef6634b001b40cd44c40c8406d6d389 2016-08-29 07:10:22,196 ERROR [RS_OPEN_REGION-ip-172-31-103-48:16020-0] zookeeper.RecoverableZooKeeper: ZooKeeper getData failed after 4 attempts 2016-08-29 07:10:22,196 ERROR [PriorityRpcServer.handler=4,queue=0,port=16020] zookeeper.RecoverableZooKeeper: ZooKeeper getData failed after 4 attempts 2016-08-29 07:10:22,196 ERROR [ReplicationExecutor-0] zookeeper.RecoverableZooKeeper: ZooKeeper getChildren failed after 4 attempts 2016-08-29 07:10:22,196 WARN [PriorityRpcServer.handler=4,queue=0,port=16020] zookeeper.ZKUtil: regionserver:16020-0x356d4878aa0001a, quorum=ip-172-31-103-112.us-west-2.compute.internal:2181,ip-172-31-103-171.us-west-2.compute.internal:2181,ip-172-31-103-252.us-west-2.compute.internal:2181, baseZNode=/hbase-unsecure Unable to get data of znode /hbase-unsecure/recovering-regions/0e281d12463252983d18abbe9e096fbd org.apache.zookeeper.KeeperException$ConnectionLossException: KeeperErrorCode = ConnectionLoss for /hbase-unsecure/recovering-regions/0e281d12463252983d18abbe9e096fbd at org.apache.zookeeper.KeeperException.create(KeeperException.java:99) at org.apache.zookeeper.KeeperException.create(KeeperException.java:51) at org.apache.zookeeper.ZooKeeper.getData(ZooKeeper.java:1155) at org.apache.hadoop.hbase.zookeeper.RecoverableZooKeeper.getData(RecoverableZooKeeper.java:359) at 
org.apache.hadoop.hbase.zookeeper.ZKUtil.getDataInternal(ZKUtil.java:672) at org.apache.hadoop.hbase.zookeeper.ZKUtil.getDataAndWatch(ZKUtil.java:648) at org.apache.hadoop.hbase.zookeeper.ZKSplitLog.isRegionMarkedRecoveringInZK(ZKSplitLog.java:159) at org.apache.hadoop.hbase.regionserver.RSRpcServices.openRegion(RSRpcServices.java:1494) at org.apache.hadoop.hbase.protobuf.generated.AdminProtos$AdminService$2.callBlockingMethod(AdminProtos.java:22239) at org.apache.hadoop.hbase.ipc.RpcServer.call(RpcServer.java:2114) at org.apache.hadoop.hbase.ipc.CallRunner.run(CallRunner.java:101) at org.apache.hadoop.hbase.ipc.RpcExecutor.consumerLoop(RpcExecutor.java:130) at org.apache.hadoop.hbase.ipc.RpcExecutor$1.run(RpcExecutor.java:107) at java.lang.Thread.run(Thread.java:745) 2016-08-29 07:10:22,196 WARN [RS_OPEN_REGION-ip-172-31-103-48:16020-0] zookeeper.ZKUtil: regionserver:16020-0x356d4878aa0001a, quorum=ip-172-31-103-112.us-west-2.compute.internal:2181,ip-172-31-103-171.us-west-2.compute.internal:2181,ip-172-31-103-252.us-west-2.compute.internal:2181, baseZNode=/hbase-unsecure Unable to get data of znode /hbase-unsecure/region-in-transition/4ef6634b001b40cd44c40c8406d6d389 org.apache.zookeeper.KeeperException$ConnectionLossException: KeeperErrorCode = ConnectionLoss for /hbase-unsecure/region-in-transition/4ef6634b001b40cd44c40c8406d6d389 at org.apache.zookeeper.KeeperException.create(KeeperException.java:99) at org.apache.zookeeper.KeeperException.create(KeeperException.java:51) at org.apache.zookeeper.ZooKeeper.getData(ZooKeeper.java:1155) at org.apache.hadoop.hbase.zookeeper.RecoverableZooKeeper.getData(RecoverableZooKeeper.java:359) at org.apache.hadoop.hbase.zookeeper.ZKUtil.getDataNoWatch(ZKUtil.java:711) at org.apache.hadoop.hbase.zookeeper.ZKAssign.confirmNodeOpening(ZKAssign.java:652) at org.apache.hadoop.hbase.coordination.ZkOpenRegionCoordination.tickleOpening(ZkOpenRegionCoordination.java:160) at 
org.apache.hadoop.hbase.regionserver.handler.OpenRegionHandler$1.progress(OpenRegionHandler.java:371) at org.apache.hadoop.hbase.regionserver.HRegion.replayRecoveredEdits(HRegion.java:4189) at org.apache.hadoop.hbase.regionserver.HRegion.replayRecoveredEditsIfAny(HRegion.java:3953) at org.apache.hadoop.hbase.regionserver.HRegion.initializeRegionStores(HRegion.java:949) at org.apache.hadoop.hbase.regionserver.HRegion.initializeRegionInternals(HRegion.java:819) at org.apache.hadoop.hbase.regionserver.HRegion.initialize(HRegion.java:794) at org.apache.hadoop.hbase.regionserver.HRegion.openHRegion(HRegion.java:6328) at org.apache.hadoop.hbase.regionserver.HRegion.openHRegion(HRegion.java:6289) at org.apache.hadoop.hbase.regionserver.HRegion.openHRegion(HRegion.java:6260) at org.apache.hadoop.hbase.regionserver.HRegion.openHRegion(HRegion.java:6216) at org.apache.hadoop.hbase.regionserver.HRegion.openHRegion(HRegion.java:6167) at org.apache.hadoop.hbase.regionserver.handler.OpenRegionHandler.openRegion(OpenRegionHandler.java:362) at org.apache.hadoop.hbase.regionserver.handler.OpenRegionHandler.process(OpenRegionHandler.java:129) at org.apache.hadoop.hbase.executor.EventHandler.run(EventHandler.java:128) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) at java.lang.Thread.run(Thread.java:745) 2016-08-29 07:10:22,196 WARN [ReplicationExecutor-0] replication.ReplicationQueuesZKImpl: Got exception in copyQueuesFromRSUsingMulti: org.apache.zookeeper.KeeperException$ConnectionLossException: KeeperErrorCode = ConnectionLoss for /hbase-unsecure/replication/rs/ip-172-31-103-124.us-west-2.compute.internal,16020,1472451828166 at org.apache.zookeeper.KeeperException.create(KeeperException.java:99) at org.apache.zookeeper.KeeperException.create(KeeperException.java:51) at org.apache.zookeeper.ZooKeeper.getChildren(ZooKeeper.java:1472) at 
org.apache.hadoop.hbase.zookeeper.RecoverableZooKeeper.getChildren(RecoverableZooKeeper.java:295) at org.apache.hadoop.hbase.zookeeper.ZKUtil.listChildrenNoWatch(ZKUtil.java:511) at org.apache.hadoop.hbase.replication.ReplicationQueuesZKImpl.copyQueuesFromRSUsingMulti(ReplicationQueuesZKImpl.java:300) at org.apache.hadoop.hbase.replication.ReplicationQueuesZKImpl.claimQueues(ReplicationQueuesZKImpl.java:172) at org.apache.hadoop.hbase.replication.regionserver.ReplicationSourceManager$NodeFailoverWorker.run(ReplicationSourceManager.java:570) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) at java.lang.Thread.run(Thread.java:745) 2016-08-29 07:10:22,197 ERROR [RS_OPEN_REGION-ip-172-31-103-48:16020-0] zookeeper.ZooKeeperWatcher: regionserver:16020-0x356d4878aa0001a, quorum=ip-172-31-103-112.us-west-2.compute.internal:2181,ip-172-31-103-171.us-west-2.compute.internal:2181,ip-172-31-103-252.us-west-2.compute.internal:2181, baseZNode=/hbase-unsecure Received unexpected KeeperException, re-throwing exception org.apache.zookeeper.KeeperException$ConnectionLossException: KeeperErrorCode = ConnectionLoss for /hbase-unsecure/region-in-transition/4ef6634b001b40cd44c40c8406d6d389 at org.apache.zookeeper.KeeperException.create(KeeperException.java:99) at org.apache.zookeeper.KeeperException.create(KeeperException.java:51) at org.apache.zookeeper.ZooKeeper.getData(ZooKeeper.java:1155) at org.apache.hadoop.hbase.zookeeper.RecoverableZooKeeper.getData(RecoverableZooKeeper.java:359) at org.apache.hadoop.hbase.zookeeper.ZKUtil.getDataNoWatch(ZKUtil.java:711) at org.apache.hadoop.hbase.zookeeper.ZKAssign.confirmNodeOpening(ZKAssign.java:652) at org.apache.hadoop.hbase.coordination.ZkOpenRegionCoordination.tickleOpening(ZkOpenRegionCoordination.java:160) at org.apache.hadoop.hbase.regionserver.handler.OpenRegionHandler$1.progress(OpenRegionHandler.java:371) at 
org.apache.hadoop.hbase.regionserver.HRegion.replayRecoveredEdits(HRegion.java:4189) at org.apache.hadoop.hbase.regionserver.HRegion.replayRecoveredEditsIfAny(HRegion.java:3953) at org.apache.hadoop.hbase.regionserver.HRegion.initializeRegionStores(HRegion.java:949) at org.apache.hadoop.hbase.regionserver.HRegion.initializeRegionInternals(HRegion.java:819) at org.apache.hadoop.hbase.regionserver.HRegion.initialize(HRegion.java:794) at org.apache.hadoop.hbase.regionserver.HRegion.openHRegion(HRegion.java:6328) at org.apache.hadoop.hbase.regionserver.HRegion.openHRegion(HRegion.java:6289) at org.apache.hadoop.hbase.regionserver.HRegion.openHRegion(HRegion.java:6260) at org.apache.hadoop.hbase.regionserver.HRegion.openHRegion(HRegion.java:6216) at org.apache.hadoop.hbase.regionserver.HRegion.openHRegion(HRegion.java:6167) at org.apache.hadoop.hbase.regionserver.handler.OpenRegionHandler.openRegion(OpenRegionHandler.java:362) at org.apache.hadoop.hbase.regionserver.handler.OpenRegionHandler.process(OpenRegionHandler.java:129) at org.apache.hadoop.hbase.executor.EventHandler.run(EventHandler.java:128) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) at java.lang.Thread.run(Thread.java:745) 2016-08-29 07:10:22,197 ERROR [PriorityRpcServer.handler=4,queue=0,port=16020] zookeeper.ZooKeeperWatcher: regionserver:16020-0x356d4878aa0001a, quorum=ip-172-31-103-112.us-west-2.compute.internal:2181,ip-172-31-103-171.us-west-2.compute.internal:2181,ip-172-31-103-252.us-west-2.compute.internal:2181, baseZNode=/hbase-unsecure Received unexpected KeeperException, re-throwing exception org.apache.zookeeper.KeeperException$ConnectionLossException: KeeperErrorCode = ConnectionLoss for /hbase-unsecure/recovering-regions/0e281d12463252983d18abbe9e096fbd at org.apache.zookeeper.KeeperException.create(KeeperException.java:99) at 
org.apache.zookeeper.KeeperException.create(KeeperException.java:51) at org.apache.zookeeper.ZooKeeper.getData(ZooKeeper.java:1155) at org.apache.hadoop.hbase.zookeeper.RecoverableZooKeeper.getData(RecoverableZooKeeper.java:359) at org.apache.hadoop.hbase.zookeeper.ZKUtil.getDataInternal(ZKUtil.java:672) at org.apache.hadoop.hbase.zookeeper.ZKUtil.getDataAndWatch(ZKUtil.java:648) at org.apache.hadoop.hbase.zookeeper.ZKSplitLog.isRegionMarkedRecoveringInZK(ZKSplitLog.java:159) at org.apache.hadoop.hbase.regionserver.RSRpcServices.openRegion(RSRpcServices.java:1494) at org.apache.hadoop.hbase.protobuf.generated.AdminProtos$AdminService$2.callBlockingMethod(AdminProtos.java:22239) at org.apache.hadoop.hbase.ipc.RpcServer.call(RpcServer.java:2114) at org.apache.hadoop.hbase.ipc.CallRunner.run(CallRunner.java:101) at org.apache.hadoop.hbase.ipc.RpcExecutor.consumerLoop(RpcExecutor.java:130) at org.apache.hadoop.hbase.ipc.RpcExecutor$1.run(RpcExecutor.java:107) at java.lang.Thread.run(Thread.java:745) 2016-08-29 07:10:22,197 FATAL [RS_OPEN_REGION-ip-172-31-103-48:16020-0] regionserver.HRegionServer: ABORTING region server ip-172-31-103-48.us-west-2.compute.internal,16020,1472451821818: Exception refreshing OPENING; region=4ef6634b001b40cd44c40c8406d6d389, context=open_region_progress org.apache.zookeeper.KeeperException$ConnectionLossException: KeeperErrorCode = ConnectionLoss for /hbase-unsecure/region-in-transition/4ef6634b001b40cd44c40c8406d6d389 at org.apache.zookeeper.KeeperException.create(KeeperException.java:99) at org.apache.zookeeper.KeeperException.create(KeeperException.java:51) at org.apache.zookeeper.ZooKeeper.getData(ZooKeeper.java:1155) at org.apache.hadoop.hbase.zookeeper.RecoverableZooKeeper.getData(RecoverableZooKeeper.java:359) at org.apache.hadoop.hbase.zookeeper.ZKUtil.getDataNoWatch(ZKUtil.java:711) at org.apache.hadoop.hbase.zookeeper.ZKAssign.confirmNodeOpening(ZKAssign.java:652) at 
org.apache.hadoop.hbase.coordination.ZkOpenRegionCoordination.tickleOpening(ZkOpenRegionCoordination.java:160) at org.apache.hadoop.hbase.regionserver.handler.OpenRegionHandler$1.progress(OpenRegionHandler.java:371) at org.apache.hadoop.hbase.regionserver.HRegion.replayRecoveredEdits(HRegion.java:4189) at org.apache.hadoop.hbase.regionserver.HRegion.replayRecoveredEditsIfAny(HRegion.java:3953) at org.apache.hadoop.hbase.regionserver.HRegion.initializeRegionStores(HRegion.java:949) at org.apache.hadoop.hbase.regionserver.HRegion.initializeRegionInternals(HRegion.java:819) at org.apache.hadoop.hbase.regionserver.HRegion.initialize(HRegion.java:794) at org.apache.hadoop.hbase.regionserver.HRegion.openHRegion(HRegion.java:6328) at org.apache.hadoop.hbase.regionserver.HRegion.openHRegion(HRegion.java:6289) at org.apache.hadoop.hbase.regionserver.HRegion.openHRegion(HRegion.java:6260) at org.apache.hadoop.hbase.regionserver.HRegion.openHRegion(HRegion.java:6216) at org.apache.hadoop.hbase.regionserver.HRegion.openHRegion(HRegion.java:6167) at org.apache.hadoop.hbase.regionserver.handler.OpenRegionHandler.openRegion(OpenRegionHandler.java:362) at org.apache.hadoop.hbase.regionserver.handler.OpenRegionHandler.process(OpenRegionHandler.java:129) at org.apache.hadoop.hbase.executor.EventHandler.run(EventHandler.java:128) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) at java.lang.Thread.run(Thread.java:745) 2016-08-29 07:10:22,197 ERROR [PriorityRpcServer.handler=4,queue=0,port=16020] regionserver.RSRpcServices: Can't retrieve recovering state from zookeeper org.apache.zookeeper.KeeperException$ConnectionLossException: KeeperErrorCode = ConnectionLoss for /hbase-unsecure/recovering-regions/0e281d12463252983d18abbe9e096fbd at org.apache.zookeeper.KeeperException.create(KeeperException.java:99) at 
org.apache.zookeeper.KeeperException.create(KeeperException.java:51) at org.apache.zookeeper.ZooKeeper.getData(ZooKeeper.java:1155) at org.apache.hadoop.hbase.zookeeper.RecoverableZooKeeper.getData(RecoverableZooKeeper.java:359) at org.apache.hadoop.hbase.zookeeper.ZKUtil.getDataInternal(ZKUtil.java:672) at org.apache.hadoop.hbase.zookeeper.ZKUtil.getDataAndWatch(ZKUtil.java:648) at org.apache.hadoop.hbase.zookeeper.ZKSplitLog.isRegionMarkedRecoveringInZK(ZKSplitLog.java:159) at org.apache.hadoop.hbase.regionserver.RSRpcServices.openRegion(RSRpcServices.java:1494) at org.apache.hadoop.hbase.protobuf.generated.AdminProtos$AdminService$2.callBlockingMethod(AdminProtos.java:22239) at org.apache.hadoop.hbase.ipc.RpcServer.call(RpcServer.java:2114) at org.apache.hadoop.hbase.ipc.CallRunner.run(CallRunner.java:101) at org.apache.hadoop.hbase.ipc.RpcExecutor.consumerLoop(RpcExecutor.java:130) at org.apache.hadoop.hbase.ipc.RpcExecutor$1.run(RpcExecutor.java:107) at java.lang.Thread.run(Thread.java:745) 2016-08-29 07:10:22,198 FATAL [RS_OPEN_REGION-ip-172-31-103-48:16020-0] regionserver.HRegionServer: RegionServer abort: loaded coprocessors are: [org.apache.hadoop.hbase.security.access.SecureBulkLoadEndpoint] 2016-08-29 07:10:22,198 ERROR [PriorityRpcServer.handler=4,queue=0,port=16020] ipc.RpcServer: Unexpected throwable object org.apache.zookeeper.KeeperException$ConnectionLossException: KeeperErrorCode = ConnectionLoss for /hbase-unsecure/recovering-regions/0e281d12463252983d18abbe9e096fbd at org.apache.zookeeper.KeeperException.create(KeeperException.java:99) at org.apache.zookeeper.KeeperException.create(KeeperException.java:51) at org.apache.zookeeper.ZooKeeper.getData(ZooKeeper.java:1155) at org.apache.hadoop.hbase.zookeeper.RecoverableZooKeeper.getData(RecoverableZooKeeper.java:359) at org.apache.hadoop.hbase.zookeeper.ZKUtil.getDataInternal(ZKUtil.java:672) at org.apache.hadoop.hbase.zookeeper.ZKUtil.getDataAndWatch(ZKUtil.java:648) at 
org.apache.hadoop.hbase.zookeeper.ZKSplitLog.isRegionMarkedRecoveringInZK(ZKSplitLog.java:159) at org.apache.hadoop.hbase.regionserver.RSRpcServices.openRegion(RSRpcServices.java:1494) at org.apache.hadoop.hbase.protobuf.generated.AdminProtos$AdminService$2.callBlockingMethod(AdminProtos.java:22239) at org.apache.hadoop.hbase.ipc.RpcServer.call(RpcServer.java:2114) at org.apache.hadoop.hbase.ipc.CallRunner.run(CallRunner.java:101) at org.apache.hadoop.hbase.ipc.RpcExecutor.consumerLoop(RpcExecutor.java:130) at org.apache.hadoop.hbase.ipc.RpcExecutor$1.run(RpcExecutor.java:107) at java.lang.Thread.run(Thread.java:745) 2016-08-29 07:10:22,209 INFO [RS_OPEN_REGION-ip-172-31-103-48:16020-0] regionserver.HRegionServer: Dump of metrics as JSON on abort: { "beans" : [ { "name" : "java.lang:type=Memory", "modelerType" : "sun.management.MemoryImpl", "Verbose" : true, "ObjectPendingFinalizationCount" : 0, "NonHeapMemoryUsage" : { "committed" : 81408000, "init" : 2555904, "max" : -1, "used" : 80115416 }, "HeapMemoryUsage" : { "committed" : 8536260608, "init" : 8589934592, "max" : 8536260608, "used" : 1738968880 }, "ObjectName" : "java.lang:type=Memory" } ], "beans" : [ { "name" : "Hadoop:service=HBase,name=RegionServer,sub=IPC", "modelerType" : "RegionServer,sub=IPC", "tag.Context" : "regionserver", "tag.Hostname" : "ip-172-31-103-48", "queueSize" : 0, "numCallsInGeneralQueue" : 0, "numCallsInReplicationQueue" : 0, "numCallsInPriorityQueue" : 0, "numOpenConnections" : 1, "numActiveHandler" : 0, "receivedBytes" : 1190510401, "exceptions.RegionMovedException" : 10, "authenticationSuccesses" : 0, "authorizationFailures" : 0, "TotalCallTime_num_ops" : 5758, "TotalCallTime_min" : 0, "TotalCallTime_max" : 69392, "TotalCallTime_mean" : 29.966828759986107, "TotalCallTime_median" : 3.0, "TotalCallTime_75th_percentile" : 6.0, "TotalCallTime_95th_percentile" : 11.0, "TotalCallTime_99th_percentile" : 17.0, "exceptions.RegionTooBusyException" : 0, "exceptions.FailedSanityCheckException" : 
0, "exceptions.UnknownScannerException" : 0, "exceptions.OutOfOrderScannerNextException" : 0, "exceptions" : 11, "ProcessCallTime_num_ops" : 5758, "ProcessCallTime_min" : 0, "ProcessCallTime_max" : 69391, "ProcessCallTime_mean" : 29.88711358110455, "ProcessCallTime_median" : 3.0, "ProcessCallTime_75th_percentile" : 6.0, "ProcessCallTime_95th_percentile" : 11.0, "ProcessCallTime_99th_percentile" : 17.0, "exceptions.NotServingRegionException" : 0, "authorizationSuccesses" : 4, "sentBytes" : 2445857, "QueueCallTime_num_ops" : 5758, "QueueCallTime_min" : 0, "QueueCallTime_max" : 10, "QueueCallTime_mean" : 0.07971517888155609, "QueueCallTime_median" : 0.0, "QueueCallTime_75th_percentile" : 0.0, "QueueCallTime_95th_percentile" : 1.0, "QueueCallTime_99th_percentile" : 1.0, "authenticationFailures" : 0 } ], "beans" : [ { "name" : "Hadoop:service=HBase,name=RegionServer,sub=Replication", "modelerType" : "RegionServer,sub=Replication", "tag.Context" : "regionserver", "tag.Hostname" : "ip-172-31-103-48", "sink.appliedOps" : 0, "sink.ageOfLastAppliedOp" : 0, "sink.appliedBatches" : 0 } ], "beans" : [ { "name" : "Hadoop:service=HBase,name=RegionServer,sub=Server", "modelerType" : "RegionServer,sub=Server", "tag.zookeeperQuorum" : "ip-172-31-103-112.us-west-2.compute.internal:2181,ip-172-31-103-171.us-west-2.compute.internal:2181,ip-172-31-103-252.us-west-2.compute.internal:2181", "tag.serverName" : "ip-172-31-103-48.us-west-2.compute.internal,16020,1472451821818", "tag.clusterId" : "aa465b2d-db65-4316-87b9-fff8ca04e997", "tag.Context" : "regionserver", "tag.Hostname" : "ip-172-31-103-48", "regionCount" : 168, "storeCount" : 168, "hlogFileCount" : 11, "hlogFileSize" : 1180272772, "storeFileCount" : 278, "memStoreSize" : 1010265400, "storeFileSize" : 112800764928, "regionServerStartTime" : 1472451821818, "totalRequestCount" : 361717, "readRequestCount" : 0, "writeRequestCount" : 335401, "checkMutateFailedCount" : 0, "checkMutatePassedCount" : 0, "storeFileIndexSize" : 4179624, 
"staticIndexSize" : 502936093, "staticBloomSize" : 289169680, "mutationsWithoutWALCount" : 0, "mutationsWithoutWALSize" : 0, "percentFilesLocal" : 83, "percentFilesLocalSecondaryRegions" : 0, "splitQueueLength" : 0, "compactionQueueLength" : 1, "flushQueueLength" : 0, "blockCacheFreeSize" : 3403982448, "blockCacheCount" : 63, "blockCacheSize" : 10521744, "blockCacheHitCount" : 78562, "blockCacheHitCountPrimary" : 78562, "blockCacheMissCount" : 169129, "blockCacheMissCountPrimary" : 169129, "blockCacheEvictionCount" : 0, "blockCacheEvictionCountPrimary" : 0, "blockCacheCountHitPercent" : 31.0, "blockCacheExpressHitPercent" : 99, "updatesBlockedTime" : 0, "flushedCellsCount" : 40535, "compactedCellsCount" : 2779805, "majorCompactedCellsCount" : 247649, "flushedCellsSize" : 152286384, "compactedCellsSize" : 8624831557, "majorCompactedCellsSize" : 825856378, "blockedRequestCount" : 0, "Mutate_num_ops" : 12673, "Mutate_min" : 0, "Mutate_max" : 69383, "Mutate_mean" : 13.328335832083958, "Mutate_median" : 2.0, "Mutate_75th_percentile" : 3.0, "Mutate_95th_percentile" : 5.0, "Mutate_99th_percentile" : 10.0, "slowAppendCount" : 0, "slowDeleteCount" : 0, "Increment_num_ops" : 0, "Increment_min" : 0, "Increment_max" : 0, "Increment_mean" : 0.0, "Increment_median" : 0.0, "Increment_75th_percentile" : 0.0, "Increment_95th_percentile" : 0.0, "Increment_99th_percentile" : 0.0, "Replay_num_ops" : 0, "Replay_min" : 0, "Replay_max" : 0, "Replay_mean" : 0.0, "Replay_median" : 0.0, "Replay_75th_percentile" : 0.0, "Replay_95th_percentile" : 0.0, "Replay_99th_percentile" : 0.0, "FlushTime_num_ops" : 1, "FlushTime_min" : 70197, "FlushTime_max" : 70197, "FlushTime_mean" : 70197.0, "FlushTime_median" : 70197.0, "FlushTime_75th_percentile" : 70197.0, "FlushTime_95th_percentile" : 70197.0, "FlushTime_99th_percentile" : 70197.0, "Delete_num_ops" : 0, "Delete_min" : 0, "Delete_max" : 0, "Delete_mean" : 0.0, "Delete_median" : 0.0, "Delete_75th_percentile" : 0.0, "Delete_95th_percentile" : 0.0, 
"Delete_99th_percentile" : 0.0, "splitRequestCount" : 0, "splitSuccessCount" : 0, "slowGetCount" : 0, "Get_num_ops" : 0, "Get_min" : 0, "Get_max" : 0, "Get_mean" : 0.0, "Get_median" : 0.0, "Get_75th_percentile" : 0.0, "Get_95th_percentile" : 0.0, "Get_99th_percentile" : 0.0, "ScanNext_num_ops" : 0, "ScanNext_min" : 0, "ScanNext_max" : 0, "ScanNext_mean" : 0.0, "ScanNext_median" : 0.0, "ScanNext_75th_percentile" : 0.0, "ScanNext_95th_percentile" : 0.0, "ScanNext_99th_percentile" : 0.0, "slowPutCount" : 2, "slowIncrementCount" : 0, "Append_num_ops" : 0, "Append_min" : 0, "Append_max" : 0, "Append_mean" : 0.0, "Append_median" : 0.0, "Append_75th_percentile" : 0.0, "Append_95th_percentile" : 0.0, "Append_99th_percentile" : 0.0, "SplitTime_num_ops" : 0, "SplitTime_min" : 0, "SplitTime_max" : 0, "SplitTime_mean" : 0.0, "SplitTime_median" : 0.0, "SplitTime_75th_percentile" : 0.0, "SplitTime_95th_percentile" : 0.0, "SplitTime_99th_percentile" : 0.0 } ] } 2016-08-29 07:10:22,209 INFO [RS_OPEN_REGION-ip-172-31-103-48:16020-0] regionserver.HRegionServer: STOPPED: Exception refreshing OPENING; region=4ef6634b001b40cd44c40c8406d6d389, context=open_region_progress 2016-08-29 07:10:22,526 INFO [main-SendThread(172.31.103.112:2181)] zookeeper.ClientCnxn: Opening socket connection to server 172.31.103.112/172.31.103.112:2181. Will not attempt to authenticate using SASL (unknown error) 2016-08-29 07:10:23,354 INFO [regionserver/ip-172-31-103-48.us-west-2.compute.internal/172.31.103.48:16020-SendThread(ip-172-31-103-112.us-west-2.compute.internal:2181)] zookeeper.ClientCnxn: Opening socket connection to server ip-172-31-103-112.us-west-2.compute.internal/172.31.103.112:2181. 
Will not attempt to authenticate using SASL (unknown error) 2016-08-29 07:10:24,403 INFO [ip-172-31-103-48.us-west-2.compute.internal,16020,1472451821818_ChoreService_1] regionserver.HRegionServer$CompactionChecker: Chore: CompactionChecker was stopped 2016-08-29 07:10:24,404 INFO [ip-172-31-103-48.us-west-2.compute.internal,16020,1472451821818_ChoreService_1] regionserver.HRegionServer$PeriodicMemstoreFlusher: Chore: ip-172-31-103-48.us-west-2.compute.internal,16020,1472451821818-MemstoreFlusherChore was stopped 2016-08-29 07:10:24,783 INFO [MemStoreFlusher.0] regionserver.MemStoreFlusher: MemStoreFlusher.0 exiting 2016-08-29 07:10:25,524 WARN [regionserver/ip-172-31-103-48.us-west-2.compute.internal/172.31.103.48:16020-SendThread(ip-172-31-103-112.us-west-2.compute.internal:2181)] zookeeper.ClientCnxn: Session 0x156d486e2120012 for server null, unexpected error, closing socket connection and attempting reconnect java.net.NoRouteToHostException: No route to host at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method) at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:717) at org.apache.zookeeper.ClientCnxnSocketNIO.doTransport(ClientCnxnSocketNIO.java:361) at org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:1125) 2016-08-29 07:10:25,524 WARN [main-SendThread(172.31.103.112:2181)] zookeeper.ClientCnxn: Session 0x356d4878aa0001a for server null, unexpected error, closing socket connection and attempting reconnect java.net.NoRouteToHostException: No route to host at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method) at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:717) at org.apache.zookeeper.ClientCnxnSocketNIO.doTransport(ClientCnxnSocketNIO.java:361) at org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:1125)
2. The YARN NodeManager dies less frequently, but with similar network connection errors:
2016-08-29 15:47:13,951 FATAL nodemanager.NodeManager (NodeManager.java:run(360)) - Error while rebooting NodeStatusUpdater. org.apache.hadoop.yarn.exceptions.YarnRuntimeException: java.net.NoRouteToHostException: No Route to Host from java.net.UnknownHostException: ip-172-31-103-48: ip-172-31-103-48: unknown error to ip-172-31-103-112.us-west-2.compute.internal:8031 failed on socket timeout exception: java.net.NoRouteToHostException: No route to host; For more details see: http://wiki.apache.org/hadoop/NoRouteToHost at org.apache.hadoop.yarn.server.nodemanager.NodeStatusUpdaterImpl.rebootNodeStatusUpdaterAndRegisterWithRM(NodeStatusUpdaterImpl.java:254) at org.apache.hadoop.yarn.server.nodemanager.NodeManager$2.run(NodeManager.java:357) Caused by: java.net.NoRouteToHostException: No Route to Host from java.net.UnknownHostException: ip-172-31-103-48: ip-172-31-103-48: unknown error to ip-172-31-103-112.us-west-2.compute.internal:8031 failed on socket timeout exception: java.net.NoRouteToHostException: No route to host; For more details see: http://wiki.apache.org/hadoop/NoRouteToHost at sun.reflect.GeneratedConstructorAccessor34.newInstance(Unknown Source) at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45) at java.lang.reflect.Constructor.newInstance(Constructor.java:422) at org.apache.hadoop.net.NetUtils.wrapWithMessage(NetUtils.java:801) at org.apache.hadoop.net.NetUtils.wrapException(NetUtils.java:758) at org.apache.hadoop.ipc.Client.call(Client.java:1430) at org.apache.hadoop.ipc.Client.call(Client.java:1363) at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:229) at com.sun.proxy.$Proxy82.registerNodeManager(Unknown Source) at org.apache.hadoop.yarn.server.api.impl.pb.client.ResourceTrackerPBClientImpl.registerNodeManager(ResourceTrackerPBClientImpl.java:68) at sun.reflect.GeneratedMethodAccessor35.invoke(Unknown Source) at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:497) at org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:256) at org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:104) at com.sun.proxy.$Proxy83.registerNodeManager(Unknown Source) at org.apache.hadoop.yarn.server.nodemanager.NodeStatusUpdaterImpl.registerWithRM(NodeStatusUpdaterImpl.java:296) at org.apache.hadoop.yarn.server.nodemanager.NodeStatusUpdaterImpl.rebootNodeStatusUpdaterAndRegisterWithRM(NodeStatusUpdaterImpl.java:246) ... 1 more Caused by: java.net.NoRouteToHostException: No route to host at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method) at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:717) at org.apache.hadoop.net.SocketIOWithTimeout.connect(SocketIOWithTimeout.java:206) at org.apache.hadoop.net.NetUtils.connect(NetUtils.java:531) at org.apache.hadoop.net.NetUtils.connect(NetUtils.java:495) at org.apache.hadoop.ipc.Client$Connection.setupConnection(Client.java:617) at org.apache.hadoop.ipc.Client$Connection.setupIOstreams(Client.java:715) at org.apache.hadoop.ipc.Client$Connection.access$2900(Client.java:378) at org.apache.hadoop.ipc.Client.getConnection(Client.java:1492) at org.apache.hadoop.ipc.Client.call(Client.java:1402) ... 13 more 2016-08-29 15:47:13,961 INFO mortbay.log (Slf4jLog.java:info(67)) - Stopped HttpServer2$SelectChannelConnectorWithSafeStartup@0.0.0.0:8042
3. The node sometimes becomes unresponsive and you can't log in to it. This happens mostly after a region server failure, and the node keeps going up and down even though the region server is no longer running. The AWS instance status check fails. The node comes back after a few minutes to a few hours.
There is surely some network misconfiguration in the cluster, but the above issues happen on only a few machines. The cluster is running in a VPC. ulimit and nproc are set to 32768 and 65536 respectively. Most host metrics look normal.
Any ideas on debugging this would be greatly appreciated.
Thanks
Created 09-01-2016 07:58 PM
The cause was DHCP failing to see a response from the DHCP server for periods of time. The d2 Ubuntu (14.04) instances were using Enhanced Networking with the "ixgbevf" driver at version 2.11.3-k, which is below the minimum recommended version 2.14.2 and should be upgraded to 2.16.4. We upgraded the driver to the latest version, which seems to have fixed the issue.
Reference: https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/sriov-networking.html#enhanced-networking-ubuntu
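For anyone wanting to check other nodes for the same condition, here is a minimal sketch (not the exact procedure we used) that parses the text printed by `ethtool -i <interface>` and flags an "ixgbevf" driver older than the 2.14.2 minimum mentioned above. The function names are made up for illustration, and it assumes the version string starts with three dot-separated numbers, as "2.11.3-k" does.

```python
import re

# Minimum recommended ixgbevf version quoted in the post above.
MIN_RECOMMENDED = (2, 14, 2)


def parse_driver_info(ethtool_output):
    """Parse the 'key: value' lines printed by `ethtool -i <iface>` into a dict."""
    info = {}
    for line in ethtool_output.splitlines():
        key, sep, value = line.partition(":")
        if sep:
            info[key.strip()] = value.strip()
    return info


def version_tuple(version):
    """Turn a string such as '2.11.3-k' into (2, 11, 3), ignoring any suffix."""
    match = re.match(r"(\d+)\.(\d+)\.(\d+)", version)
    if match is None:
        raise ValueError("unrecognised version string: %r" % version)
    return tuple(int(part) for part in match.groups())


def driver_is_outdated(ethtool_output, minimum=MIN_RECOMMENDED):
    """Return True if the interface uses ixgbevf at a version below `minimum`."""
    info = parse_driver_info(ethtool_output)
    if info.get("driver") != "ixgbevf":
        return False
    return version_tuple(info["version"]) < minimum
```

On a node, the input could come from something like `subprocess.run(["ethtool", "-i", "eth0"], capture_output=True, text=True).stdout`, where `eth0` is only an assumed interface name.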
Created 08-30-2016 02:53 PM
w.r.t. AWS instance check failure, have you looked at
http://docs.aws.amazon.com/AWSEC2/latest/UserGuide/TroubleshootingInstances.html
Created 08-30-2016 03:52 PM
NoRouteToHostException occurred in both region server and node manager logs.
Please check network connectivity.
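One rough way to do that from the affected data node is a small TCP probe against the endpoints that appear in the logs (the three ZooKeeper servers on port 2181 and the ResourceManager on port 8031). This is only a sketch with illustrative helper names; a successful connect does not prove the network is healthy, but intermittent failures here would line up with the NoRouteToHostException entries.

```python
import socket


def can_connect(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within `timeout` seconds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers "no route to host", refused connections, timeouts and DNS failures.
        return False


def probe(endpoints, timeout=3.0):
    """Map 'host:port' to True/False for each (host, port) pair in `endpoints`."""
    return {
        "%s:%d" % (host, port): can_connect(host, port, timeout)
        for host, port in endpoints
    }


# Endpoints taken from the log excerpts in the question.
ENDPOINTS = [
    ("ip-172-31-103-112.us-west-2.compute.internal", 2181),
    ("ip-172-31-103-171.us-west-2.compute.internal", 2181),
    ("ip-172-31-103-252.us-west-2.compute.internal", 2181),
    ("ip-172-31-103-112.us-west-2.compute.internal", 8031),
]
```

Calling `probe(ENDPOINTS)` periodically (for example from cron) and logging the result with a timestamp would show whether the outages hit all masters at once, which points at the data node's own network stack rather than at a single master.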