How to debug cluster network issues?

We have an HDP 2.4 cluster running HDFS, YARN, and HBase on 3 master nodes and 4 data nodes.

Each data node hosts an HBase RegionServer (8 GB heap), an HDFS DataNode, and a YARN NodeManager. Each data node is an Amazon d2.xlarge instance.

All master nodes run ZooKeeper. The other master processes are the HDFS (HA), HBase, and YARN (HA) masters. Each master node is an Amazon r3.xlarge instance.

We see the following problems on two of our data nodes, while the other nodes function properly. Note that no MapReduce or YARN jobs are running when this happens:

1. The RegionServer occasionally dies with ZooKeeper session timeout exceptions:

2016-08-29 07:08:50,713 INFO  [regionserver/ip-172-31-103-48.us-west-2.compute.internal/172.31.103.48:16020-SendThread(ip-172-31-103-112.us-west-2.compute.internal:2181)] zookeeper.ClientCnxn: Client session timed out, have not heard from server in 600097ms for sessionid 0x156d486e2120012, closing socket connection and attempting reconnect
2016-08-29 07:09:00,955 INFO  [regionserver/ip-172-31-103-48.us-west-2.compute.internal/172.31.103.48:16020-SendThread(172.31.103.252:2181)] zookeeper.ClientCnxn: Opening socket connection to server 172.31.103.252/172.31.103.252:2181. Will not attempt to authenticate using SASL (unknown error)
2016-08-29 07:09:01,824 WARN  [timeline] timeline.HadoopTimelineMetricsSink: Unable to send metrics to collector by address:http://ip-172-31-103-252.us-west-2.compute.internal:6188/ws/v1/timeline/metrics
2016-08-29 07:09:01,824 WARN  [timeline] timeline.HadoopTimelineMetricsSink: Unable to send metrics to collector by address:http://ip-172-31-103-252.us-west-2.compute.internal:6188/ws/v1/timeline/metrics
2016-08-29 07:09:01,825 WARN  [timeline] timeline.HadoopTimelineMetricsSink: Unable to send metrics to collector by address:http://ip-172-31-103-252.us-west-2.compute.internal:6188/ws/v1/timeline/metrics
2016-08-29 07:09:01,825 WARN  [timeline] timeline.HadoopTimelineMetricsSink: Unable to send metrics to collector by address:http://ip-172-31-103-252.us-west-2.compute.internal:6188/ws/v1/timeline/metrics
2016-08-29 07:09:01,825 WARN  [timeline] timeline.HadoopTimelineMetricsSink: Unable to send metrics to collector by address:http://ip-172-31-103-252.us-west-2.compute.internal:6188/ws/v1/timeline/metrics
2016-08-29 07:09:01,825 WARN  [timeline] timeline.HadoopTimelineMetricsSink: Unable to send metrics to collector by address:http://ip-172-31-103-252.us-west-2.compute.internal:6188/ws/v1/timeline/metrics
2016-08-29 07:09:01,826 WARN  [timeline] timeline.HadoopTimelineMetricsSink: Unable to send metrics to collector by address:http://ip-172-31-103-252.us-west-2.compute.internal:6188/ws/v1/timeline/metrics



2016-08-29 07:09:03,952 WARN  [regionserver/ip-172-31-103-48.us-west-2.compute.internal/172.31.103.48:16020-SendThread(172.31.103.252:2181)] zookeeper.ClientCnxn: Session 0x156d486e2120012 for server null, unexpected error, closing socket connection and attempting reconnect
java.net.NoRouteToHostException: No route to host
 at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
 at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:717)
 at org.apache.zookeeper.ClientCnxnSocketNIO.doTransport(ClientCnxnSocketNIO.java:361)
 at org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:1125)
2016-08-29 07:09:14,960 INFO  [regionserver/ip-172-31-103-48.us-west-2.compute.internal/172.31.103.48:16020-SendThread(172.31.103.171:2181)] zookeeper.ClientCnxn: Opening socket connection to server 172.31.103.171/172.31.103.171:2181. Will not attempt to authenticate using SASL (unknown error)
2016-08-29 07:09:16,808 WARN  [regionserver/ip-172-31-103-48.us-west-2.compute.internal/172.31.103.48:16020-SendThread(172.31.103.171:2181)] zookeeper.ClientCnxn: Session 0x156d486e2120012 for server null, unexpected error, closing socket connection and attempting reconnect
java.net.NoRouteToHostException: No route to host
 at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
 at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:717)
 at org.apache.zookeeper.ClientCnxnSocketNIO.doTransport(ClientCnxnSocketNIO.java:361)
 at org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:1125)
2016-08-29 07:09:18,061 INFO  [regionserver/ip-172-31-103-48.us-west-2.compute.internal/172.31.103.48:16020-SendThread(ip-172-31-103-112.us-west-2.compute.internal:2181)] zookeeper.ClientCnxn: Opening socket connection to server ip-172-31-103-112.us-west-2.compute.internal/172.31.103.112:2181. Will not attempt to authenticate using SASL (unknown error)
2016-08-29 07:09:21,060 WARN  [regionserver/ip-172-31-103-48.us-west-2.compute.internal/172.31.103.48:16020-SendThread(ip-172-31-103-112.us-west-2.compute.internal:2181)] zookeeper.ClientCnxn: Session 0x156d486e2120012 for server null, unexpected error, closing socket connection and attempting reconnect
java.net.NoRouteToHostException: No route to host
 at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
 at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:717)
 at org.apache.zookeeper.ClientCnxnSocketNIO.doTransport(ClientCnxnSocketNIO.java:361)
 at org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:1125)
2016-08-29 07:09:21,399 INFO  [regionserver/ip-172-31-103-48.us-west-2.compute.internal/172.31.103.48:16020-SendThread(172.31.103.252:2181)] zookeeper.ClientCnxn: Opening socket connection to server 172.31.103.252/172.31.103.252:2181. Will not attempt to authenticate using SASL (unknown error)
2016-08-29 07:09:23,640 WARN  [regionserver/ip-172-31-103-48.us-west-2.compute.internal/172.31.103.48:16020-SendThread(172.31.103.252:2181)] zookeeper.ClientCnxn: Session 0x156d486e2120012 for server null, unexpected error, closing socket connection and attempting reconnect
java.net.NoRouteToHostException: No route to host
 at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
 at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:717)
 at org.apache.zookeeper.ClientCnxnSocketNIO.doTransport(ClientCnxnSocketNIO.java:361)
 at org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:1125)
2016-08-29 07:09:24,182 INFO  [regionserver/ip-172-31-103-48.us-west-2.compute.internal/172.31.103.48:16020-SendThread(172.31.103.171:2181)] zookeeper.ClientCnxn: Opening socket connection to server 172.31.103.171/172.31.103.171:2181. Will not attempt to authenticate using SASL (unknown error)
2016-08-29 07:09:27,180 WARN  [regionserver/ip-172-31-103-48.us-west-2.compute.internal/172.31.103.48:16020-SendThread(172.31.103.171:2181)] zookeeper.ClientCnxn: Session 0x156d486e2120012 for server null, unexpected error, closing socket connection and attempting reconnect
java.net.NoRouteToHostException: No route to host
 at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
 at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:717)
 at org.apache.zookeeper.ClientCnxnSocketNIO.doTransport(ClientCnxnSocketNIO.java:361)
 at org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:1125)
2016-08-29 07:09:28,949 INFO  [regionserver/ip-172-31-103-48.us-west-2.compute.internal/172.31.103.48:16020-SendThread(ip-172-31-103-112.us-west-2.compute.internal:2181)] zookeeper.ClientCnxn: Opening socket connection to server ip-172-31-103-112.us-west-2.compute.internal/172.31.103.112:2181. Will not attempt to authenticate using SASL (unknown error)
2016-08-29 07:09:31,948 WARN  [regionserver/ip-172-31-103-48.us-west-2.compute.internal/172.31.103.48:16020-SendThread(ip-172-31-103-112.us-west-2.compute.internal:2181)] zookeeper.ClientCnxn: Session 0x156d486e2120012 for server null, unexpected error, closing socket connection and attempting reconnect
java.net.NoRouteToHostException: No route to host
 at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
 at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:717)
 at org.apache.zookeeper.ClientCnxnSocketNIO.doTransport(ClientCnxnSocketNIO.java:361)
 at org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:1125)
2016-08-29 07:09:32,446 INFO  [regionserver/ip-172-31-103-48.us-west-2.compute.internal/172.31.103.48:16020-SendThread(172.31.103.252:2181)] zookeeper.ClientCnxn: Opening socket connection to server 172.31.103.252/172.31.103.252:2181. Will not attempt to authenticate using SASL (unknown error)
2016-08-29 07:09:35,444 WARN  [regionserver/ip-172-31-103-48.us-west-2.compute.internal/172.31.103.48:16020-SendThread(172.31.103.252:2181)] zookeeper.ClientCnxn: Session 0x156d486e2120012 for server null, unexpected error, closing socket connection and attempting reconnect
java.net.NoRouteToHostException: No route to host
 at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
 at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:717)
 at org.apache.zookeeper.ClientCnxnSocketNIO.doTransport(ClientCnxnSocketNIO.java:361)
 at org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:1125)
2016-08-29 07:09:36,208 INFO  [regionserver/ip-172-31-103-48.us-west-2.compute.internal/172.31.103.48:16020-SendThread(172.31.103.171:2181)] zookeeper.ClientCnxn: Opening socket connection to server 172.31.103.171/172.31.103.171:2181. Will not attempt to authenticate using SASL (unknown error)
2016-08-29 07:09:39,024 INFO  [main-SendThread(ip-172-31-103-252.us-west-2.compute.internal:2181)] zookeeper.ClientCnxn: Client session timed out, have not heard from server in 600081ms for sessionid 0x356d4878aa0001a, closing socket connection and attempting reconnect
2016-08-29 07:09:39,125 WARN  [ReplicationExecutor-0] zookeeper.RecoverableZooKeeper: Possibly transient ZooKeeper, quorum=ip-172-31-103-112.us-west-2.compute.internal:2181,ip-172-31-103-171.us-west-2.compute.internal:2181,ip-172-31-103-252.us-west-2.compute.internal:2181, exception=org.apache.zookeeper.KeeperException$ConnectionLossException: KeeperErrorCode = ConnectionLoss for /hbase-unsecure/replication/rs/ip-172-31-103-124.us-west-2.compute.internal,16020,1472451828166
2016-08-29 07:09:39,125 WARN  [PriorityRpcServer.handler=4,queue=0,port=16020] zookeeper.RecoverableZooKeeper: Possibly transient ZooKeeper, quorum=ip-172-31-103-112.us-west-2.compute.internal:2181,ip-172-31-103-171.us-west-2.compute.internal:2181,ip-172-31-103-252.us-west-2.compute.internal:2181, exception=org.apache.zookeeper.KeeperException$ConnectionLossException: KeeperErrorCode = ConnectionLoss for /hbase-unsecure/recovering-regions/0e281d12463252983d18abbe9e096fbd
2016-08-29 07:09:39,125 WARN  [RS_OPEN_REGION-ip-172-31-103-48:16020-0] zookeeper.RecoverableZooKeeper: Possibly transient ZooKeeper, quorum=ip-172-31-103-112.us-west-2.compute.internal:2181,ip-172-31-103-171.us-west-2.compute.internal:2181,ip-172-31-103-252.us-west-2.compute.internal:2181, exception=org.apache.zookeeper.KeeperException$ConnectionLossException: KeeperErrorCode = ConnectionLoss for /hbase-unsecure/region-in-transition/4ef6634b001b40cd44c40c8406d6d389
2016-08-29 07:09:39,208 WARN  [regionserver/ip-172-31-103-48.us-west-2.compute.internal/172.31.103.48:16020-SendThread(172.31.103.171:2181)] zookeeper.ClientCnxn: Session 0x156d486e2120012 for server null, unexpected error, closing socket connection and attempting reconnect
java.net.NoRouteToHostException: No route to host
 at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
 at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:717)
 at org.apache.zookeeper.ClientCnxnSocketNIO.doTransport(ClientCnxnSocketNIO.java:361)
 at org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:1125)
2016-08-29 07:09:40,409 INFO  [regionserver/ip-172-31-103-48.us-west-2.compute.internal/172.31.103.48:16020-SendThread(ip-172-31-103-112.us-west-2.compute.internal:2181)] zookeeper.ClientCnxn: Opening socket connection to server ip-172-31-103-112.us-west-2.compute.internal/172.31.103.112:2181. Will not attempt to authenticate using SASL (unknown error)
2016-08-29 07:09:43,408 WARN  [regionserver/ip-172-31-103-48.us-west-2.compute.internal/172.31.103.48:16020-SendThread(ip-172-31-103-112.us-west-2.compute.internal:2181)] zookeeper.ClientCnxn: Session 0x156d486e2120012 for server null, unexpected error, closing socket connection and attempting reconnect
java.net.NoRouteToHostException: No route to host
 at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
 at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:717)
 at org.apache.zookeeper.ClientCnxnSocketNIO.doTransport(ClientCnxnSocketNIO.java:361)
 at org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:1125)
2016-08-29 07:09:44,155 INFO  [regionserver/ip-172-31-103-48.us-west-2.compute.internal/172.31.103.48:16020-SendThread(172.31.103.252:2181)] zookeeper.ClientCnxn: Opening socket connection to server 172.31.103.252/172.31.103.252:2181. Will not attempt to authenticate using SASL (unknown error)
2016-08-29 07:09:47,156 WARN  [regionserver/ip-172-31-103-48.us-west-2.compute.internal/172.31.103.48:16020-SendThread(172.31.103.252:2181)] zookeeper.ClientCnxn: Session 0x156d486e2120012 for server null, unexpected error, closing socket connection and attempting reconnect
java.net.NoRouteToHostException: No route to host
 at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
 at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:717)
 at org.apache.zookeeper.ClientCnxnSocketNIO.doTransport(ClientCnxnSocketNIO.java:361)
 at org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:1125)
2016-08-29 07:09:47,974 INFO  [regionserver/ip-172-31-103-48.us-west-2.compute.internal/172.31.103.48:16020-SendThread(172.31.103.171:2181)] zookeeper.ClientCnxn: Opening socket connection to server 172.31.103.171/172.31.103.171:2181. Will not attempt to authenticate using SASL (unknown error)
2016-08-29 07:09:49,576 WARN  [regionserver/ip-172-31-103-48.us-west-2.compute.internal/172.31.103.48:16020-SendThread(172.31.103.171:2181)] zookeeper.ClientCnxn: Session 0x156d486e2120012 for server null, unexpected error, closing socket connection and attempting reconnect
java.net.NoRouteToHostException: No route to host
 at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
 at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:717)
 at org.apache.zookeeper.ClientCnxnSocketNIO.doTransport(ClientCnxnSocketNIO.java:361)
 at org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:1125)
2016-08-29 07:09:49,876 INFO  [main-SendThread(172.31.103.171:2181)] zookeeper.ClientCnxn: Opening socket connection to server 172.31.103.171/172.31.103.171:2181. Will not attempt to authenticate using SASL (unknown error)
2016-08-29 07:09:51,266 INFO  [regionserver/ip-172-31-103-48.us-west-2.compute.internal/172.31.103.48:16020-SendThread(ip-172-31-103-112.us-west-2.compute.internal:2181)] zookeeper.ClientCnxn: Opening socket connection to server ip-172-31-103-112.us-west-2.compute.internal/172.31.103.112:2181. Will not attempt to authenticate using SASL (unknown error)
2016-08-29 07:09:52,876 WARN  [main-SendThread(172.31.103.171:2181)] zookeeper.ClientCnxn: Session 0x356d4878aa0001a for server null, unexpected error, closing socket connection and attempting reconnect
java.net.NoRouteToHostException: No route to host
 at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
 at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:717)
 at org.apache.zookeeper.ClientCnxnSocketNIO.doTransport(ClientCnxnSocketNIO.java:361)
 at org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:1125)
2016-08-29 07:09:52,976 WARN  [RS_OPEN_REGION-ip-172-31-103-48:16020-0] zookeeper.RecoverableZooKeeper: Possibly transient ZooKeeper, quorum=ip-172-31-103-112.us-west-2.compute.internal:2181,ip-172-31-103-171.us-west-2.compute.internal:2181,ip-172-31-103-252.us-west-2.compute.internal:2181, exception=org.apache.zookeeper.KeeperException$ConnectionLossException: KeeperErrorCode = ConnectionLoss for /hbase-unsecure/region-in-transition/4ef6634b001b40cd44c40c8406d6d389
2016-08-29 07:09:52,976 WARN  [ReplicationExecutor-0] zookeeper.RecoverableZooKeeper: Possibly transient ZooKeeper, quorum=ip-172-31-103-112.us-west-2.compute.internal:2181,ip-172-31-103-171.us-west-2.compute.internal:2181,ip-172-31-103-252.us-west-2.compute.internal:2181, exception=org.apache.zookeeper.KeeperException$ConnectionLossException: KeeperErrorCode = ConnectionLoss for /hbase-unsecure/replication/rs/ip-172-31-103-124.us-west-2.compute.internal,16020,1472451828166
2016-08-29 07:09:52,976 WARN  [PriorityRpcServer.handler=4,queue=0,port=16020] zookeeper.RecoverableZooKeeper: Possibly transient ZooKeeper, quorum=ip-172-31-103-112.us-west-2.compute.internal:2181,ip-172-31-103-171.us-west-2.compute.internal:2181,ip-172-31-103-252.us-west-2.compute.internal:2181, exception=org.apache.zookeeper.KeeperException$ConnectionLossException: KeeperErrorCode = ConnectionLoss for /hbase-unsecure/recovering-regions/0e281d12463252983d18abbe9e096fbd
2016-08-29 07:09:54,264 WARN  [regionserver/ip-172-31-103-48.us-west-2.compute.internal/172.31.103.48:16020-SendThread(ip-172-31-103-112.us-west-2.compute.internal:2181)] zookeeper.ClientCnxn: Session 0x156d486e2120012 for server null, unexpected error, closing socket connection and attempting reconnect
java.net.NoRouteToHostException: No route to host
 at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
 at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:717)
 at org.apache.zookeeper.ClientCnxnSocketNIO.doTransport(ClientCnxnSocketNIO.java:361)
 at org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:1125)
2016-08-29 07:09:54,579 INFO  [regionserver/ip-172-31-103-48.us-west-2.compute.internal/172.31.103.48:16020-SendThread(172.31.103.252:2181)] zookeeper.ClientCnxn: Opening socket connection to server 172.31.103.252/172.31.103.252:2181. Will not attempt to authenticate using SASL (unknown error)
2016-08-29 07:09:57,580 WARN  [regionserver/ip-172-31-103-48.us-west-2.compute.internal/172.31.103.48:16020-SendThread(172.31.103.252:2181)] zookeeper.ClientCnxn: Session 0x156d486e2120012 for server null, unexpected error, closing socket connection and attempting reconnect
java.net.NoRouteToHostException: No route to host
 at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
 at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:717)
 at org.apache.zookeeper.ClientCnxnSocketNIO.doTransport(ClientCnxnSocketNIO.java:361)
 at org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:1125)
2016-08-29 07:09:58,048 INFO  [regionserver/ip-172-31-103-48.us-west-2.compute.internal/172.31.103.48:16020-SendThread(172.31.103.171:2181)] zookeeper.ClientCnxn: Opening socket connection to server 172.31.103.171/172.31.103.171:2181. Will not attempt to authenticate using SASL (unknown error)
2016-08-29 07:10:01,048 WARN  [regionserver/ip-172-31-103-48.us-west-2.compute.internal/172.31.103.48:16020-SendThread(172.31.103.171:2181)] zookeeper.ClientCnxn: Session 0x156d486e2120012 for server null, unexpected error, closing socket connection and attempting reconnect
java.net.NoRouteToHostException: No route to host
 at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
 at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:717)
 at org.apache.zookeeper.ClientCnxnSocketNIO.doTransport(ClientCnxnSocketNIO.java:361)
 at org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:1125)
2016-08-29 07:10:02,282 INFO  [regionserver/ip-172-31-103-48.us-west-2.compute.internal/172.31.103.48:16020-SendThread(ip-172-31-103-112.us-west-2.compute.internal:2181)] zookeeper.ClientCnxn: Opening socket connection to server ip-172-31-103-112.us-west-2.compute.internal/172.31.103.112:2181. Will not attempt to authenticate using SASL (unknown error)
2016-08-29 07:10:03,680 INFO  [main-SendThread(172.31.103.112:2181)] zookeeper.ClientCnxn: Opening socket connection to server 172.31.103.112/172.31.103.112:2181. Will not attempt to authenticate using SASL (unknown error)
2016-08-29 07:10:05,280 WARN  [main-SendThread(172.31.103.112:2181)] zookeeper.ClientCnxn: Session 0x356d4878aa0001a for server null, unexpected error, closing socket connection and attempting reconnect
java.net.NoRouteToHostException: No route to host
 at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
 at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:717)
 at org.apache.zookeeper.ClientCnxnSocketNIO.doTransport(ClientCnxnSocketNIO.java:361)
 at org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:1125)
2016-08-29 07:10:05,280 WARN  [regionserver/ip-172-31-103-48.us-west-2.compute.internal/172.31.103.48:16020-SendThread(ip-172-31-103-112.us-west-2.compute.internal:2181)] zookeeper.ClientCnxn: Session 0x156d486e2120012 for server null, unexpected error, closing socket connection and attempting reconnect
java.net.NoRouteToHostException: No route to host
 at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
 at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:717)
 at org.apache.zookeeper.ClientCnxnSocketNIO.doTransport(ClientCnxnSocketNIO.java:361)
 at org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:1125)
2016-08-29 07:10:05,380 WARN  [PriorityRpcServer.handler=4,queue=0,port=16020] zookeeper.RecoverableZooKeeper: Possibly transient ZooKeeper, quorum=ip-172-31-103-112.us-west-2.compute.internal:2181,ip-172-31-103-171.us-west-2.compute.internal:2181,ip-172-31-103-252.us-west-2.compute.internal:2181, exception=org.apache.zookeeper.KeeperException$ConnectionLossException: KeeperErrorCode = ConnectionLoss for /hbase-unsecure/recovering-regions/0e281d12463252983d18abbe9e096fbd
2016-08-29 07:10:05,380 WARN  [RS_OPEN_REGION-ip-172-31-103-48:16020-0] zookeeper.RecoverableZooKeeper: Possibly transient ZooKeeper, quorum=ip-172-31-103-112.us-west-2.compute.internal:2181,ip-172-31-103-171.us-west-2.compute.internal:2181,ip-172-31-103-252.us-west-2.compute.internal:2181, exception=org.apache.zookeeper.KeeperException$ConnectionLossException: KeeperErrorCode = ConnectionLoss for /hbase-unsecure/region-in-transition/4ef6634b001b40cd44c40c8406d6d389
2016-08-29 07:10:05,380 WARN  [ReplicationExecutor-0] zookeeper.RecoverableZooKeeper: Possibly transient ZooKeeper, quorum=ip-172-31-103-112.us-west-2.compute.internal:2181,ip-172-31-103-171.us-west-2.compute.internal:2181,ip-172-31-103-252.us-west-2.compute.internal:2181, exception=org.apache.zookeeper.KeeperException$ConnectionLossException: KeeperErrorCode = ConnectionLoss for /hbase-unsecure/replication/rs/ip-172-31-103-124.us-west-2.compute.internal,16020,1472451828166
2016-08-29 07:10:05,540 INFO  [regionserver/ip-172-31-103-48.us-west-2.compute.internal/172.31.103.48:16020-SendThread(172.31.103.252:2181)] zookeeper.ClientCnxn: Opening socket connection to server 172.31.103.252/172.31.103.252:2181. Will not attempt to authenticate using SASL (unknown error)
2016-08-29 07:10:06,459 INFO  [main-SendThread(ip-172-31-103-252.us-west-2.compute.internal:2181)] zookeeper.ClientCnxn: Opening socket connection to server ip-172-31-103-252.us-west-2.compute.internal/172.31.103.252:2181. Will not attempt to authenticate using SASL (unknown error)
2016-08-29 07:10:08,540 WARN  [main-SendThread(ip-172-31-103-252.us-west-2.compute.internal:2181)] zookeeper.ClientCnxn: Session 0x356d4878aa0001a for server null, unexpected error, closing socket connection and attempting reconnect
java.net.NoRouteToHostException: No route to host
 at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
 at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:717)
 at org.apache.zookeeper.ClientCnxnSocketNIO.doTransport(ClientCnxnSocketNIO.java:361)
 at org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:1125)
2016-08-29 07:10:08,540 WARN  [regionserver/ip-172-31-103-48.us-west-2.compute.internal/172.31.103.48:16020-SendThread(172.31.103.252:2181)] zookeeper.ClientCnxn: Session 0x156d486e2120012 for server null, unexpected error, closing socket connection and attempting reconnect
java.net.NoRouteToHostException: No route to host
 at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
 at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:717)
 at org.apache.zookeeper.ClientCnxnSocketNIO.doTransport(ClientCnxnSocketNIO.java:361)
 at org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:1125)
2016-08-29 07:10:09,000 INFO  [regionserver/ip-172-31-103-48.us-west-2.compute.internal/172.31.103.48:16020-SendThread(172.31.103.171:2181)] zookeeper.ClientCnxn: Opening socket connection to server 172.31.103.171/172.31.103.171:2181. Will not attempt to authenticate using SASL (unknown error)
2016-08-29 07:10:09,187 INFO  [main-SendThread(172.31.103.171:2181)] zookeeper.ClientCnxn: Opening socket connection to server 172.31.103.171/172.31.103.171:2181. Will not attempt to authenticate using SASL (unknown error)
2016-08-29 07:10:12,000 WARN  [main-SendThread(172.31.103.171:2181)] zookeeper.ClientCnxn: Session 0x356d4878aa0001a for server null, unexpected error, closing socket connection and attempting reconnect
java.net.NoRouteToHostException: No route to host
 at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
 at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:717)
 at org.apache.zookeeper.ClientCnxnSocketNIO.doTransport(ClientCnxnSocketNIO.java:361)
 at org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:1125)
2016-08-29 07:10:12,000 WARN  [regionserver/ip-172-31-103-48.us-west-2.compute.internal/172.31.103.48:16020-SendThread(172.31.103.171:2181)] zookeeper.ClientCnxn: Session 0x156d486e2120012 for server null, unexpected error, closing socket connection and attempting reconnect
java.net.NoRouteToHostException: No route to host
 at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
 at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:717)
 at org.apache.zookeeper.ClientCnxnSocketNIO.doTransport(ClientCnxnSocketNIO.java:361)
 at org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:1125)
2016-08-29 07:10:12,101 WARN  [ReplicationExecutor-0] zookeeper.RecoverableZooKeeper: Possibly transient ZooKeeper, quorum=ip-172-31-103-112.us-west-2.compute.internal:2181,ip-172-31-103-171.us-west-2.compute.internal:2181,ip-172-31-103-252.us-west-2.compute.internal:2181, exception=org.apache.zookeeper.KeeperException$ConnectionLossException: KeeperErrorCode = ConnectionLoss for /hbase-unsecure/replication/rs/ip-172-31-103-124.us-west-2.compute.internal,16020,1472451828166
2016-08-29 07:10:12,101 WARN  [PriorityRpcServer.handler=4,queue=0,port=16020] zookeeper.RecoverableZooKeeper: Possibly transient ZooKeeper, quorum=ip-172-31-103-112.us-west-2.compute.internal:2181,ip-172-31-103-171.us-west-2.compute.internal:2181,ip-172-31-103-252.us-west-2.compute.internal:2181, exception=org.apache.zookeeper.KeeperException$ConnectionLossException: KeeperErrorCode = ConnectionLoss for /hbase-unsecure/recovering-regions/0e281d12463252983d18abbe9e096fbd
2016-08-29 07:10:12,101 WARN  [RS_OPEN_REGION-ip-172-31-103-48:16020-0] zookeeper.RecoverableZooKeeper: Possibly transient ZooKeeper, quorum=ip-172-31-103-112.us-west-2.compute.internal:2181,ip-172-31-103-171.us-west-2.compute.internal:2181,ip-172-31-103-252.us-west-2.compute.internal:2181, exception=org.apache.zookeeper.KeeperException$ConnectionLossException: KeeperErrorCode = ConnectionLoss for /hbase-unsecure/region-in-transition/4ef6634b001b40cd44c40c8406d6d389
2016-08-29 07:10:12,295 INFO  [main-SendThread(172.31.103.112:2181)] zookeeper.ClientCnxn: Opening socket connection to server 172.31.103.112/172.31.103.112:2181. Will not attempt to authenticate using SASL (unknown error)
2016-08-29 07:10:13,750 INFO  [regionserver/ip-172-31-103-48.us-west-2.compute.internal/172.31.103.48:16020-SendThread(ip-172-31-103-112.us-west-2.compute.internal:2181)] zookeeper.ClientCnxn: Opening socket connection to server ip-172-31-103-112.us-west-2.compute.internal/172.31.103.112:2181. Will not attempt to authenticate using SASL (unknown error)
2016-08-29 07:10:15,292 WARN  [main-SendThread(172.31.103.112:2181)] zookeeper.ClientCnxn: Session 0x356d4878aa0001a for server null, unexpected error, closing socket connection and attempting reconnect
java.net.NoRouteToHostException: No route to host
 at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
 at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:717)
 at org.apache.zookeeper.ClientCnxnSocketNIO.doTransport(ClientCnxnSocketNIO.java:361)
 at org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:1125)
2016-08-29 07:10:15,292 WARN  [regionserver/ip-172-31-103-48.us-west-2.compute.internal/172.31.103.48:16020-SendThread(ip-172-31-103-112.us-west-2.compute.internal:2181)] zookeeper.ClientCnxn: Session 0x156d486e2120012 for server null, unexpected error, closing socket connection and attempting reconnect
java.net.NoRouteToHostException: No route to host
 at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
 at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:717)
 at org.apache.zookeeper.ClientCnxnSocketNIO.doTransport(ClientCnxnSocketNIO.java:361)
 at org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:1125)
2016-08-29 07:10:15,654 INFO  [regionserver/ip-172-31-103-48.us-west-2.compute.internal/172.31.103.48:16020-SendThread(172.31.103.252:2181)] zookeeper.ClientCnxn: Opening socket connection to server 172.31.103.252/172.31.103.252:2181. Will not attempt to authenticate using SASL (unknown error)
2016-08-29 07:10:17,322 INFO  [main-SendThread(ip-172-31-103-252.us-west-2.compute.internal:2181)] zookeeper.ClientCnxn: Opening socket connection to server ip-172-31-103-252.us-west-2.compute.internal/172.31.103.252:2181. Will not attempt to authenticate using SASL (unknown error)
2016-08-29 07:10:18,652 WARN  [main-SendThread(ip-172-31-103-252.us-west-2.compute.internal:2181)] zookeeper.ClientCnxn: Session 0x356d4878aa0001a for server null, unexpected error, closing socket connection and attempting reconnect
java.net.NoRouteToHostException: No route to host
 at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
 at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:717)
 at org.apache.zookeeper.ClientCnxnSocketNIO.doTransport(ClientCnxnSocketNIO.java:361)
 at org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:1125)
2016-08-29 07:10:18,652 WARN  [regionserver/ip-172-31-103-48.us-west-2.compute.internal/172.31.103.48:16020-SendThread(172.31.103.252:2181)] zookeeper.ClientCnxn: Session 0x156d486e2120012 for server null, unexpected error, closing socket connection and attempting reconnect
java.net.NoRouteToHostException: No route to host
 at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
 at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:717)
 at org.apache.zookeeper.ClientCnxnSocketNIO.doTransport(ClientCnxnSocketNIO.java:361)
 at org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:1125)
2016-08-29 07:10:19,095 INFO  [main-SendThread(172.31.103.171:2181)] zookeeper.ClientCnxn: Opening socket connection to server 172.31.103.171/172.31.103.171:2181. Will not attempt to authenticate using SASL (unknown error)
2016-08-29 07:10:19,620 INFO  [regionserver/ip-172-31-103-48.us-west-2.compute.internal/172.31.103.48:16020-SendThread(172.31.103.171:2181)] zookeeper.ClientCnxn: Opening socket connection to server 172.31.103.171/172.31.103.171:2181. Will not attempt to authenticate using SASL (unknown error)
2016-08-29 07:10:22,096 WARN  [main-SendThread(172.31.103.171:2181)] zookeeper.ClientCnxn: Session 0x356d4878aa0001a for server null, unexpected error, closing socket connection and attempting reconnect
java.net.NoRouteToHostException: No route to host
 at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
 at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:717)
 at org.apache.zookeeper.ClientCnxnSocketNIO.doTransport(ClientCnxnSocketNIO.java:361)
 at org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:1125)
2016-08-29 07:10:22,096 WARN  [regionserver/ip-172-31-103-48.us-west-2.compute.internal/172.31.103.48:16020-SendThread(172.31.103.171:2181)] zookeeper.ClientCnxn: Session 0x156d486e2120012 for server null, unexpected error, closing socket connection and attempting reconnect
java.net.NoRouteToHostException: No route to host
 at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
 at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:717)
 at org.apache.zookeeper.ClientCnxnSocketNIO.doTransport(ClientCnxnSocketNIO.java:361)
 at org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:1125)
2016-08-29 07:10:22,196 WARN  [PriorityRpcServer.handler=4,queue=0,port=16020] zookeeper.RecoverableZooKeeper: Possibly transient ZooKeeper, quorum=ip-172-31-103-112.us-west-2.compute.internal:2181,ip-172-31-103-171.us-west-2.compute.internal:2181,ip-172-31-103-252.us-west-2.compute.internal:2181, exception=org.apache.zookeeper.KeeperException$ConnectionLossException: KeeperErrorCode = ConnectionLoss for /hbase-unsecure/recovering-regions/0e281d12463252983d18abbe9e096fbd
2016-08-29 07:10:22,196 WARN  [ReplicationExecutor-0] zookeeper.RecoverableZooKeeper: Possibly transient ZooKeeper, quorum=ip-172-31-103-112.us-west-2.compute.internal:2181,ip-172-31-103-171.us-west-2.compute.internal:2181,ip-172-31-103-252.us-west-2.compute.internal:2181, exception=org.apache.zookeeper.KeeperException$ConnectionLossException: KeeperErrorCode = ConnectionLoss for /hbase-unsecure/replication/rs/ip-172-31-103-124.us-west-2.compute.internal,16020,1472451828166
2016-08-29 07:10:22,196 WARN  [RS_OPEN_REGION-ip-172-31-103-48:16020-0] zookeeper.RecoverableZooKeeper: Possibly transient ZooKeeper, quorum=ip-172-31-103-112.us-west-2.compute.internal:2181,ip-172-31-103-171.us-west-2.compute.internal:2181,ip-172-31-103-252.us-west-2.compute.internal:2181, exception=org.apache.zookeeper.KeeperException$ConnectionLossException: KeeperErrorCode = ConnectionLoss for /hbase-unsecure/region-in-transition/4ef6634b001b40cd44c40c8406d6d389
2016-08-29 07:10:22,196 ERROR [RS_OPEN_REGION-ip-172-31-103-48:16020-0] zookeeper.RecoverableZooKeeper: ZooKeeper getData failed after 4 attempts
2016-08-29 07:10:22,196 ERROR [PriorityRpcServer.handler=4,queue=0,port=16020] zookeeper.RecoverableZooKeeper: ZooKeeper getData failed after 4 attempts
2016-08-29 07:10:22,196 ERROR [ReplicationExecutor-0] zookeeper.RecoverableZooKeeper: ZooKeeper getChildren failed after 4 attempts
2016-08-29 07:10:22,196 WARN  [PriorityRpcServer.handler=4,queue=0,port=16020] zookeeper.ZKUtil: regionserver:16020-0x356d4878aa0001a, quorum=ip-172-31-103-112.us-west-2.compute.internal:2181,ip-172-31-103-171.us-west-2.compute.internal:2181,ip-172-31-103-252.us-west-2.compute.internal:2181, baseZNode=/hbase-unsecure Unable to get data of znode /hbase-unsecure/recovering-regions/0e281d12463252983d18abbe9e096fbd
org.apache.zookeeper.KeeperException$ConnectionLossException: KeeperErrorCode = ConnectionLoss for /hbase-unsecure/recovering-regions/0e281d12463252983d18abbe9e096fbd
 at org.apache.zookeeper.KeeperException.create(KeeperException.java:99)
 at org.apache.zookeeper.KeeperException.create(KeeperException.java:51)
 at org.apache.zookeeper.ZooKeeper.getData(ZooKeeper.java:1155)
 at org.apache.hadoop.hbase.zookeeper.RecoverableZooKeeper.getData(RecoverableZooKeeper.java:359)
 at org.apache.hadoop.hbase.zookeeper.ZKUtil.getDataInternal(ZKUtil.java:672)
 at org.apache.hadoop.hbase.zookeeper.ZKUtil.getDataAndWatch(ZKUtil.java:648)
 at org.apache.hadoop.hbase.zookeeper.ZKSplitLog.isRegionMarkedRecoveringInZK(ZKSplitLog.java:159)
 at org.apache.hadoop.hbase.regionserver.RSRpcServices.openRegion(RSRpcServices.java:1494)
 at org.apache.hadoop.hbase.protobuf.generated.AdminProtos$AdminService$2.callBlockingMethod(AdminProtos.java:22239)
 at org.apache.hadoop.hbase.ipc.RpcServer.call(RpcServer.java:2114)
 at org.apache.hadoop.hbase.ipc.CallRunner.run(CallRunner.java:101)
 at org.apache.hadoop.hbase.ipc.RpcExecutor.consumerLoop(RpcExecutor.java:130)
 at org.apache.hadoop.hbase.ipc.RpcExecutor$1.run(RpcExecutor.java:107)
 at java.lang.Thread.run(Thread.java:745)
2016-08-29 07:10:22,196 WARN  [RS_OPEN_REGION-ip-172-31-103-48:16020-0] zookeeper.ZKUtil: regionserver:16020-0x356d4878aa0001a, quorum=ip-172-31-103-112.us-west-2.compute.internal:2181,ip-172-31-103-171.us-west-2.compute.internal:2181,ip-172-31-103-252.us-west-2.compute.internal:2181, baseZNode=/hbase-unsecure Unable to get data of znode /hbase-unsecure/region-in-transition/4ef6634b001b40cd44c40c8406d6d389
org.apache.zookeeper.KeeperException$ConnectionLossException: KeeperErrorCode = ConnectionLoss for /hbase-unsecure/region-in-transition/4ef6634b001b40cd44c40c8406d6d389
 at org.apache.zookeeper.KeeperException.create(KeeperException.java:99)
 at org.apache.zookeeper.KeeperException.create(KeeperException.java:51)
 at org.apache.zookeeper.ZooKeeper.getData(ZooKeeper.java:1155)
 at org.apache.hadoop.hbase.zookeeper.RecoverableZooKeeper.getData(RecoverableZooKeeper.java:359)
 at org.apache.hadoop.hbase.zookeeper.ZKUtil.getDataNoWatch(ZKUtil.java:711)
 at org.apache.hadoop.hbase.zookeeper.ZKAssign.confirmNodeOpening(ZKAssign.java:652)
 at org.apache.hadoop.hbase.coordination.ZkOpenRegionCoordination.tickleOpening(ZkOpenRegionCoordination.java:160)
 at org.apache.hadoop.hbase.regionserver.handler.OpenRegionHandler$1.progress(OpenRegionHandler.java:371)
 at org.apache.hadoop.hbase.regionserver.HRegion.replayRecoveredEdits(HRegion.java:4189)
 at org.apache.hadoop.hbase.regionserver.HRegion.replayRecoveredEditsIfAny(HRegion.java:3953)
 at org.apache.hadoop.hbase.regionserver.HRegion.initializeRegionStores(HRegion.java:949)
 at org.apache.hadoop.hbase.regionserver.HRegion.initializeRegionInternals(HRegion.java:819)
 at org.apache.hadoop.hbase.regionserver.HRegion.initialize(HRegion.java:794)
 at org.apache.hadoop.hbase.regionserver.HRegion.openHRegion(HRegion.java:6328)
 at org.apache.hadoop.hbase.regionserver.HRegion.openHRegion(HRegion.java:6289)
 at org.apache.hadoop.hbase.regionserver.HRegion.openHRegion(HRegion.java:6260)
 at org.apache.hadoop.hbase.regionserver.HRegion.openHRegion(HRegion.java:6216)
 at org.apache.hadoop.hbase.regionserver.HRegion.openHRegion(HRegion.java:6167)
 at org.apache.hadoop.hbase.regionserver.handler.OpenRegionHandler.openRegion(OpenRegionHandler.java:362)
 at org.apache.hadoop.hbase.regionserver.handler.OpenRegionHandler.process(OpenRegionHandler.java:129)
 at org.apache.hadoop.hbase.executor.EventHandler.run(EventHandler.java:128)
 at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
 at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
 at java.lang.Thread.run(Thread.java:745)
2016-08-29 07:10:22,196 WARN  [ReplicationExecutor-0] replication.ReplicationQueuesZKImpl: Got exception in copyQueuesFromRSUsingMulti: 
org.apache.zookeeper.KeeperException$ConnectionLossException: KeeperErrorCode = ConnectionLoss for /hbase-unsecure/replication/rs/ip-172-31-103-124.us-west-2.compute.internal,16020,1472451828166
 at org.apache.zookeeper.KeeperException.create(KeeperException.java:99)
 at org.apache.zookeeper.KeeperException.create(KeeperException.java:51)
 at org.apache.zookeeper.ZooKeeper.getChildren(ZooKeeper.java:1472)
 at org.apache.hadoop.hbase.zookeeper.RecoverableZooKeeper.getChildren(RecoverableZooKeeper.java:295)
 at org.apache.hadoop.hbase.zookeeper.ZKUtil.listChildrenNoWatch(ZKUtil.java:511)
 at org.apache.hadoop.hbase.replication.ReplicationQueuesZKImpl.copyQueuesFromRSUsingMulti(ReplicationQueuesZKImpl.java:300)
 at org.apache.hadoop.hbase.replication.ReplicationQueuesZKImpl.claimQueues(ReplicationQueuesZKImpl.java:172)
 at org.apache.hadoop.hbase.replication.regionserver.ReplicationSourceManager$NodeFailoverWorker.run(ReplicationSourceManager.java:570)
 at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
 at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
 at java.lang.Thread.run(Thread.java:745)
2016-08-29 07:10:22,197 ERROR [RS_OPEN_REGION-ip-172-31-103-48:16020-0] zookeeper.ZooKeeperWatcher: regionserver:16020-0x356d4878aa0001a, quorum=ip-172-31-103-112.us-west-2.compute.internal:2181,ip-172-31-103-171.us-west-2.compute.internal:2181,ip-172-31-103-252.us-west-2.compute.internal:2181, baseZNode=/hbase-unsecure Received unexpected KeeperException, re-throwing exception
org.apache.zookeeper.KeeperException$ConnectionLossException: KeeperErrorCode = ConnectionLoss for /hbase-unsecure/region-in-transition/4ef6634b001b40cd44c40c8406d6d389
 at org.apache.zookeeper.KeeperException.create(KeeperException.java:99)
 at org.apache.zookeeper.KeeperException.create(KeeperException.java:51)
 at org.apache.zookeeper.ZooKeeper.getData(ZooKeeper.java:1155)
 at org.apache.hadoop.hbase.zookeeper.RecoverableZooKeeper.getData(RecoverableZooKeeper.java:359)
 at org.apache.hadoop.hbase.zookeeper.ZKUtil.getDataNoWatch(ZKUtil.java:711)
 at org.apache.hadoop.hbase.zookeeper.ZKAssign.confirmNodeOpening(ZKAssign.java:652)
 at org.apache.hadoop.hbase.coordination.ZkOpenRegionCoordination.tickleOpening(ZkOpenRegionCoordination.java:160)
 at org.apache.hadoop.hbase.regionserver.handler.OpenRegionHandler$1.progress(OpenRegionHandler.java:371)
 at org.apache.hadoop.hbase.regionserver.HRegion.replayRecoveredEdits(HRegion.java:4189)
 at org.apache.hadoop.hbase.regionserver.HRegion.replayRecoveredEditsIfAny(HRegion.java:3953)
 at org.apache.hadoop.hbase.regionserver.HRegion.initializeRegionStores(HRegion.java:949)
 at org.apache.hadoop.hbase.regionserver.HRegion.initializeRegionInternals(HRegion.java:819)
 at org.apache.hadoop.hbase.regionserver.HRegion.initialize(HRegion.java:794)
 at org.apache.hadoop.hbase.regionserver.HRegion.openHRegion(HRegion.java:6328)
 at org.apache.hadoop.hbase.regionserver.HRegion.openHRegion(HRegion.java:6289)
 at org.apache.hadoop.hbase.regionserver.HRegion.openHRegion(HRegion.java:6260)
 at org.apache.hadoop.hbase.regionserver.HRegion.openHRegion(HRegion.java:6216)
 at org.apache.hadoop.hbase.regionserver.HRegion.openHRegion(HRegion.java:6167)
 at org.apache.hadoop.hbase.regionserver.handler.OpenRegionHandler.openRegion(OpenRegionHandler.java:362)
 at org.apache.hadoop.hbase.regionserver.handler.OpenRegionHandler.process(OpenRegionHandler.java:129)
 at org.apache.hadoop.hbase.executor.EventHandler.run(EventHandler.java:128)
 at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
 at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
 at java.lang.Thread.run(Thread.java:745)
2016-08-29 07:10:22,197 ERROR [PriorityRpcServer.handler=4,queue=0,port=16020] zookeeper.ZooKeeperWatcher: regionserver:16020-0x356d4878aa0001a, quorum=ip-172-31-103-112.us-west-2.compute.internal:2181,ip-172-31-103-171.us-west-2.compute.internal:2181,ip-172-31-103-252.us-west-2.compute.internal:2181, baseZNode=/hbase-unsecure Received unexpected KeeperException, re-throwing exception
org.apache.zookeeper.KeeperException$ConnectionLossException: KeeperErrorCode = ConnectionLoss for /hbase-unsecure/recovering-regions/0e281d12463252983d18abbe9e096fbd
 at org.apache.zookeeper.KeeperException.create(KeeperException.java:99)
 at org.apache.zookeeper.KeeperException.create(KeeperException.java:51)
 at org.apache.zookeeper.ZooKeeper.getData(ZooKeeper.java:1155)
 at org.apache.hadoop.hbase.zookeeper.RecoverableZooKeeper.getData(RecoverableZooKeeper.java:359)
 at org.apache.hadoop.hbase.zookeeper.ZKUtil.getDataInternal(ZKUtil.java:672)
 at org.apache.hadoop.hbase.zookeeper.ZKUtil.getDataAndWatch(ZKUtil.java:648)
 at org.apache.hadoop.hbase.zookeeper.ZKSplitLog.isRegionMarkedRecoveringInZK(ZKSplitLog.java:159)
 at org.apache.hadoop.hbase.regionserver.RSRpcServices.openRegion(RSRpcServices.java:1494)
 at org.apache.hadoop.hbase.protobuf.generated.AdminProtos$AdminService$2.callBlockingMethod(AdminProtos.java:22239)
 at org.apache.hadoop.hbase.ipc.RpcServer.call(RpcServer.java:2114)
 at org.apache.hadoop.hbase.ipc.CallRunner.run(CallRunner.java:101)
 at org.apache.hadoop.hbase.ipc.RpcExecutor.consumerLoop(RpcExecutor.java:130)
 at org.apache.hadoop.hbase.ipc.RpcExecutor$1.run(RpcExecutor.java:107)
 at java.lang.Thread.run(Thread.java:745)
2016-08-29 07:10:22,197 FATAL [RS_OPEN_REGION-ip-172-31-103-48:16020-0] regionserver.HRegionServer: ABORTING region server ip-172-31-103-48.us-west-2.compute.internal,16020,1472451821818: Exception refreshing OPENING; region=4ef6634b001b40cd44c40c8406d6d389, context=open_region_progress
org.apache.zookeeper.KeeperException$ConnectionLossException: KeeperErrorCode = ConnectionLoss for /hbase-unsecure/region-in-transition/4ef6634b001b40cd44c40c8406d6d389
 at org.apache.zookeeper.KeeperException.create(KeeperException.java:99)
 at org.apache.zookeeper.KeeperException.create(KeeperException.java:51)
 at org.apache.zookeeper.ZooKeeper.getData(ZooKeeper.java:1155)
 at org.apache.hadoop.hbase.zookeeper.RecoverableZooKeeper.getData(RecoverableZooKeeper.java:359)
 at org.apache.hadoop.hbase.zookeeper.ZKUtil.getDataNoWatch(ZKUtil.java:711)
 at org.apache.hadoop.hbase.zookeeper.ZKAssign.confirmNodeOpening(ZKAssign.java:652)
 at org.apache.hadoop.hbase.coordination.ZkOpenRegionCoordination.tickleOpening(ZkOpenRegionCoordination.java:160)
 at org.apache.hadoop.hbase.regionserver.handler.OpenRegionHandler$1.progress(OpenRegionHandler.java:371)
 at org.apache.hadoop.hbase.regionserver.HRegion.replayRecoveredEdits(HRegion.java:4189)
 at org.apache.hadoop.hbase.regionserver.HRegion.replayRecoveredEditsIfAny(HRegion.java:3953)
 at org.apache.hadoop.hbase.regionserver.HRegion.initializeRegionStores(HRegion.java:949)
 at org.apache.hadoop.hbase.regionserver.HRegion.initializeRegionInternals(HRegion.java:819)
 at org.apache.hadoop.hbase.regionserver.HRegion.initialize(HRegion.java:794)
 at org.apache.hadoop.hbase.regionserver.HRegion.openHRegion(HRegion.java:6328)
 at org.apache.hadoop.hbase.regionserver.HRegion.openHRegion(HRegion.java:6289)
 at org.apache.hadoop.hbase.regionserver.HRegion.openHRegion(HRegion.java:6260)
 at org.apache.hadoop.hbase.regionserver.HRegion.openHRegion(HRegion.java:6216)
 at org.apache.hadoop.hbase.regionserver.HRegion.openHRegion(HRegion.java:6167)
 at org.apache.hadoop.hbase.regionserver.handler.OpenRegionHandler.openRegion(OpenRegionHandler.java:362)
 at org.apache.hadoop.hbase.regionserver.handler.OpenRegionHandler.process(OpenRegionHandler.java:129)
 at org.apache.hadoop.hbase.executor.EventHandler.run(EventHandler.java:128)
 at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
 at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
 at java.lang.Thread.run(Thread.java:745)
2016-08-29 07:10:22,197 ERROR [PriorityRpcServer.handler=4,queue=0,port=16020] regionserver.RSRpcServices: Can't retrieve recovering state from zookeeper
org.apache.zookeeper.KeeperException$ConnectionLossException: KeeperErrorCode = ConnectionLoss for /hbase-unsecure/recovering-regions/0e281d12463252983d18abbe9e096fbd
 at org.apache.zookeeper.KeeperException.create(KeeperException.java:99)
 at org.apache.zookeeper.KeeperException.create(KeeperException.java:51)
 at org.apache.zookeeper.ZooKeeper.getData(ZooKeeper.java:1155)
 at org.apache.hadoop.hbase.zookeeper.RecoverableZooKeeper.getData(RecoverableZooKeeper.java:359)
 at org.apache.hadoop.hbase.zookeeper.ZKUtil.getDataInternal(ZKUtil.java:672)
 at org.apache.hadoop.hbase.zookeeper.ZKUtil.getDataAndWatch(ZKUtil.java:648)
 at org.apache.hadoop.hbase.zookeeper.ZKSplitLog.isRegionMarkedRecoveringInZK(ZKSplitLog.java:159)
 at org.apache.hadoop.hbase.regionserver.RSRpcServices.openRegion(RSRpcServices.java:1494)
 at org.apache.hadoop.hbase.protobuf.generated.AdminProtos$AdminService$2.callBlockingMethod(AdminProtos.java:22239)
 at org.apache.hadoop.hbase.ipc.RpcServer.call(RpcServer.java:2114)
 at org.apache.hadoop.hbase.ipc.CallRunner.run(CallRunner.java:101)
 at org.apache.hadoop.hbase.ipc.RpcExecutor.consumerLoop(RpcExecutor.java:130)
 at org.apache.hadoop.hbase.ipc.RpcExecutor$1.run(RpcExecutor.java:107)
 at java.lang.Thread.run(Thread.java:745)
2016-08-29 07:10:22,198 FATAL [RS_OPEN_REGION-ip-172-31-103-48:16020-0] regionserver.HRegionServer: RegionServer abort: loaded coprocessors are: [org.apache.hadoop.hbase.security.access.SecureBulkLoadEndpoint]
2016-08-29 07:10:22,198 ERROR [PriorityRpcServer.handler=4,queue=0,port=16020] ipc.RpcServer: Unexpected throwable object 
org.apache.zookeeper.KeeperException$ConnectionLossException: KeeperErrorCode = ConnectionLoss for /hbase-unsecure/recovering-regions/0e281d12463252983d18abbe9e096fbd
 at org.apache.zookeeper.KeeperException.create(KeeperException.java:99)
 at org.apache.zookeeper.KeeperException.create(KeeperException.java:51)
 at org.apache.zookeeper.ZooKeeper.getData(ZooKeeper.java:1155)
 at org.apache.hadoop.hbase.zookeeper.RecoverableZooKeeper.getData(RecoverableZooKeeper.java:359)
 at org.apache.hadoop.hbase.zookeeper.ZKUtil.getDataInternal(ZKUtil.java:672)
 at org.apache.hadoop.hbase.zookeeper.ZKUtil.getDataAndWatch(ZKUtil.java:648)
 at org.apache.hadoop.hbase.zookeeper.ZKSplitLog.isRegionMarkedRecoveringInZK(ZKSplitLog.java:159)
 at org.apache.hadoop.hbase.regionserver.RSRpcServices.openRegion(RSRpcServices.java:1494)
 at org.apache.hadoop.hbase.protobuf.generated.AdminProtos$AdminService$2.callBlockingMethod(AdminProtos.java:22239)
 at org.apache.hadoop.hbase.ipc.RpcServer.call(RpcServer.java:2114)
 at org.apache.hadoop.hbase.ipc.CallRunner.run(CallRunner.java:101)
 at org.apache.hadoop.hbase.ipc.RpcExecutor.consumerLoop(RpcExecutor.java:130)
 at org.apache.hadoop.hbase.ipc.RpcExecutor$1.run(RpcExecutor.java:107)
 at java.lang.Thread.run(Thread.java:745)
2016-08-29 07:10:22,209 INFO  [RS_OPEN_REGION-ip-172-31-103-48:16020-0] regionserver.HRegionServer: Dump of metrics as JSON on abort: {
  "beans" : [ {
    "name" : "java.lang:type=Memory",
    "modelerType" : "sun.management.MemoryImpl",
    "Verbose" : true,
    "ObjectPendingFinalizationCount" : 0,
    "NonHeapMemoryUsage" : {
      "committed" : 81408000,
      "init" : 2555904,
      "max" : -1,
      "used" : 80115416
    },
    "HeapMemoryUsage" : {
      "committed" : 8536260608,
      "init" : 8589934592,
      "max" : 8536260608,
      "used" : 1738968880
    },
    "ObjectName" : "java.lang:type=Memory"
  } ],
  "beans" : [ {
    "name" : "Hadoop:service=HBase,name=RegionServer,sub=IPC",
    "modelerType" : "RegionServer,sub=IPC",
    "tag.Context" : "regionserver",
    "tag.Hostname" : "ip-172-31-103-48",
    "queueSize" : 0,
    "numCallsInGeneralQueue" : 0,
    "numCallsInReplicationQueue" : 0,
    "numCallsInPriorityQueue" : 0,
    "numOpenConnections" : 1,
    "numActiveHandler" : 0,
    "receivedBytes" : 1190510401,
    "exceptions.RegionMovedException" : 10,
    "authenticationSuccesses" : 0,
    "authorizationFailures" : 0,
    "TotalCallTime_num_ops" : 5758,
    "TotalCallTime_min" : 0,
    "TotalCallTime_max" : 69392,
    "TotalCallTime_mean" : 29.966828759986107,
    "TotalCallTime_median" : 3.0,
    "TotalCallTime_75th_percentile" : 6.0,
    "TotalCallTime_95th_percentile" : 11.0,
    "TotalCallTime_99th_percentile" : 17.0,
    "exceptions.RegionTooBusyException" : 0,
    "exceptions.FailedSanityCheckException" : 0,
    "exceptions.UnknownScannerException" : 0,
    "exceptions.OutOfOrderScannerNextException" : 0,
    "exceptions" : 11,
    "ProcessCallTime_num_ops" : 5758,
    "ProcessCallTime_min" : 0,
    "ProcessCallTime_max" : 69391,
    "ProcessCallTime_mean" : 29.88711358110455,
    "ProcessCallTime_median" : 3.0,
    "ProcessCallTime_75th_percentile" : 6.0,
    "ProcessCallTime_95th_percentile" : 11.0,
    "ProcessCallTime_99th_percentile" : 17.0,
    "exceptions.NotServingRegionException" : 0,
    "authorizationSuccesses" : 4,
    "sentBytes" : 2445857,
    "QueueCallTime_num_ops" : 5758,
    "QueueCallTime_min" : 0,
    "QueueCallTime_max" : 10,
    "QueueCallTime_mean" : 0.07971517888155609,
    "QueueCallTime_median" : 0.0,
    "QueueCallTime_75th_percentile" : 0.0,
    "QueueCallTime_95th_percentile" : 1.0,
    "QueueCallTime_99th_percentile" : 1.0,
    "authenticationFailures" : 0
  } ],
  "beans" : [ {
    "name" : "Hadoop:service=HBase,name=RegionServer,sub=Replication",
    "modelerType" : "RegionServer,sub=Replication",
    "tag.Context" : "regionserver",
    "tag.Hostname" : "ip-172-31-103-48",
    "sink.appliedOps" : 0,
    "sink.ageOfLastAppliedOp" : 0,
    "sink.appliedBatches" : 0
  } ],
  "beans" : [ {
    "name" : "Hadoop:service=HBase,name=RegionServer,sub=Server",
    "modelerType" : "RegionServer,sub=Server",
    "tag.zookeeperQuorum" : "ip-172-31-103-112.us-west-2.compute.internal:2181,ip-172-31-103-171.us-west-2.compute.internal:2181,ip-172-31-103-252.us-west-2.compute.internal:2181",
    "tag.serverName" : "ip-172-31-103-48.us-west-2.compute.internal,16020,1472451821818",
    "tag.clusterId" : "aa465b2d-db65-4316-87b9-fff8ca04e997",
    "tag.Context" : "regionserver",
    "tag.Hostname" : "ip-172-31-103-48",
    "regionCount" : 168,
    "storeCount" : 168,
    "hlogFileCount" : 11,
    "hlogFileSize" : 1180272772,
    "storeFileCount" : 278,
    "memStoreSize" : 1010265400,
    "storeFileSize" : 112800764928,
    "regionServerStartTime" : 1472451821818,
    "totalRequestCount" : 361717,
    "readRequestCount" : 0,
    "writeRequestCount" : 335401,
    "checkMutateFailedCount" : 0,
    "checkMutatePassedCount" : 0,
    "storeFileIndexSize" : 4179624,
    "staticIndexSize" : 502936093,
    "staticBloomSize" : 289169680,
    "mutationsWithoutWALCount" : 0,
    "mutationsWithoutWALSize" : 0,
    "percentFilesLocal" : 83,
    "percentFilesLocalSecondaryRegions" : 0,
    "splitQueueLength" : 0,
    "compactionQueueLength" : 1,
    "flushQueueLength" : 0,
    "blockCacheFreeSize" : 3403982448,
    "blockCacheCount" : 63,
    "blockCacheSize" : 10521744,
    "blockCacheHitCount" : 78562,
    "blockCacheHitCountPrimary" : 78562,
    "blockCacheMissCount" : 169129,
    "blockCacheMissCountPrimary" : 169129,
    "blockCacheEvictionCount" : 0,
    "blockCacheEvictionCountPrimary" : 0,
    "blockCacheCountHitPercent" : 31.0,
    "blockCacheExpressHitPercent" : 99,
    "updatesBlockedTime" : 0,
    "flushedCellsCount" : 40535,
    "compactedCellsCount" : 2779805,
    "majorCompactedCellsCount" : 247649,
    "flushedCellsSize" : 152286384,
    "compactedCellsSize" : 8624831557,
    "majorCompactedCellsSize" : 825856378,
    "blockedRequestCount" : 0,
    "Mutate_num_ops" : 12673,
    "Mutate_min" : 0,
    "Mutate_max" : 69383,
    "Mutate_mean" : 13.328335832083958,
    "Mutate_median" : 2.0,
    "Mutate_75th_percentile" : 3.0,
    "Mutate_95th_percentile" : 5.0,
    "Mutate_99th_percentile" : 10.0,
    "slowAppendCount" : 0,
    "slowDeleteCount" : 0,
    "Increment_num_ops" : 0,
    "Increment_min" : 0,
    "Increment_max" : 0,
    "Increment_mean" : 0.0,
    "Increment_median" : 0.0,
    "Increment_75th_percentile" : 0.0,
    "Increment_95th_percentile" : 0.0,
    "Increment_99th_percentile" : 0.0,
    "Replay_num_ops" : 0,
    "Replay_min" : 0,
    "Replay_max" : 0,
    "Replay_mean" : 0.0,
    "Replay_median" : 0.0,
    "Replay_75th_percentile" : 0.0,
    "Replay_95th_percentile" : 0.0,
    "Replay_99th_percentile" : 0.0,
    "FlushTime_num_ops" : 1,
    "FlushTime_min" : 70197,
    "FlushTime_max" : 70197,
    "FlushTime_mean" : 70197.0,
    "FlushTime_median" : 70197.0,
    "FlushTime_75th_percentile" : 70197.0,
    "FlushTime_95th_percentile" : 70197.0,
    "FlushTime_99th_percentile" : 70197.0,
    "Delete_num_ops" : 0,
    "Delete_min" : 0,
    "Delete_max" : 0,
    "Delete_mean" : 0.0,
    "Delete_median" : 0.0,
    "Delete_75th_percentile" : 0.0,
    "Delete_95th_percentile" : 0.0,
    "Delete_99th_percentile" : 0.0,
    "splitRequestCount" : 0,
    "splitSuccessCount" : 0,
    "slowGetCount" : 0,
    "Get_num_ops" : 0,
    "Get_min" : 0,
    "Get_max" : 0,
    "Get_mean" : 0.0,
    "Get_median" : 0.0,
    "Get_75th_percentile" : 0.0,
    "Get_95th_percentile" : 0.0,
    "Get_99th_percentile" : 0.0,
    "ScanNext_num_ops" : 0,
    "ScanNext_min" : 0,
    "ScanNext_max" : 0,
    "ScanNext_mean" : 0.0,
    "ScanNext_median" : 0.0,
    "ScanNext_75th_percentile" : 0.0,
    "ScanNext_95th_percentile" : 0.0,
    "ScanNext_99th_percentile" : 0.0,
    "slowPutCount" : 2,
    "slowIncrementCount" : 0,
    "Append_num_ops" : 0,
    "Append_min" : 0,
    "Append_max" : 0,
    "Append_mean" : 0.0,
    "Append_median" : 0.0,
    "Append_75th_percentile" : 0.0,
    "Append_95th_percentile" : 0.0,
    "Append_99th_percentile" : 0.0,
    "SplitTime_num_ops" : 0,
    "SplitTime_min" : 0,
    "SplitTime_max" : 0,
    "SplitTime_mean" : 0.0,
    "SplitTime_median" : 0.0,
    "SplitTime_75th_percentile" : 0.0,
    "SplitTime_95th_percentile" : 0.0,
    "SplitTime_99th_percentile" : 0.0
  } ]
}
2016-08-29 07:10:22,209 INFO  [RS_OPEN_REGION-ip-172-31-103-48:16020-0] regionserver.HRegionServer: STOPPED: Exception refreshing OPENING; region=4ef6634b001b40cd44c40c8406d6d389, context=open_region_progress
2016-08-29 07:10:22,526 INFO  [main-SendThread(172.31.103.112:2181)] zookeeper.ClientCnxn: Opening socket connection to server 172.31.103.112/172.31.103.112:2181. Will not attempt to authenticate using SASL (unknown error)
2016-08-29 07:10:23,354 INFO  [regionserver/ip-172-31-103-48.us-west-2.compute.internal/172.31.103.48:16020-SendThread(ip-172-31-103-112.us-west-2.compute.internal:2181)] zookeeper.ClientCnxn: Opening socket connection to server ip-172-31-103-112.us-west-2.compute.internal/172.31.103.112:2181. Will not attempt to authenticate using SASL (unknown error)
2016-08-29 07:10:24,403 INFO  [ip-172-31-103-48.us-west-2.compute.internal,16020,1472451821818_ChoreService_1] regionserver.HRegionServer$CompactionChecker: Chore: CompactionChecker was stopped
2016-08-29 07:10:24,404 INFO  [ip-172-31-103-48.us-west-2.compute.internal,16020,1472451821818_ChoreService_1] regionserver.HRegionServer$PeriodicMemstoreFlusher: Chore: ip-172-31-103-48.us-west-2.compute.internal,16020,1472451821818-MemstoreFlusherChore was stopped
2016-08-29 07:10:24,783 INFO  [MemStoreFlusher.0] regionserver.MemStoreFlusher: MemStoreFlusher.0 exiting
2016-08-29 07:10:25,524 WARN  [regionserver/ip-172-31-103-48.us-west-2.compute.internal/172.31.103.48:16020-SendThread(ip-172-31-103-112.us-west-2.compute.internal:2181)] zookeeper.ClientCnxn: Session 0x156d486e2120012 for server null, unexpected error, closing socket connection and attempting reconnect
java.net.NoRouteToHostException: No route to host
 at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
 at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:717)
 at org.apache.zookeeper.ClientCnxnSocketNIO.doTransport(ClientCnxnSocketNIO.java:361)
 at org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:1125)
2016-08-29 07:10:25,524 WARN  [main-SendThread(172.31.103.112:2181)] zookeeper.ClientCnxn: Session 0x356d4878aa0001a for server null, unexpected error, closing socket connection and attempting reconnect
java.net.NoRouteToHostException: No route to host
 at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
 at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:717)
 at org.apache.zookeeper.ClientCnxnSocketNIO.doTransport(ClientCnxnSocketNIO.java:361)
 at org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:1125)
    

2. The YARN NodeManager dies less frequently, but with similar network connection errors

2016-08-29 15:47:13,951 FATAL nodemanager.NodeManager (NodeManager.java:run(360)) - Error while rebooting NodeStatusUpdater.
org.apache.hadoop.yarn.exceptions.YarnRuntimeException: java.net.NoRouteToHostException: No Route to Host from  java.net.UnknownHostException: ip-172-31-103-48: ip-172-31-103-48: unknown error to ip-172-31-103-112.us-west-2.compute.internal:8031 failed on socket timeout exception: java.net.NoRouteToHostException: No route to host; For more details see:  http://wiki.apache.org/hadoop/NoRouteToHost
 at org.apache.hadoop.yarn.server.nodemanager.NodeStatusUpdaterImpl.rebootNodeStatusUpdaterAndRegisterWithRM(NodeStatusUpdaterImpl.java:254)
 at org.apache.hadoop.yarn.server.nodemanager.NodeManager$2.run(NodeManager.java:357)
Caused by: java.net.NoRouteToHostException: No Route to Host from  java.net.UnknownHostException: ip-172-31-103-48: ip-172-31-103-48: unknown error to ip-172-31-103-112.us-west-2.compute.internal:8031 failed on socket timeout exception: java.net.NoRouteToHostException: No route to host; For more details see:  http://wiki.apache.org/hadoop/NoRouteToHost
 at sun.reflect.GeneratedConstructorAccessor34.newInstance(Unknown Source)
 at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
 at java.lang.reflect.Constructor.newInstance(Constructor.java:422)
 at org.apache.hadoop.net.NetUtils.wrapWithMessage(NetUtils.java:801)
 at org.apache.hadoop.net.NetUtils.wrapException(NetUtils.java:758)
 at org.apache.hadoop.ipc.Client.call(Client.java:1430)
 at org.apache.hadoop.ipc.Client.call(Client.java:1363)
 at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:229)
 at com.sun.proxy.$Proxy82.registerNodeManager(Unknown Source)
 at org.apache.hadoop.yarn.server.api.impl.pb.client.ResourceTrackerPBClientImpl.registerNodeManager(ResourceTrackerPBClientImpl.java:68)
 at sun.reflect.GeneratedMethodAccessor35.invoke(Unknown Source)
 at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
 at java.lang.reflect.Method.invoke(Method.java:497)
 at org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:256)
 at org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:104)
 at com.sun.proxy.$Proxy83.registerNodeManager(Unknown Source)
 at org.apache.hadoop.yarn.server.nodemanager.NodeStatusUpdaterImpl.registerWithRM(NodeStatusUpdaterImpl.java:296)
 at org.apache.hadoop.yarn.server.nodemanager.NodeStatusUpdaterImpl.rebootNodeStatusUpdaterAndRegisterWithRM(NodeStatusUpdaterImpl.java:246)
 ... 1 more
Caused by: java.net.NoRouteToHostException: No route to host
 at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
 at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:717)
 at org.apache.hadoop.net.SocketIOWithTimeout.connect(SocketIOWithTimeout.java:206)
 at org.apache.hadoop.net.NetUtils.connect(NetUtils.java:531)
 at org.apache.hadoop.net.NetUtils.connect(NetUtils.java:495)
 at org.apache.hadoop.ipc.Client$Connection.setupConnection(Client.java:617)
 at org.apache.hadoop.ipc.Client$Connection.setupIOstreams(Client.java:715)
 at org.apache.hadoop.ipc.Client$Connection.access$2900(Client.java:378)
 at org.apache.hadoop.ipc.Client.getConnection(Client.java:1492)
 at org.apache.hadoop.ipc.Client.call(Client.java:1402)
 ... 13 more
2016-08-29 15:47:13,961 INFO  mortbay.log (Slf4jLog.java:info(67)) - Stopped HttpServer2$SelectChannelConnectorWithSafeStartup@0.0.0.0:8042

3. The node sometimes becomes unresponsive and you cannot log in to it. This mostly happens after a RegionServer failure, and the node keeps dropping on and off even though the RegionServer is no longer running. The AWS instance status check fails. The node comes back after a few minutes to a few hours.

There is surely some network misconfiguration in the cluster, but the above issues happen on only a few machines. The cluster runs in a VPC. ulimit and nproc are set to 32768 and 65536 respectively. Most host metrics look normal.

(Attached screenshot of host metrics: 7058-screen-shot-2016-08-29-at-44357-pm.png)

Any ideas on debugging this would be greatly appreciated.

Thanks

1 ACCEPTED SOLUTION

The cause was the DHCP client failing to get a response from the DHCP server for periods of time. The d2 instances (Ubuntu 14.04) were using Enhanced Networking with the "ixgbevf" driver, version 2.11.3-k. That version is below the minimum recommended version, 2.14.2, and the recommended upgrade target was 2.16.4. We upgraded the driver to the latest version, which seems to have fixed the issue.

Reference: https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/sriov-networking.html#enhanced-networking-ubuntu
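For anyone wanting to check other nodes for the same condition, here is a minimal Python sketch, not something from the original fix. It assumes you capture the output of `ethtool -i <interface>` on each node and pass it in as text. The function names are made up for illustration, and the minimum version is simply the 2.14.2 figure quoted above.

```python
import re


def parse_ethtool_info(text):
    """Parse 'key: value' lines (the shape printed by `ethtool -i`) into a dict."""
    info = {}
    for line in text.splitlines():
        # Split on the first colon only, so values such as PCI addresses survive.
        key, sep, value = line.partition(":")
        if sep:
            info[key.strip()] = value.strip()
    return info


def version_tuple(version):
    """Turn '2.11.3-k' into (2, 11, 3); any non-numeric suffix is ignored."""
    match = re.match(r"(\d+(?:\.\d+)*)", version)
    if not match:
        raise ValueError("no numeric version found in %r" % version)
    return tuple(int(part) for part in match.group(1).split("."))


def is_below_minimum(installed, minimum):
    """True when the installed driver version is older than the minimum."""
    return version_tuple(installed) < version_tuple(minimum)


# Illustrative input built from the versions mentioned in this thread,
# not captured output from a real node.
sample = "driver: ixgbevf\nversion: 2.11.3-k\n"
info = parse_ethtool_info(sample)
if info.get("driver") == "ixgbevf" and is_below_minimum(info["version"], "2.14.2"):
    print("ixgbevf %s is below the recommended minimum" % info["version"])
```

Comparing numeric tuples rather than raw strings matters here, because a plain string comparison would rank "2.9" above "2.14".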

REPLIES

Master Collaborator

Regarding the AWS instance status check failure, have you looked at the following?

http://docs.aws.amazon.com/AWSEC2/latest/UserGuide/TroubleshootingInstances.html

Master Collaborator

NoRouteToHostException occurred in both the RegionServer and NodeManager logs.

Please check network connectivity.
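One way to run that check from an affected node is a small TCP probe, sketched below in Python. This is only an illustration under assumptions: the helper names are hypothetical, and the hosts and ports in the usage comment are the ZooKeeper quorum (2181) and ResourceManager (8031) endpoints that appear in the logs above. It tests whether a TCP connection can be opened and nothing more.

```python
import socket


def probe(host, port, timeout=3.0):
    """Attempt one TCP connection and return (ok, detail)."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True, "connected"
    except OSError as exc:
        # Covers refused connections, timeouts, DNS failures and
        # "No route to host" (EHOSTUNREACH).
        return False, "%s: %s" % (type(exc).__name__, exc)


def probe_all(endpoints, timeout=3.0):
    """Probe each (host, port) pair and return {"host:port": (ok, detail)}."""
    return {
        "%s:%d" % (host, port): probe(host, port, timeout)
        for host, port in endpoints
    }


# Example usage (endpoints taken from the logs in the question):
#
#   endpoints = [
#       ("ip-172-31-103-112.us-west-2.compute.internal", 2181),
#       ("ip-172-31-103-171.us-west-2.compute.internal", 2181),
#       ("ip-172-31-103-252.us-west-2.compute.internal", 2181),
#       ("ip-172-31-103-112.us-west-2.compute.internal", 8031),
#   ]
#   for endpoint, (ok, detail) in probe_all(endpoints).items():
#       print(endpoint, "OK" if ok else "FAILED", detail)
```

Running it on a schedule from both a failing and a healthy data node, and logging the results with timestamps, may help show whether the outages affect every destination at once (pointing at the local interface) or only some of them.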
