Master01 ZKFC log: 2016-04-27 08:32:19,813 INFO zookeeper.ClientCnxn (ClientCnxn.java:logStartConnect(975)) - Opening socket connection to server zookeeper03/x.x.x.x:2181. Will not attempt to authenticate using SASL (unknown error) 2016-04-27 08:32:19,814 INFO zookeeper.ClientCnxn (ClientCnxn.java:primeConnection(852)) - Socket connection established to zookeeper03/x.x.x.x:2181, initiating session 2016-04-27 08:32:19,815 INFO zookeeper.ClientCnxn (ClientCnxn.java:run(1094)) - Unable to reconnect to ZooKeeper service, session 0x353e06ae5260000 has expired, closing socket connection 2016-04-27 08:32:19,815 INFO ha.ActiveStandbyElector (ActiveStandbyElector.java:processWatchEvent(568)) - Session expired. Entering neutral mode and rejoining... 2016-04-27 08:32:19,815 INFO ha.ActiveStandbyElector (ActiveStandbyElector.java:reJoinElection(670)) - Trying to re-establish ZK session 2016-04-27 08:32:19,816 INFO zookeeper.ZooKeeper (ZooKeeper.java:(438)) - Initiating client connection, connectString=zookeeper01:2181,zookeeper02:2181,zookeeper03:2181 sessionTimeout=5000 watcher=org.apache.hadoop.ha.ActiveStandbyElector$WatcherWithClientRef@728edfd1 2016-04-27 08:32:19,818 INFO zookeeper.ClientCnxn (ClientCnxn.java:logStartConnect(975)) - Opening socket connection to server zookeeper01/x.x.x.x:2181. Will not attempt to authenticate using SASL (unknown error) 2016-04-27 08:32:19,818 INFO zookeeper.ClientCnxn (ClientCnxn.java:primeConnection(852)) - Socket connection established to zookeeper01/x.x.x.x:2181, initiating session 2016-04-27 08:32:19,820 INFO zookeeper.ClientCnxn (ClientCnxn.java:onConnected(1235)) - Session establishment complete on server zookeeper01/x.x.x.x:2181, sessionid = 0x15451ba83ba0ae0, negotiated timeout = 5000 2016-04-27 08:32:19,821 INFO zookeeper.ClientCnxn (ClientCnxn.java:run(512)) - EventThread shut down 2016-04-27 08:32:19,822 INFO ha.ActiveStandbyElector (ActiveStandbyElector.java:processWatchEvent(547)) - Session connected. 2016-04-27 08:32:19,825 INFO ha.ZKFailoverController (ZKFailoverController.java:becomeStandby(484)) - ZK Election indicated that NameNode at master01/x.x.x.x:8020 should become standby 2016-04-27 08:32:24,833 ERROR ha.ZKFailoverController (ZKFailoverController.java:becomeStandby(492)) - Couldn't transition NameNode at master01/x.x.x.x:8020 to standby state java.net.SocketTimeoutException: Call From master01/x.x.x.x to master01:8020 failed on socket timeout exception: java.net.SocketTimeoutException: 5000 millis timeout while waiting for channel to be ready for read. ch : java.nio.channels.SocketChannel[connected local=/x.x.x.x:39089 remote=master01/x.x.x.x:8020]; For more details see: http://wiki.apache.org/hadoop/SocketTimeout at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method) at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62) at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45) at java.lang.reflect.Constructor.newInstance(Constructor.java:422) at org.apache.hadoop.net.NetUtils.wrapWithMessage(NetUtils.java:792) at org.apache.hadoop.net.NetUtils.wrapException(NetUtils.java:751) at org.apache.hadoop.ipc.Client.call(Client.java:1431) at org.apache.hadoop.ipc.Client.call(Client.java:1358) at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:229) at com.sun.proxy.$Proxy9.transitionToStandby(Unknown Source) at org.apache.hadoop.ha.protocolPB.HAServiceProtocolClientSideTranslatorPB.transitionToStandby(HAServiceProtocolClientSideTranslatorPB.java:112) at org.apache.hadoop.ha.ZKFailoverController.becomeStandby(ZKFailoverController.java:488) at org.apache.hadoop.ha.ZKFailoverController.access$1000(ZKFailoverController.java:61) at org.apache.hadoop.ha.ZKFailoverController$ElectorCallbacks.becomeStandby(ZKFailoverController.java:878) at org.apache.hadoop.ha.ActiveStandbyElector.becomeStandby(ActiveStandbyElector.java:911) at org.apache.hadoop.ha.ActiveStandbyElector.processResult(ActiveStandbyElector.java:428) at org.apache.zookeeper.ClientCnxn$EventThread.processEvent(ClientCnxn.java:605) at org.apache.zookeeper.ClientCnxn$EventThread.run(ClientCnxn.java:498) Caused by: java.net.SocketTimeoutException: 5000 millis timeout while waiting for channel to be ready for read. ch : java.nio.channels.SocketChannel[connected local=/x.x.x.x:39089 remote=master01/x.x.x.x:8020] at org.apache.hadoop.net.SocketIOWithTimeout.doIO(SocketIOWithTimeout.java:164) at org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:161) at org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:131) at java.io.FilterInputStream.read(FilterInputStream.java:133) at java.io.FilterInputStream.read(FilterInputStream.java:133) at org.apache.hadoop.ipc.Client$Connection$PingInputStream.read(Client.java:518) at java.io.BufferedInputStream.fill(BufferedInputStream.java:246) at java.io.BufferedInputStream.read(BufferedInputStream.java:265) at java.io.DataInputStream.readInt(DataInputStream.java:387) at org.apache.hadoop.ipc.Client$Connection.receiveRpcResponse(Client.java:1082) at org.apache.hadoop.ipc.Client$Connection.run(Client.java:977) 2016-04-27 08:33:04,980 WARN ha.HealthMonitor (HealthMonitor.java:doHealthChecks(211)) - Transport-level exception trying to monitor health of NameNode at master01/x.x.x.x:8020: java.net.SocketTimeoutException: 45000 millis timeout while waiting for channel to be ready for read. ch : java.nio.channels.SocketChannel[connected local=/x.x.x.x:47802 remote=master01/x.x.x.x:8020] Call From master01/x.x.x.x to master01:8020 failed on socket timeout exception: java.net.SocketTimeoutException: 45000 millis timeout while waiting for channel to be ready for read. ch : java.nio.channels.SocketChannel[connected local=/x.x.x.x:47802 remote=master01/x.x.x.x:8020]; For more details see: http://wiki.apache.org/hadoop/SocketTimeout 2016-04-27 08:33:04,980 INFO ha.HealthMonitor (HealthMonitor.java:enterState(249)) - Entering state SERVICE_NOT_RESPONDING 2016-04-27 08:33:04,981 INFO ha.ZKFailoverController (ZKFailoverController.java:setLastHealthState(852)) - Local service NameNode at master01/x.x.x.x:8020 entered state: SERVICE_NOT_RESPONDING 2016-04-27 08:33:04,981 INFO ha.ZKFailoverController (ZKFailoverController.java:recheckElectability(768)) - Quitting master election for NameNode at master01/x.x.x.x:8020 and marking that fencing is necessary 2016-04-27 08:33:04,981 INFO ha.ActiveStandbyElector (ActiveStandbyElector.java:quitElection(354)) - Yielding from election 2016-04-27 08:33:04,986 INFO zookeeper.ZooKeeper (ZooKeeper.java:close(684)) - Session: 0x15451ba83ba0ae0 closed 2016-04-27 08:33:04,986 INFO zookeeper.ClientCnxn (ClientCnxn.java:run(512)) - EventThread shut down 2016-04-27 08:34:37,080 WARN ha.HealthMonitor (HealthMonitor.java:doHealthChecks(211)) - Transport-level exception trying to monitor health of NameNode at master01/x.x.x.x:8020: java.net.SocketTimeoutException: 45000 millis timeout while waiting for channel to be ready for read. ch : java.nio.channels.SocketChannel[connected local=/x.x.x.x:39152 remote=master01/x.x.x.x:8020] Call From master01/x.x.x.x to master01:8020 failed on socket timeout exception: java.net.SocketTimeoutException: 45000 millis timeout while waiting for channel to be ready for read. ch : java.nio.channels.SocketChannel[connected local=/x.x.x.x:39152 remote=master01/x.x.x.x:8020]; For more details see: http://wiki.apache.org/hadoop/SocketTimeout 2016-04-27 08:35:23,125 WARN ha.HealthMonitor (HealthMonitor.java:doHealthChecks(211)) - Transport-level exception trying to monitor health of NameNode at master01/x.x.x.x:8020: java.net.SocketTimeoutException: 45000 millis timeout while waiting for channel to be ready for read. ch : java.nio.channels.SocketChannel[connected local=/x.x.x.x:39162 remote=master01/x.x.x.x:8020] Call From master01/x.x.x.x to master01:8020 failed on socket timeout exception: java.net.SocketTimeoutException: 45000 millis timeout while waiting for channel to be ready for read. ch : java.nio.channels.SocketChannel[connected local=/x.x.x.x:39162 remote=master01/x.x.x.x:8020]; For more details see: http://wiki.apache.org/hadoop/SocketTimeout ... more of same warning ... When restart NN ... 2016-04-27 12:26:23,125 WARN ha.HealthMonitor (HealthMonitor.java:doHealthChecks(211)) - Transport-level exception trying to monitor health of NameNode at master01/x.x.x.x:8020: java.net.SocketTimeoutException: 45000 millis timeout while waiting for channel to be ready for read. ch : java.nio.channels.SocketChannel[connected local=/x.x.x.x:39162 remote=master01/x.x.x.x:8020] Call From master01/x.x.x.x to master01:8020 failed on socket timeout exception: java.net.SocketTimeoutException: 45000 millis timeout while waiting for channel to be ready for read. ch : java.nio.channels.SocketChannel[connected local=/x.x.x.x:39162 remote=master01/x.x.x.x:8020]; For more details see: http://wiki.apache.org/hadoop/SocketTimeout 2016-04-27 12:27:21,961 WARN ha.HealthMonitor (HealthMonitor.java:doHealthChecks(211)) - Transport-level exception trying to monitor health of NameNode at master01/x.x.x.x:8020: null NameNode still not started at org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.checkNNStartup(NameNodeRpcServer.java:1907) at org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.getServiceStatus(NameNodeRpcServer.java:1495) at org.apache.hadoop.ha.protocolPB.HAServiceProtocolServerSideTranslatorPB.getServiceStatus(HAServiceProtocolServerSideTranslatorPB.java:131) at org.apache.hadoop.ha.proto.HAServiceProtocolProtos$HAServiceProtocolService$2.callBlockingMethod(HAServiceProtocolProtos.java:4464) at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:616) at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:969) at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2137) at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2133) at java.security.AccessController.doPrivileged(Native Method) at javax.security.auth.Subject.doAs(Subject.java:422) at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1657) at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2131) 2016-04-27 12:27:22,975 INFO ha.HealthMonitor (HealthMonitor.java:enterState(249)) - Entering state SERVICE_HEALTHY 2016-04-27 12:27:22,975 INFO ha.ZKFailoverController (ZKFailoverController.java:setLastHealthState(852)) - Local service NameNode at master01/x.x.x.x:8020 entered state: SERVICE_HEALTHY 2016-04-27 12:27:22,976 INFO zookeeper.ZooKeeper (ZooKeeper.java:(438)) - Initiating client connection, connectString=zookeeper01:2181,zookeeper02:2181,zookeeper03:2181 sessionTimeout=5000 watcher=org.apache.hadoop.ha.ActiveStandbyElector$WatcherWithClientRef@24f3992 2016-04-27 12:27:22,978 INFO zookeeper.ClientCnxn (ClientCnxn.java:logStartConnect(975)) - Opening socket connection to server zookeeper03/x.x.x.x:2181. Will not attempt to authenticate using SASL (unknown error) 2016-04-27 12:27:22,978 INFO zookeeper.ClientCnxn (ClientCnxn.java:primeConnection(852)) - Socket connection established to zookeeper03/x.x.x.x:2181, initiating session 2016-04-27 12:27:22,981 INFO zookeeper.ClientCnxn (ClientCnxn.java:onConnected(1235)) - Session establishment complete on server zookeeper03/x.x.x.x:2181, sessionid = 0x35451bb0f752402, negotiated timeout = 5000 2016-04-27 12:27:22,984 INFO ha.ActiveStandbyElector (ActiveStandbyElector.java:processWatchEvent(547)) - Session connected. 2016-04-27 12:27:22,985 INFO ha.ZKFailoverController (ZKFailoverController.java:becomeStandby(484)) - ZK Election indicated that NameNode at master01/x.x.x.x:8020 should become standby 2016-04-27 12:27:22,995 INFO ha.ZKFailoverController (ZKFailoverController.java:becomeStandby(489)) - Successfully transitioned NameNode at master01/x.x.x.x:8020 to standby state Master02 ZKFC Log: 2016-04-27 08:30:44,094 WARN ha.FailoverController (FailoverController.java:tryGracefulFence(178)) - Unable to gracefully make NameNode at master01/x.x.x.x:8020 standby (unable to connect) org.apache.hadoop.net.ConnectTimeoutException: Call From master02/x.x.x.x to master01:8020 failed on socket timeout exception: org.apache.hadoop.net.ConnectTimeoutException: 20000 millis timeout while waiting for channel to be ready for connect. ch : java.nio.channels.SocketChannel[connection-pending remote=master01/x.x.x.x:8020]; For more details see: http://wiki.apache.org/hadoop/SocketTimeout at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method) at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62) at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45) at java.lang.reflect.Constructor.newInstance(Constructor.java:422) at org.apache.hadoop.net.NetUtils.wrapWithMessage(NetUtils.java:792) at org.apache.hadoop.net.NetUtils.wrapException(NetUtils.java:751) at org.apache.hadoop.ipc.Client.call(Client.java:1431) at org.apache.hadoop.ipc.Client.call(Client.java:1358) at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:229) at com.sun.proxy.$Proxy9.transitionToStandby(Unknown Source) at org.apache.hadoop.ha.protocolPB.HAServiceProtocolClientSideTranslatorPB.transitionToStandby(HAServiceProtocolClientSideTranslatorPB.java:112) at org.apache.hadoop.ha.FailoverController.tryGracefulFence(FailoverController.java:172) at org.apache.hadoop.ha.ZKFailoverController.doFence(ZKFailoverController.java:514) at org.apache.hadoop.ha.ZKFailoverController.fenceOldActive(ZKFailoverController.java:505) at org.apache.hadoop.ha.ZKFailoverController.access$1100(ZKFailoverController.java:61) at org.apache.hadoop.ha.ZKFailoverController$ElectorCallbacks.fenceOldActive(ZKFailoverController.java:892) at org.apache.hadoop.ha.ActiveStandbyElector.fenceOldActive(ActiveStandbyElector.java:902) at org.apache.hadoop.ha.ActiveStandbyElector.becomeActive(ActiveStandbyElector.java:801) at org.apache.hadoop.ha.ActiveStandbyElector.processResult(ActiveStandbyElector.java:416) at org.apache.zookeeper.ClientCnxn$EventThread.processEvent(ClientCnxn.java:599) at org.apache.zookeeper.ClientCnxn$EventThread.run(ClientCnxn.java:498) Caused by: org.apache.hadoop.net.ConnectTimeoutException: 20000 millis timeout while waiting for channel to be ready for connect. ch : java.nio.channels.SocketChannel[connection-pending remote=master01/x.x.x.x:8020] at org.apache.hadoop.net.NetUtils.connect(NetUtils.java:534) at org.apache.hadoop.net.NetUtils.connect(NetUtils.java:495) at org.apache.hadoop.ipc.Client$Connection.setupConnection(Client.java:612) at org.apache.hadoop.ipc.Client$Connection.setupIOstreams(Client.java:710) at org.apache.hadoop.ipc.Client$Connection.access$2800(Client.java:373) at org.apache.hadoop.ipc.Client.getConnection(Client.java:1493) at org.apache.hadoop.ipc.Client.call(Client.java:1397) ... 14 more 2016-04-27 08:30:44,095 INFO ha.NodeFencer (NodeFencer.java:fence(91)) - ====== Beginning Service Fencing Process... ====== 2016-04-27 08:30:44,095 INFO ha.NodeFencer (NodeFencer.java:fence(94)) - Trying method 1/1: org.apache.hadoop.ha.ShellCommandFencer(/bin/true) 2016-04-27 08:30:44,099 INFO ha.ShellCommandFencer (ShellCommandFencer.java:tryFence(99)) - Launched fencing command '/bin/true' with pid 2968 2016-04-27 08:30:44,104 INFO ha.NodeFencer (NodeFencer.java:fence(98)) - ====== Fencing successful by method org.apache.hadoop.ha.ShellCommandFencer(/bin/true) ====== 2016-04-27 08:30:44,105 INFO ha.ActiveStandbyElector (ActiveStandbyElector.java:writeBreadCrumbNode(824)) - Writing znode /hadoop-ha/masterHA/ActiveBreadCrumb to indicate that the local node is the most recent active... 2016-04-27 08:30:44,108 INFO ha.ZKFailoverController (ZKFailoverController.java:becomeActive(380)) - Trying to make NameNode at master02/x.x.x.x:8020 active... 2016-04-27 08:30:46,475 INFO ha.ZKFailoverController (ZKFailoverController.java:becomeActive(387)) - Successfully transitioned NameNode at master02/x.x.x.x:8020 to active state