Created 10-21-2013 07:44 AM
Hi, we have CDH 4.3 with NameNode HA based on a journal quorum.
The failover controller sometimes goes down, and I can't find the reason. We have two NameNodes: prod-node015 and prod-node033.
prod-node017 is one of the ZooKeeper instances.
Here are some logs:
STDERR:
+ exec /opt/cloudera/parcels/CDH-4.3.0-1.cdh4.3.0.p0.22/lib/hadoop-hdfs/bin/hdfs --config /var/run/cloudera-scm-agent/process/5984-hdfs-FAILOVERCONTROLLER zkfc
Exception in thread "main" java.lang.RuntimeException: ZK Failover Controller failed: Received stat error from Zookeeper. code:CONNECTIONLOSS. Not retrying further znode monitoring connection errors.
at org.apache.hadoop.ha.ZKFailoverController.mainLoop(ZKFailoverController.java:359)
at org.apache.hadoop.ha.ZKFailoverController.doRun(ZKFailoverController.java:231)
at org.apache.hadoop.ha.ZKFailoverController.access$000(ZKFailoverController.java:58)
at org.apache.hadoop.ha.ZKFailoverController$1.run(ZKFailoverController.java:165)
at org.apache.hadoop.ha.ZKFailoverController$1.run(ZKFailoverController.java:161)
at org.apache.hadoop.security.SecurityUtil.doAsLoginUserOrFatal(SecurityUtil.java:452)
at org.apache.hadoop.ha.ZKFailoverController.run(ZKFailoverController.java:161)
at org.apache.hadoop.hdfs.tools.DFSZKFailoverController.main(DFSZKFailoverController.java:175)
Role log details:
04:01:53.011 INFO org.apache.zookeeper.ClientCnxn
Opening socket connection to server prod-node017.lol.ru/10.66.49.159:2181. Will not attempt to authenticate using SASL (Unable to locate a login configuration)
04:01:53.011 INFO org.apache.zookeeper.ClientCnxn
Socket connection established to prod-node017.lol.ru/10.66.49.159:2181, initiating session
04:01:53.029 INFO org.apache.zookeeper.ClientCnxn
Session establishment complete on server prod-node017.lol.ru/10.66.49.159:2181, sessionid = 0x640f2b61aa60001, negotiated timeout = 5000
04:01:53.032 INFO org.apache.hadoop.ha.ActiveStandbyElector
Session connected.
04:01:53.044 INFO org.apache.hadoop.ha.ZKFailoverController
ZK Election indicated that NameNode at prod-node015.lol.ru/10.66.49.155:8020 should become standby
04:01:53.047 INFO org.apache.hadoop.ha.ZKFailoverController
Successfully transitioned NameNode at prod-node015.lol.ru/10.66.49.155:8020 to standby state
04:00:28.295 INFO org.apache.zookeeper.ClientCnxn
Unable to read additional data from server sessionid 0x640f2b61aa60001, likely server has closed socket, closing socket connection and attempting reconnect
04:00:28.398 INFO org.apache.hadoop.ha.ActiveStandbyElector
Session disconnected. Entering neutral mode...
04:00:28.861 INFO org.apache.zookeeper.ClientCnxn
Opening socket connection to server prod-node040.lol.ru/10.66.49.207:2181. Will not attempt to authenticate using SASL (Unable to locate a login configuration)
04:00:28.862 INFO org.apache.zookeeper.ClientCnxn
Socket connection established to prod-node040.lol.ru/10.66.49.207:2181, initiating session
04:00:28.864 INFO org.apache.zookeeper.ClientCnxn
Session establishment complete on server prod-node040.lol.ru/10.66.49.207:2181, sessionid = 0x640f2b61aa60001, negotiated timeout = 5000
04:00:28.866 INFO org.apache.hadoop.ha.ActiveStandbyElector
Session connected.
04:00:28.875 INFO org.apache.hadoop.ha.ZKFailoverController
ZK Election indicated that NameNode at prod-node015.lol.ru/10.66.49.155:8020 should become standby
04:00:28.879 INFO org.apache.hadoop.ha.ZKFailoverController
Successfully transitioned NameNode at prod-node015.lol.ru/10.66.49.155:8020 to standby state
04:00:32.986 INFO org.apache.zookeeper.ClientCnxn
Unable to read additional data from server sessionid 0x640f2b61aa60001, likely server has closed socket, closing socket connection and attempting reconnect
04:00:33.090 INFO org.apache.hadoop.ha.ActiveStandbyElector
Session disconnected. Entering neutral mode...
04:00:33.420 INFO org.apache.zookeeper.ClientCnxn
Opening socket connection to server prod-node018.lol.ru/10.66.49.161:2181. Will not attempt to authenticate using SASL (Unable to locate a login configuration)
04:00:33.421 INFO org.apache.zookeeper.ClientCnxn
Socket connection established to prod-node018.lol.ru/10.66.49.161:2181, initiating session
04:00:33.422 INFO org.apache.zookeeper.ClientCnxn
Session establishment complete on server prod-node018.lol.ru/10.66.49.161:2181, sessionid = 0x640f2b61aa60001, negotiated timeout = 5000
04:00:33.425 INFO org.apache.hadoop.ha.ActiveStandbyElector
Session connected.
04:00:33.433 INFO org.apache.zookeeper.ClientCnxn
Unable to read additional data from server sessionid 0x640f2b61aa60001, likely server has closed socket, closing socket connection and attempting reconnect
04:00:33.542 INFO org.apache.hadoop.ha.ActiveStandbyElector
Session disconnected. Entering neutral mode...
04:00:34.259 INFO org.apache.zookeeper.ClientCnxn
Opening socket connection to server prod-node039.lol.ru/10.66.49.205:2181. Will not attempt to authenticate using SASL (Unable to locate a login configuration)
04:00:34.259 INFO org.apache.zookeeper.ClientCnxn
Socket connection established to prod-node039.lol.ru/10.66.49.205:2181, initiating session
04:00:34.260 INFO org.apache.zookeeper.ClientCnxn
Unable to read additional data from server sessionid 0x640f2b61aa60001, likely server has closed socket, closing socket connection and attempting reconnect
04:00:35.312 INFO org.apache.zookeeper.ClientCnxn
Opening socket connection to server prod-node031.lol.ru/10.66.49.189:2181. Will not attempt to authenticate using SASL (Unable to locate a login configuration)
04:00:35.313 WARN org.apache.zookeeper.ClientCnxn
Session 0x640f2b61aa60001 for server null, unexpected error, closing socket connection and attempting reconnect
java.net.ConnectException: Connection refused
at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:599)
at org.apache.zookeeper.ClientCnxnSocketNIO.doTransport(ClientCnxnSocketNIO.java:350)
at org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:1068)
04:00:35.503 INFO org.apache.zookeeper.ClientCnxn
Opening socket connection to server prod-node017.lol.ru/10.66.49.159:2181. Will not attempt to authenticate using SASL (Unable to locate a login configuration)
04:00:35.503 WARN org.apache.zookeeper.ClientCnxn
Session 0x640f2b61aa60001 for server null, unexpected error, closing socket connection and attempting reconnect
java.net.ConnectException: Connection refused
at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:599)
at org.apache.zookeeper.ClientCnxnSocketNIO.doTransport(ClientCnxnSocketNIO.java:350)
at org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:1068)
04:00:35.607 FATAL org.apache.hadoop.ha.ActiveStandbyElector
Received stat error from Zookeeper. code:CONNECTIONLOSS. Not retrying further znode monitoring connection errors.
04:00:36.304 INFO org.apache.zookeeper.ZooKeeper
Session: 0x640f2b61aa60001 closed
04:00:36.304 FATAL org.apache.hadoop.ha.ZKFailoverController
Fatal error occurred:Received stat error from Zookeeper. code:CONNECTIONLOSS. Not retrying further znode monitoring connection errors.
04:00:36.305 INFO org.apache.zookeeper.ClientCnxn
EventThread shut down
04:00:36.305 INFO org.apache.hadoop.ipc.Server
Stopping server on 8019
04:00:36.305 INFO org.apache.hadoop.ha.ActiveStandbyElector
Yielding from election
04:00:36.306 INFO org.apache.hadoop.ha.HealthMonitor
Stopping HealthMonitor thread
04:00:36.305 INFO org.apache.hadoop.ipc.Server
Stopping IPC Server listener on 8019
04:00:36.305 INFO org.apache.hadoop.ipc.Server
Stopping IPC Server Responder
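To make the log above easier to scan, here is a minimal sketch of a helper that lists each ZooKeeper server the client tried and what happened to that attempt. It is my own throwaway script, not part of Hadoop or ZooKeeper: the function name `summarize_zk_attempts` and the outcome labels are made up for illustration, and it assumes the log lines are in chronological order and use the same message texts as shown above.

```python
import re

# Matches the ZooKeeper client message
# "Opening socket connection to server host/ip:port".
OPEN_RE = re.compile(
    r"Opening socket connection to server ([^\s/]+)/([\d.]+):(\d+)"
)


def summarize_zk_attempts(log_text):
    """Return [(host, outcome), ...] for each connection attempt, in log order.

    Hypothetical helper: the outcome strings are illustrative labels only.
    """
    attempts = []
    for line in log_text.splitlines():
        match = OPEN_RE.search(line)
        if match:
            # A new attempt starts; the outcome is unknown until a later line.
            attempts.append([match.group(1), "no response logged"])
        elif not attempts:
            # Ignore anything logged before the first connection attempt.
            continue
        elif "Session establishment complete" in line:
            attempts[-1][1] = "session established"
        elif "Connection refused" in line:
            attempts[-1][1] = "connection refused"
        elif "Unable to read additional data" in line:
            if attempts[-1][1] == "session established":
                attempts[-1][1] = "session established, then dropped"
            else:
                attempts[-1][1] = "closed by server before session"
    return [tuple(attempt) for attempt in attempts]


# Short excerpt taken from the role log above.
SAMPLE = """\
Opening socket connection to server prod-node039.lol.ru/10.66.49.205:2181. Will not attempt to authenticate using SASL (Unable to locate a login configuration)
Socket connection established to prod-node039.lol.ru/10.66.49.205:2181, initiating session
Unable to read additional data from server sessionid 0x640f2b61aa60001, likely server has closed socket, closing socket connection and attempting reconnect
Opening socket connection to server prod-node031.lol.ru/10.66.49.189:2181. Will not attempt to authenticate using SASL (Unable to locate a login configuration)
java.net.ConnectException: Connection refused
"""

for host, outcome in summarize_zk_attempts(SAMPLE):
    print(host, "->", outcome)
```

Run against the full role log, this should show the client going through prod-node040, prod-node018, prod-node039, prod-node031 and prod-node017 within a few seconds, with the last two attempts refused right before the FATAL line.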
What is wrong?