NN HA on journal quorum. Failover controller sometimes goes down

Hi, we have CDH 4.3 with NameNode HA based on a journal quorum.

The failover controller sometimes goes down, and I can't find the reason why. We have two NameNodes: prod-node015 and prod-node033.

prod-node017 is one of the ZooKeeper instances.
Here are some logs:

 

STDERR:

+ exec /opt/cloudera/parcels/CDH-4.3.0-1.cdh4.3.0.p0.22/lib/hadoop-hdfs/bin/hdfs --config /var/run/cloudera-scm-agent/process/5984-hdfs-FAILOVERCONTROLLER zkfc
Exception in thread "main" java.lang.RuntimeException: ZK Failover Controller failed: Received stat error from Zookeeper. code:CONNECTIONLOSS. Not retrying further znode monitoring connection errors.
	at org.apache.hadoop.ha.ZKFailoverController.mainLoop(ZKFailoverController.java:359)
	at org.apache.hadoop.ha.ZKFailoverController.doRun(ZKFailoverController.java:231)
	at org.apache.hadoop.ha.ZKFailoverController.access$000(ZKFailoverController.java:58)
	at org.apache.hadoop.ha.ZKFailoverController$1.run(ZKFailoverController.java:165)
	at org.apache.hadoop.ha.ZKFailoverController$1.run(ZKFailoverController.java:161)
	at org.apache.hadoop.security.SecurityUtil.doAsLoginUserOrFatal(SecurityUtil.java:452)
	at org.apache.hadoop.ha.ZKFailoverController.run(ZKFailoverController.java:161)
	at org.apache.hadoop.hdfs.tools.DFSZKFailoverController.main(DFSZKFailoverController.java:175)

Role log details:



04:01:53.011    INFO    org.apache.zookeeper.ClientCnxn    
Opening socket connection to server prod-node017.lol.ru/10.66.49.159:2181. Will not attempt to authenticate using SASL (Unable to locate a login configuration)
04:01:53.011    INFO    org.apache.zookeeper.ClientCnxn    
Socket connection established to prod-node017.lol.ru/10.66.49.159:2181, initiating session
04:01:53.029    INFO    org.apache.zookeeper.ClientCnxn    
Session establishment complete on server prod-node017.lol.ru/10.66.49.159:2181, sessionid = 0x640f2b61aa60001, negotiated timeout = 5000
04:01:53.032    INFO    org.apache.hadoop.ha.ActiveStandbyElector    
Session connected.
04:01:53.044    INFO    org.apache.hadoop.ha.ZKFailoverController    
ZK Election indicated that NameNode at prod-node015.lol.ru/10.66.49.155:8020 should become standby
04:01:53.047    INFO    org.apache.hadoop.ha.ZKFailoverController    
Successfully transitioned NameNode at prod-node015.lol.ru/10.66.49.155:8020 to standby state
04:00:28.295    INFO    org.apache.zookeeper.ClientCnxn    
Unable to read additional data from server sessionid 0x640f2b61aa60001, likely server has closed socket, closing socket connection and attempting reconnect
04:00:28.398    INFO    org.apache.hadoop.ha.ActiveStandbyElector    
Session disconnected. Entering neutral mode...
04:00:28.861    INFO    org.apache.zookeeper.ClientCnxn    
Opening socket connection to server prod-node040.lol.ru/10.66.49.207:2181. Will not attempt to authenticate using SASL (Unable to locate a login configuration)
04:00:28.862    INFO    org.apache.zookeeper.ClientCnxn    
Socket connection established to prod-node040.lol.ru/10.66.49.207:2181, initiating session
04:00:28.864    INFO    org.apache.zookeeper.ClientCnxn    
Session establishment complete on server prod-node040.lol.ru/10.66.49.207:2181, sessionid = 0x640f2b61aa60001, negotiated timeout = 5000
04:00:28.866    INFO    org.apache.hadoop.ha.ActiveStandbyElector    
Session connected.
04:00:28.875    INFO    org.apache.hadoop.ha.ZKFailoverController    
ZK Election indicated that NameNode at prod-node015.lol.ru/10.66.49.155:8020 should become standby
04:00:28.879    INFO    org.apache.hadoop.ha.ZKFailoverController    
Successfully transitioned NameNode at prod-node015.lol.ru/10.66.49.155:8020 to standby state
04:00:32.986    INFO    org.apache.zookeeper.ClientCnxn    
Unable to read additional data from server sessionid 0x640f2b61aa60001, likely server has closed socket, closing socket connection and attempting reconnect
04:00:33.090    INFO    org.apache.hadoop.ha.ActiveStandbyElector    
Session disconnected. Entering neutral mode...
04:00:33.420    INFO    org.apache.zookeeper.ClientCnxn    
Opening socket connection to server prod-node018.lol.ru/10.66.49.161:2181. Will not attempt to authenticate using SASL (Unable to locate a login configuration)
04:00:33.421    INFO    org.apache.zookeeper.ClientCnxn    
Socket connection established to prod-node018.lol.ru/10.66.49.161:2181, initiating session
04:00:33.422    INFO    org.apache.zookeeper.ClientCnxn    
Session establishment complete on server prod-node018.lol.ru/10.66.49.161:2181, sessionid = 0x640f2b61aa60001, negotiated timeout = 5000
04:00:33.425    INFO    org.apache.hadoop.ha.ActiveStandbyElector    
Session connected.
04:00:33.433    INFO    org.apache.zookeeper.ClientCnxn    
Unable to read additional data from server sessionid 0x640f2b61aa60001, likely server has closed socket, closing socket connection and attempting reconnect
04:00:33.542    INFO    org.apache.hadoop.ha.ActiveStandbyElector    
Session disconnected. Entering neutral mode...
04:00:34.259    INFO    org.apache.zookeeper.ClientCnxn    
Opening socket connection to server prod-node039.lol.ru/10.66.49.205:2181. Will not attempt to authenticate using SASL (Unable to locate a login configuration)
04:00:34.259    INFO    org.apache.zookeeper.ClientCnxn    
Socket connection established to prod-node039.lol.ru/10.66.49.205:2181, initiating session
04:00:34.260    INFO    org.apache.zookeeper.ClientCnxn    
Unable to read additional data from server sessionid 0x640f2b61aa60001, likely server has closed socket, closing socket connection and attempting reconnect
04:00:35.312    INFO    org.apache.zookeeper.ClientCnxn    
Opening socket connection to server prod-node031.lol.ru/10.66.49.189:2181. Will not attempt to authenticate using SASL (Unable to locate a login configuration)
04:00:35.313    WARN    org.apache.zookeeper.ClientCnxn    
Session 0x640f2b61aa60001 for server null, unexpected error, closing socket connection and attempting reconnect
java.net.ConnectException: Connection refused
    at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
    at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:599)
    at org.apache.zookeeper.ClientCnxnSocketNIO.doTransport(ClientCnxnSocketNIO.java:350)
    at org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:1068)
04:00:35.503    INFO    org.apache.zookeeper.ClientCnxn    
Opening socket connection to server prod-node017.lol.ru/10.66.49.159:2181. Will not attempt to authenticate using SASL (Unable to locate a login configuration)
04:00:35.503    WARN    org.apache.zookeeper.ClientCnxn    
Session 0x640f2b61aa60001 for server null, unexpected error, closing socket connection and attempting reconnect
java.net.ConnectException: Connection refused
    at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
    at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:599)
    at org.apache.zookeeper.ClientCnxnSocketNIO.doTransport(ClientCnxnSocketNIO.java:350)
    at org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:1068)
04:00:35.607    FATAL    org.apache.hadoop.ha.ActiveStandbyElector    
Received stat error from Zookeeper. code:CONNECTIONLOSS. Not retrying further znode monitoring connection errors.
04:00:36.304    INFO    org.apache.zookeeper.ZooKeeper    
Session: 0x640f2b61aa60001 closed
04:00:36.304    FATAL    org.apache.hadoop.ha.ZKFailoverController    
Fatal error occurred:Received stat error from Zookeeper. code:CONNECTIONLOSS. Not retrying further znode monitoring connection errors.
04:00:36.305    INFO    org.apache.zookeeper.ClientCnxn    
EventThread shut down
04:00:36.305    INFO    org.apache.hadoop.ipc.Server    
Stopping server on 8019
04:00:36.305    INFO    org.apache.hadoop.ha.ActiveStandbyElector    
Yielding from election
04:00:36.306    INFO    org.apache.hadoop.ha.HealthMonitor    
Stopping HealthMonitor thread
04:00:36.305    INFO    org.apache.hadoop.ipc.Server    
Stopping IPC Server listener on 8019
04:00:36.305    INFO    org.apache.hadoop.ipc.Server    
Stopping IPC Server Responder
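
To make the reconnect sequence in the role log above easier to follow, here is a minimal Python sketch that groups the ZooKeeper client messages by the server each attempt was opened against. It is only an illustration: the function name `summarize_attempts`, the event labels, and the `SAMPLE` text (four entries copied from the log above) are my own assumptions, not part of any Hadoop or ZooKeeper tooling.

```python
import re

# Matches the ZooKeeper client line that starts each connection attempt
# and captures the host name that precedes the "/ip:port" part.
OPENING = re.compile(r"Opening socket connection to server ([^/\s]+)/")

# Substrings seen in the log above, mapped to short labels (labels are illustrative).
EVENTS = [
    ("Session establishment complete", "session established"),
    ("Unable to read additional data", "socket closed by server"),
    ("Connection refused", "connection refused"),
]


def summarize_attempts(log_text):
    """Return a list of (host, [event labels]) in the order attempts appear."""
    attempts = []
    for line in log_text.splitlines():
        match = OPENING.search(line)
        if match:
            # A new attempt begins; later events attach to it.
            attempts.append((match.group(1), []))
            continue
        if not attempts:
            continue  # ignore events that precede the first attempt
        for marker, label in EVENTS:
            if marker in line:
                attempts[-1][1].append(label)
    return attempts


# Message lines copied from the role log above (timestamps omitted).
SAMPLE = """\
Opening socket connection to server prod-node039.lol.ru/10.66.49.205:2181. Will not attempt to authenticate using SASL (Unable to locate a login configuration)
Socket connection established to prod-node039.lol.ru/10.66.49.205:2181, initiating session
Unable to read additional data from server sessionid 0x640f2b61aa60001, likely server has closed socket, closing socket connection and attempting reconnect
Opening socket connection to server prod-node031.lol.ru/10.66.49.189:2181. Will not attempt to authenticate using SASL (Unable to locate a login configuration)
Session 0x640f2b61aa60001 for server null, unexpected error, closing socket connection and attempting reconnect
java.net.ConnectException: Connection refused
"""

for host, events in summarize_attempts(SAMPLE):
    print(host, "->", ", ".join(events) or "no result logged")
```

Run against the full role log, this kind of summary shows at a glance which ZooKeeper servers accepted the session, which closed the socket, and which refused the connection just before the FATAL CONNECTIONLOSS entry.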

 

What is wrong?

 
