Expert Contributor

NN HA on journal quorum. Failover controller sometimes goes down

Hi, we have CDH 4.3 with NameNode HA based on a JournalNode quorum (QJM).

The failover controller sometimes goes down, and I can't find the reason why. We have two NameNodes: prod-node015 and prod-node033.

prod-node017 is one of the ZooKeeper instances.
Here are some logs:

 

STDERR:

+ exec /opt/cloudera/parcels/CDH-4.3.0-1.cdh4.3.0.p0.22/lib/hadoop-hdfs/bin/hdfs --config /var/run/cloudera-scm-agent/process/5984-hdfs-FAILOVERCONTROLLER zkfc
Exception in thread "main" java.lang.RuntimeException: ZK Failover Controller failed: Received stat error from Zookeeper. code:CONNECTIONLOSS. Not retrying further znode monitoring connection errors.
	at org.apache.hadoop.ha.ZKFailoverController.mainLoop(ZKFailoverController.java:359)
	at org.apache.hadoop.ha.ZKFailoverController.doRun(ZKFailoverController.java:231)
	at org.apache.hadoop.ha.ZKFailoverController.access$000(ZKFailoverController.java:58)
	at org.apache.hadoop.ha.ZKFailoverController$1.run(ZKFailoverController.java:165)
	at org.apache.hadoop.ha.ZKFailoverController$1.run(ZKFailoverController.java:161)
	at org.apache.hadoop.security.SecurityUtil.doAsLoginUserOrFatal(SecurityUtil.java:452)
	at org.apache.hadoop.ha.ZKFailoverController.run(ZKFailoverController.java:161)
	at org.apache.hadoop.hdfs.tools.DFSZKFailoverController.main(DFSZKFailoverController.java:175)

Role log details:



04:01:53.011    INFO    org.apache.zookeeper.ClientCnxn    
Opening socket connection to server prod-node017.lol.ru/10.66.49.159:2181. Will not attempt to authenticate using SASL (Unable to locate a login configuration)
04:01:53.011    INFO    org.apache.zookeeper.ClientCnxn    
Socket connection established to prod-node017.lol.ru/10.66.49.159:2181, initiating session
04:01:53.029    INFO    org.apache.zookeeper.ClientCnxn    
Session establishment complete on server prod-node017.lol.ru/10.66.49.159:2181, sessionid = 0x640f2b61aa60001, negotiated timeout = 5000
04:01:53.032    INFO    org.apache.hadoop.ha.ActiveStandbyElector    
Session connected.
04:01:53.044    INFO    org.apache.hadoop.ha.ZKFailoverController    
ZK Election indicated that NameNode at prod-node015.lol.ru/10.66.49.155:8020 should become standby
04:01:53.047    INFO    org.apache.hadoop.ha.ZKFailoverController    
Successfully transitioned NameNode at prod-node015.lol.ru/10.66.49.155:8020 to standby state
04:00:28.295    INFO    org.apache.zookeeper.ClientCnxn    
Unable to read additional data from server sessionid 0x640f2b61aa60001, likely server has closed socket, closing socket connection and attempting reconnect
04:00:28.398    INFO    org.apache.hadoop.ha.ActiveStandbyElector    
Session disconnected. Entering neutral mode...
04:00:28.861    INFO    org.apache.zookeeper.ClientCnxn    
Opening socket connection to server prod-node040.lol.ru/10.66.49.207:2181. Will not attempt to authenticate using SASL (Unable to locate a login configuration)
04:00:28.862    INFO    org.apache.zookeeper.ClientCnxn    
Socket connection established to prod-node040.lol.ru/10.66.49.207:2181, initiating session
04:00:28.864    INFO    org.apache.zookeeper.ClientCnxn    
Session establishment complete on server prod-node040.lol.ru/10.66.49.207:2181, sessionid = 0x640f2b61aa60001, negotiated timeout = 5000
04:00:28.866    INFO    org.apache.hadoop.ha.ActiveStandbyElector    
Session connected.
04:00:28.875    INFO    org.apache.hadoop.ha.ZKFailoverController    
ZK Election indicated that NameNode at prod-node015.lol.ru/10.66.49.155:8020 should become standby
04:00:28.879    INFO    org.apache.hadoop.ha.ZKFailoverController    
Successfully transitioned NameNode at prod-node015.lol.ru/10.66.49.155:8020 to standby state
04:00:32.986    INFO    org.apache.zookeeper.ClientCnxn    
Unable to read additional data from server sessionid 0x640f2b61aa60001, likely server has closed socket, closing socket connection and attempting reconnect
04:00:33.090    INFO    org.apache.hadoop.ha.ActiveStandbyElector    
Session disconnected. Entering neutral mode...
04:00:33.420    INFO    org.apache.zookeeper.ClientCnxn    
Opening socket connection to server prod-node018.lol.ru/10.66.49.161:2181. Will not attempt to authenticate using SASL (Unable to locate a login configuration)
04:00:33.421    INFO    org.apache.zookeeper.ClientCnxn    
Socket connection established to prod-node018.lol.ru/10.66.49.161:2181, initiating session
04:00:33.422    INFO    org.apache.zookeeper.ClientCnxn    
Session establishment complete on server prod-node018.lol.ru/10.66.49.161:2181, sessionid = 0x640f2b61aa60001, negotiated timeout = 5000
04:00:33.425    INFO    org.apache.hadoop.ha.ActiveStandbyElector    
Session connected.
04:00:33.433    INFO    org.apache.zookeeper.ClientCnxn    
Unable to read additional data from server sessionid 0x640f2b61aa60001, likely server has closed socket, closing socket connection and attempting reconnect
04:00:33.542    INFO    org.apache.hadoop.ha.ActiveStandbyElector    
Session disconnected. Entering neutral mode...
04:00:34.259    INFO    org.apache.zookeeper.ClientCnxn    
Opening socket connection to server prod-node039.lol.ru/10.66.49.205:2181. Will not attempt to authenticate using SASL (Unable to locate a login configuration)
04:00:34.259    INFO    org.apache.zookeeper.ClientCnxn    
Socket connection established to prod-node039.lol.ru/10.66.49.205:2181, initiating session
04:00:34.260    INFO    org.apache.zookeeper.ClientCnxn    
Unable to read additional data from server sessionid 0x640f2b61aa60001, likely server has closed socket, closing socket connection and attempting reconnect
04:00:35.312    INFO    org.apache.zookeeper.ClientCnxn    
Opening socket connection to server prod-node031.lol.ru/10.66.49.189:2181. Will not attempt to authenticate using SASL (Unable to locate a login configuration)
04:00:35.313    WARN    org.apache.zookeeper.ClientCnxn    
Session 0x640f2b61aa60001 for server null, unexpected error, closing socket connection and attempting reconnect
java.net.ConnectException: Connection refused
    at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
    at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:599)
    at org.apache.zookeeper.ClientCnxnSocketNIO.doTransport(ClientCnxnSocketNIO.java:350)
    at org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:1068)
04:00:35.503    INFO    org.apache.zookeeper.ClientCnxn    
Opening socket connection to server prod-node017.lol.ru/10.66.49.159:2181. Will not attempt to authenticate using SASL (Unable to locate a login configuration)
04:00:35.503    WARN    org.apache.zookeeper.ClientCnxn    
Session 0x640f2b61aa60001 for server null, unexpected error, closing socket connection and attempting reconnect
java.net.ConnectException: Connection refused
    at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
    at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:599)
    at org.apache.zookeeper.ClientCnxnSocketNIO.doTransport(ClientCnxnSocketNIO.java:350)
    at org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:1068)
04:00:35.607    FATAL    org.apache.hadoop.ha.ActiveStandbyElector    
Received stat error from Zookeeper. code:CONNECTIONLOSS. Not retrying further znode monitoring connection errors.
04:00:36.304    INFO    org.apache.zookeeper.ZooKeeper    
Session: 0x640f2b61aa60001 closed
04:00:36.304    FATAL    org.apache.hadoop.ha.ZKFailoverController    
Fatal error occurred:Received stat error from Zookeeper. code:CONNECTIONLOSS. Not retrying further znode monitoring connection errors.
04:00:36.305    INFO    org.apache.zookeeper.ClientCnxn    
EventThread shut down
04:00:36.305    INFO    org.apache.hadoop.ipc.Server    
Stopping server on 8019
04:00:36.305    INFO    org.apache.hadoop.ha.ActiveStandbyElector    
Yielding from election
04:00:36.306    INFO    org.apache.hadoop.ha.HealthMonitor    
Stopping HealthMonitor thread
04:00:36.305    INFO    org.apache.hadoop.ipc.Server    
Stopping IPC Server listener on 8019
04:00:36.305    INFO    org.apache.hadoop.ipc.Server    
Stopping IPC Server Responder

 

What is wrong?
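
In case it helps, the ZKFC negotiates the default 5 second ZooKeeper session ("negotiated timeout = 5000" in the log above). The settings that govern this can be checked in the config directory the ZKFC is actually started with (the CM process directory from the exec line in STDERR); roughly:

grep -A1 'ha\.zookeeper' \
  /var/run/cloudera-scm-agent/process/5984-hdfs-FAILOVERCONTROLLER/core-site.xml
# ha.zookeeper.quorum             -> the ZK servers the ZKFC talks to
# ha.zookeeper.session-timeout.ms -> ZK session timeout, default 5000 ms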

 


Re: NN HA on journal quorum. Failover controller sometimes goes down

It is likely that ZK rejected the session connections, or perhaps had trouble staying in a healthy service state. Exploring the ZK server logs may reveal more (if this is continuing to be an issue).
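
For example, a quick health probe of each ZooKeeper server (hostnames taken from the log above, client port 2181) could look something like this; a healthy server answers "imok" to ruok, and stat shows its mode and connection counts:

for zk in prod-node017 prod-node018 prod-node031 prod-node039 prod-node040; do
  echo "== ${zk} =="
  echo ruok | nc ${zk}.lol.ru 2181; echo          # expect "imok"
  echo stat | nc ${zk}.lol.ru 2181 | head -5      # mode (leader/follower), latency, connections
done

The ZooKeeper server log itself (typically somewhere under /var/log/zookeeper/ on a CM-managed host, path may differ) around 04:00 should show why the sessions were dropped or the connections refused.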
New Contributor

Re: NN HA on journal quorum. Failover controller sometimes goes down

Are there other services running in the cluster that use ZooKeeper? Is the problem specific to the NameNode?
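
To see what is actually holding connections on a ZK server (and whether a per-client connection cap is in play), something along these lines might help; the hostname is from the logs above, and the zoo.cfg path is the usual package location, which may differ under Cloudera Manager:

echo cons | nc prod-node017.lol.ru 2181          # lists every current client connection with stats
grep maxClientCnxns /etc/zookeeper/conf/zoo.cfg  # per-client-IP connection cap (default 60 in ZK 3.4.x)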

Expert Contributor

Re: NN HA on journal quorum. Failover controller sometimes goes down

Yes, there are other services using ZooKeeper: Hive.

 
