upo
Explorer
Posts: 15
Registered: ‎05-07-2015

NameNode fails over frequently

I have been running CDH 5.4.0 under CM 5.4 for about a year, and no NameNode auto failover occurred except once, when the NameNode heap size was too low and I tuned it up to 32 GB.

Recently, an auto failover has been occurring once or twice a week, and I found nothing useful in the NameNode, ZKFC, or ZooKeeper logs. Can anyone offer a clue or a suspicion that would help me trace this problem?

 

Posts: 1,824
Kudos: 406
Solutions: 292
Registered: ‎07-31-2013

Re: NameNode fails over frequently

I'd start by looking at the FailoverController (ZKFC) logs of the NameNode that transitioned from Active to Standby. The failover is driven purely by the FCs, and their logs will show why the FC determined the NN was no longer fit to be Active before it triggered the failover.
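
On a CM-managed cluster the FailoverController role log sits alongside the other HDFS role logs; the path below assumes a default CDH layout, so adjust it to your install. A quick way to pull out just the health-state changes and fencing decisions:

grep -iE "SERVICE_NOT_RESPONDING|SERVICE_UNHEALTHY|fencing|transitioned" /var/log/hadoop-hdfs/hadoop-cmf-hdfs-FAILOVERCONTROLLER-*.log.out
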
New Contributor
Posts: 5
Registered: ‎05-09-2016

Re: NameNode fails over frequently

Hi Harsh,

 

I have the same problem in my cluster running CDH 5.7.2.

Could you please guide me on this? What can be done to avoid it?

 

FAILOVER CONTROLLER LOGS:

------------------------------------------

 

2016-10-12 12:36:57,428 WARN org.apache.hadoop.ha.HealthMonitor: Transport-level exception trying to monitor health of NameNode at namenode1:8022: java.net.SocketTimeoutException: 45000 millis timeout while waiting for channel to be ready for read. ch : java.nio.channels.SocketChannel[connected local=/x.x.x.x:54851 remote=namenode1:8022] Call From namenode1 to namenode1:8022 failed on socket timeout exception: java.net.SocketTimeoutException: 45000 millis timeout while waiting for channel to be ready for read. ch : java.nio.channels.SocketChannel[connected local=/x.x.x.x:54851 remote=namenode1:8022]; For more details see: http://wiki.apache.org/hadoop/SocketTimeout
2016-10-12 12:36:57,428 INFO org.apache.hadoop.ha.HealthMonitor: Entering state SERVICE_NOT_RESPONDING
2016-10-12 12:36:57,429 INFO org.apache.hadoop.ha.ZKFailoverController: Local service NameNode at namenode1:8022 entered state: SERVICE_NOT_RESPONDING
2016-10-12 12:37:17,453 WARN org.apache.hadoop.hdfs.tools.DFSZKFailoverController: Can't get local NN thread dump due to Read timed out
2016-10-12 12:37:17,453 INFO org.apache.hadoop.ha.ZKFailoverController: Quitting master election for NameNode at namenode1:8022 and marking that fencing is necessary
2016-10-12 12:37:17,453 INFO org.apache.hadoop.ha.ActiveStandbyElector: Yielding from election
2016-10-12 12:37:17,456 INFO org.apache.zookeeper.ZooKeeper: Session: 0x2572780883a39b5 closed
2016-10-12 12:37:17,456 WARN org.apache.hadoop.ha.ActiveStandbyElector: Ignoring stale result from old client with sessionId 0x2572780883a39b5
2016-10-12 12:37:17,456 INFO org.apache.zookeeper.ClientCnxn: EventThread shut down
2016-10-12 12:37:33,291 INFO org.apache.hadoop.ha.HealthMonitor: Entering state SERVICE_HEALTHY
2016-10-12 12:37:33,291 INFO org.apache.hadoop.ha.ZKFailoverController: Local service NameNode at namenode1:8022 entered state: SERVICE_HEALTHY
2016-10-12 12:37:33,295 INFO org.apache.zookeeper.ZooKeeper: Initiating client connection, connectString=namenode1:2181,namenode2:2181,resourcemanger1:2181 sessionTimeout=5000 watcher=org.apache.hadoop.ha.ActiveStandbyElector$WatcherWithClientRef@4d2b3d47
2016-10-12 12:37:33,303 INFO org.apache.zookeeper.ClientCnxn: Opening socket connection to server namenode2:2181. Will not attempt to authenticate using SASL (unknown error)
2016-10-12 12:37:33,303 INFO org.apache.zookeeper.ClientCnxn: Socket connection established, initiating session, client: /x.x.x.x:60657, server: namenode2:2181
2016-10-12 12:37:33,304 INFO org.apache.zookeeper.ClientCnxn: Session establishment complete on server namenode2:2181, sessionid = 0x357278087a35a6b, negotiated timeout = 5000
2016-10-12 12:37:33,308 INFO org.apache.hadoop.ha.ActiveStandbyElector: Session connected.
2016-10-12 12:37:33,311 INFO org.apache.hadoop.ha.ZKFailoverController: ZK Election indicated that NameNode at namenode1:8022 should become standby
2016-10-12 12:37:33,595 INFO org.apache.hadoop.ha.ZKFailoverController: Successfully transitioned NameNode at namenode1:8022 to standby state

Cloudera Employee
Posts: 47
Registered: ‎08-16-2016

Re: NameNode fails over frequently

Your ZKFC log indicates it can't reach the local NameNode. This can happen in several scenarios: the NN process was terminated, or the NN is under heavy load or stuck in long garbage-collection pauses and stops responding. Cross-reference the NN log and NN status to confirm which it is.
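
If it does turn out to be GC (the NameNode log's JvmPauseMonitor warnings about detected JVM pauses are a good indicator), the usual fixes are tuning the NN heap and collector, and optionally giving the ZKFC a little more slack before it declares the NN unresponsive. A rough sketch of the relevant core-site.xml properties (set via the CM safety valve if you use Cloudera Manager); the values below are illustrative assumptions, not recommendations:

<property>
  <!-- ZKFC HealthMonitor RPC timeout; the "45000 millis timeout" in the log above is this setting at its default -->
  <name>ha.health-monitor.rpc-timeout.ms</name>
  <value>90000</value>
</property>
<property>
  <!-- ZKFC's ZooKeeper session timeout; 5000 ms in the log above ("negotiated timeout = 5000") -->
  <name>ha.zookeeper.session-timeout.ms</name>
  <value>15000</value>
</property>

Raising these only buys headroom; the underlying GC or load issue still needs to be addressed.
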
