Contributor
Posts: 29
Registered: 04-08-2016

CDH 5.7.0: HDFS High Availability NameNodes randomly fail over

We have two NameNodes (inf-v1 & inf-v2), and occasionally we've observed that the active NameNode became standby and the standby became active, with no changes made by us.

 

How can we determine the cause of this problem?

 

2016-06-30 10:48:35,376 INFO org.apache.hadoop.hdfs.tools.DFSZKFailoverController: -- Local NN thread dump -- 
Process Thread Dump: 
15 active threads
Thread 23 (Timer-0):
  State: TIMED_WAITING
  Blocked count: 0
  Waited count: 1
  Stack:
    java.lang.Object.wait(Native Method)
    java.util.TimerThread.mainLoop(Unknown Source)
    java.util.TimerThread.run(Unknown Source)
Thread 22 (229116977@qtp-1041125371-1 - Acceptor0 HttpServer2$SelectChannelConnectorWithSafeStartup@inf-v2.archive.com:50070):
  State: RUNNABLE
  Blocked count: 1
  Waited count: 1
  Stack:
    sun.nio.ch.EPollArrayWrapper.epollWait(Native Method)
    sun.nio.ch.EPollArrayWrapper.poll(Unknown Source)
    sun.nio.ch.EPollSelectorImpl.doSelect(Unknown Source)
    sun.nio.ch.SelectorImpl.lockAndDoSelect(Unknown Source)
    sun.nio.ch.SelectorImpl.select(Unknown Source)
    org.mortbay.io.nio.SelectorManager$SelectSet.doSelect(SelectorManager.java:511)
    org.mortbay.io.nio.SelectorManager.doSelect(SelectorManager.java:193)
    org.mortbay.jetty.nio.SelectChannelConnector.accept(SelectChannelConnector.java:124)
    org.mortbay.jetty.AbstractConnector$Acceptor.run(AbstractConnector.java:708)
    org.mortbay.thread.QueuedThreadPool$PoolThread.run(QueuedThreadPool.java:582)
Thread 21 (53050680@qtp-1041125371-0):
  State: RUNNABLE
  Blocked count: 2
  Waited count: 1
  Stack:
    sun.management.ThreadImpl.getThreadInfo1(Native Method)
    sun.management.ThreadImpl.getThreadInfo(Unknown Source)
    sun.management.ThreadImpl.getThreadInfo(Unknown Source)
    org.apache.hadoop.util.ReflectionUtils.printThreadInfo(ReflectionUtils.java:165)
    org.apache.hadoop.http.HttpServer2$StackServlet.doGet(HttpServer2.java:1155)
    javax.servlet.http.HttpServlet.service(HttpServlet.java:707)
    javax.servlet.http.HttpServlet.service(HttpServlet.java:820)
    org.mortbay.jetty.servlet.ServletHolder.handle(ServletHolder.java:511)
    org.mortbay.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1221)
    org.apache.hadoop.http.lib.StaticUserWebFilter$StaticUserFilter.doFilter(StaticUserWebFilter.java:109)
    org.mortbay.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1212)
    org.apache.hadoop.http.HttpServer2$QuotingInputFilter.doFilter(HttpServer2.java:1286)
    org.mortbay.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1212)
    org.apache.hadoop.http.NoCacheFilter.doFilter(NoCacheFilter.java:45)
    org.mortbay.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1212)
    org.apache.hadoop.http.NoCacheFilter.doFilter(NoCacheFilter.java:45)
    org.mortbay.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1212)
    org.mortbay.jetty.servlet.ServletHandler.handle(ServletHandler.java:399)
    org.mortbay.jetty.security.SecurityHandler.handle(SecurityHandler.java:216)
    org.mortbay.jetty.servlet.SessionHandler.handle(SessionHandler.java:182)
Thread 20 (pool-2-thread-1):
  State: TIMED_WAITING
  Blocked count: 0
  Waited count: 1
  Stack:
    sun.misc.Unsafe.park(Native Method)
    java.util.concurrent.locks.LockSupport.parkNanos(Unknown Source)
    java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.awaitNanos(Unknown Source)
    java.util.concurrent.ScheduledThreadPoolExecutor$DelayedWorkQueue.take(Unknown Source)
    java.util.concurrent.ScheduledThreadPoolExecutor$DelayedWorkQueue.take(Unknown Source)
    java.util.concurrent.ThreadPoolExecutor.getTask(Unknown Source)
    java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source)
    java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source)
    java.lang.Thread.run(Unknown Source)
Thread 18 (org.apache.hadoop.util.JvmPauseMonitor$Monitor@28badca2):
  State: TIMED_WAITING
  Blocked count: 1
  Waited count: 7
  Stack:
    java.lang.Thread.sleep(Native Method)
    org.apache.hadoop.util.JvmPauseMonitor$Monitor.run(JvmPauseMonitor.java:181)
    java.lang.Thread.run(Unknown Source)
Thread 17 (Timer for 'NameNode' metrics system):
  State: TIMED_WAITING
  Blocked count: 0
  Waited count: 1
  Stack:
    java.lang.Object.wait(Native Method)
    java.util.TimerThread.mainLoop(Unknown Source)
    java.util.TimerThread.run(Unknown Source)
Thread 14 (process reaper):
  State: TIMED_WAITING
  Blocked count: 1
  Waited count: 2
  Stack:
    sun.misc.Unsafe.park(Native Method)
    java.util.concurrent.locks.LockSupport.parkNanos(Unknown Source)
    java.util.concurrent.SynchronousQueue$TransferStack.awaitFulfill(Unknown Source)
    java.util.concurrent.SynchronousQueue$TransferStack.transfer(Unknown Source)
    java.util.concurrent.SynchronousQueue.poll(Unknown Source)
    java.util.concurrent.ThreadPoolExecutor.getTask(Unknown Source)
    java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source)
    java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source)
    java.lang.Thread.run(Unknown Source)
Thread 13 (event filterer):
  State: TIMED_WAITING
  Blocked count: 0
  Waited count: 74
  Stack:
    sun.misc.Unsafe.park(Native Method)
    java.util.concurrent.locks.LockSupport.parkNanos(Unknown Source)
    java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.awaitNanos(Unknown Source)
    java.util.concurrent.LinkedBlockingQueue.poll(Unknown Source)
    com.cloudera.cmf.eventcatcher.client.logs.LogEventProcessor.runFiltering(LogEventProcessor.java:132)
    com.cloudera.cmf.eventcatcher.client.logs.LogEventProcessor.access$000(LogEventProcessor.java:28)
    com.cloudera.cmf.eventcatcher.client.logs.LogEventProcessor$1.run(LogEventProcessor.java:81)
Thread 12 (RMI TCP Accept-0):
  State: RUNNABLE
  Blocked count: 0
  Waited count: 0
  Stack:
    java.net.PlainSocketImpl.socketAccept(Native Method)
    java.net.AbstractPlainSocketImpl.accept(Unknown Source)
    java.net.ServerSocket.implAccept(Unknown Source)
    java.net.ServerSocket.accept(Unknown Source)
    sun.management.jmxremote.LocalRMIServerSocketFactory$1.accept(Unknown Source)
    sun.rmi.transport.tcp.TCPTransport$AcceptLoop.executeAcceptLoop(Unknown Source)
    sun.rmi.transport.tcp.TCPTransport$AcceptLoop.run(Unknown Source)
    java.lang.Thread.run(Unknown Source)
Thread 11 (RMI TCP Accept-8064):
  State: RUNNABLE
  Blocked count: 0
  Waited count: 0
  Stack:
    java.net.PlainSocketImpl.socketAccept(Native Method)
    java.net.AbstractPlainSocketImpl.accept(Unknown Source)
    java.net.ServerSocket.implAccept(Unknown Source)
    java.net.ServerSocket.accept(Unknown Source)
    sun.rmi.transport.tcp.TCPTransport$AcceptLoop.executeAcceptLoop(Unknown Source)
    sun.rmi.transport.tcp.TCPTransport$AcceptLoop.run(Unknown Source)
    java.lang.Thread.run(Unknown Source)
Thread 10 (RMI TCP Accept-0):
  State: RUNNABLE
  Blocked count: 0
  Waited count: 0
  Stack:
    java.net.PlainSocketImpl.socketAccept(Native Method)
    java.net.AbstractPlainSocketImpl.accept(Unknown Source)
    java.net.ServerSocket.implAccept(Unknown Source)
    java.net.ServerSocket.accept(Unknown Source)
    sun.rmi.transport.tcp.TCPTransport$AcceptLoop.executeAcceptLoop(Unknown Source)
    sun.rmi.transport.tcp.TCPTransport$AcceptLoop.run(Unknown Source)
    java.lang.Thread.run(Unknown Source)
Thread 5 (Signal Dispatcher):
  State: RUNNABLE
  Blocked count: 0
  Waited count: 0
  Stack:
Thread 3 (Finalizer):
  State: WAITING
  Blocked count: 5
  Waited count: 2
  Waiting on java.lang.ref.ReferenceQueue$Lock@d17ffd
  Stack:
    java.lang.Object.wait(Native Method)
    java.lang.ref.ReferenceQueue.remove(Unknown Source)
    java.lang.ref.ReferenceQueue.remove(Unknown Source)
    java.lang.ref.Finalizer$FinalizerThread.run(Unknown Source)
Thread 2 (Reference Handler):
  State: WAITING
  Blocked count: 1
  Waited count: 2
  Waiting on java.lang.ref.Reference$Lock@3a29bb76
  Stack:
    java.lang.Object.wait(Native Method)
    java.lang.Object.wait(Object.java:503)
    java.lang.ref.Reference$ReferenceHandler.run(Unknown Source)
Thread 1 (main):
  State: RUNNABLE
  Blocked count: 4
  Waited count: 3
  Stack:
    java.util.zip.ZipFile.getEntry(Native Method)
    java.util.zip.ZipFile.getEntry(Unknown Source)
    java.util.jar.JarFile.getEntry(Unknown Source)
    java.util.jar.JarFile.getJarEntry(Unknown Source)
    sun.misc.URLClassPath$JarLoader.getResource(Unknown Source)
    sun.misc.URLClassPath$JarLoader.findResource(Unknown Source)
    sun.misc.URLClassPath.findResource(Unknown Source)
    java.net.URLClassLoader$2.run(Unknown Source)
    java.net.URLClassLoader$2.run(Unknown Source)
    java.security.AccessController.doPrivileged(Native Method)
    java.net.URLClassLoader.findResource(Unknown Source)
    java.lang.ClassLoader.getResource(Unknown Source)
    java.net.URLClassLoader.getResourceAsStream(Unknown Source)
    org.apache.xerces.parsers.SecuritySupport$6.run(Unknown Source)
    java.security.AccessController.doPrivileged(Native Method)
    org.apache.xerces.parsers.SecuritySupport.getResourceAsStream(Unknown Source)
    org.apache.xerces.parsers.ObjectFactory.findJarServiceProvider(Unknown Source)
    org.apache.xerces.parsers.ObjectFactory.createObject(Unknown Source)
    org.apache.xerces.parsers.ObjectFactory.createObject(Unknown Source)
    org.apache.xerces.parsers.DOMParser.<init>(Unknown Source)

 -- Local NN thread dump -- 

 

Posts: 1,903
Kudos: 435
Solutions: 305
Registered: 07-31-2013

Re: CDH 5.7.0: HDFS High Availability NameNodes randomly fail over

Could you pastebin your fuller NameNode and FailoverController logs, both from the time period during which this occurred?

The auto thread dump capture is strangely not showing any of the handler threads, and the main thread of the NameNode looks like it's still busy loading some jars or config files, which is a rather strange state to be in if it was already functional. It therefore doesn't complete the picture of the cause.
Contributor
Posts: 29
Registered: 04-08-2016

Re: CDH 5.7.0: HDFS High Availability NameNodes randomly fail over

Posts: 1,903
Kudos: 435
Solutions: 305
Registered: 07-31-2013

Re: CDH 5.7.0: HDFS High Availability NameNodes randomly fail over

Thank you. From the looks of it, someone/something manually restarted the NameNode process on both your machines at ~10:47:

2016-06-30 10:47:25,315 ERROR org.apache.hadoop.hdfs.server.namenode.NameNode: RECEIVED SIGNAL 15: SIGTERM
2016-06-30 10:47:25,322 INFO org.apache.hadoop.hdfs.server.namenode.NameNode: SHUTDOWN_MSG:
2016-06-30 10:48:29,563 INFO org.apache.hadoop.hdfs.server.namenode.NameNode: STARTUP_MSG:
2016-06-30 10:48:29,623 INFO org.apache.hadoop.hdfs.server.namenode.NameNode: registered UNIX signal handlers for [TERM, HUP, INT]

--

2016-06-30 10:47:25,134 ERROR org.apache.hadoop.hdfs.server.namenode.NameNode: RECEIVED SIGNAL 15: SIGTERM
2016-06-30 10:47:25,137 INFO org.apache.hadoop.hdfs.server.namenode.NameNode: SHUTDOWN_MSG:
2016-06-30 10:48:29,790 INFO org.apache.hadoop.hdfs.server.namenode.NameNode: STARTUP_MSG:
2016-06-30 10:48:29,877 INFO org.apache.hadoop.hdfs.server.namenode.NameNode: registered UNIX signal handlers for [TERM, HUP, INT]

So you're facing more than just a failover - you're looking at freshly restarted NameNode processes, with the active role simply switching over as a result of a new election.

Are you aware of any administrative activity that went on in CM during this time? You can view the Audits pane under each service to find out if someone issued a restart during that timeframe, for example. The SIGTERM is typically the standard way of shutting down, i.e. a regular kill signal sent to the process, such as when someone hits the Stop or Restart buttons in CM or manually sends it to the PID outside of CM (you have to be root or the hdfs user to do that successfully).
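
If it helps to confirm this outside of CM, you can also search the NameNode role logs directly for the shutdown and startup markers shown above. The path below is an assumption for a CM-managed cluster (role logs usually land under /var/log/hadoop-hdfs); adjust it to wherever your NameNode logs actually live:

# Look for externally triggered shutdowns and the subsequent startups in the
# NameNode role logs on each host (log path is an assumption; adjust as needed).
grep -E "RECEIVED SIGNAL|SHUTDOWN_MSG|STARTUP_MSG" /var/log/hadoop-hdfs/*NAMENODE*.log.out*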
Contributor
Posts: 29
Registered: 04-08-2016

Re: CDH 5.7.0: HDFS High Availability NameNodes randomly fail over

We are sure the NameNodes failed over, but it's difficult to find this in the logs as there is so much noise in there.

 

(That, and we were testing starting/stopping the system.)

 

Are there keywords to search for when this happens?

 

We also cannot find any kind of failover history in the SCM portal.

 

The UI is really lacking in this area.

 

Any suggestions on how to track down a proper instance of the failover event (a keyword to search for in the logs, etc.) would be appreciated.

 

Posts: 1,903
Kudos: 435
Solutions: 305
Registered: 07-31-2013

Re: CDH 5.7.0: HDFS High Availability NameNodes randomly fail over

The FailoverController is entirely responsible for switching the active role over to a standby NN. If you'd like to track what causes a FailoverController to elect a new active, the logs usually follow the pattern below, illustrated here with your case of a restart:

The ZKFC periodically runs a simple RPC against its local NameNode to check if it's still alive (it's considered alive if the RPC responds within 45s). When the RPC fails for whatever reason, a log like this is printed:

2016-06-30 10:48:35,096 WARN org.apache.hadoop.ha.HealthMonitor: Transport-level exception trying to monitor health of NameNode at inf-v2.archive.com/10.10.10.83:8022: java.net.ConnectException: Connection refused Call From inf-v2.archive.com/10.10.10.83 to inf-v2.archive.com:8022 failed on connection exception: java.net.ConnectException: Connection refused; For more details see: http://wiki.apache.org/hadoop/ConnectionRefused


(In the case above, the RPC failed because there was nothing listening on the NN's port, indicating that the NN was down when the call was made.)
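
If you want to exercise that same health check by hand, the haadmin utility has a -checkHealth command that issues the health-check RPC for you, and the 45s limit mentioned above comes from the ha.health-monitor.rpc-timeout.ms setting (45000 ms by default). The nn1/nn2 service IDs below are placeholders; use the IDs listed under dfs.ha.namenodes.<your nameservice>:

# Manually issue the health-check RPC against each NameNode
# (nn1/nn2 are placeholder NameNode IDs; substitute your own).
hdfs haadmin -checkHealth nn1
hdfs haadmin -checkHealth nn2

# The health-monitor RPC timeout; should print 45000 unless overridden.
hdfs getconf -confKey ha.health-monitor.rpc-timeout.ms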

This event triggers a shift in election states:

2016-06-30 10:48:35,101 INFO org.apache.hadoop.ha.HealthMonitor: Entering state SERVICE_NOT_RESPONDING
2016-06-30 10:48:35,101 INFO org.apache.hadoop.ha.ZKFailoverController: Local service NameNode at inf-v2.archive.com/10.10.10.83:8022 entered state: SERVICE_NOT_RESPONDING
2016-06-30 10:48:35,377 INFO org.apache.hadoop.ha.ZKFailoverController: Quitting master election for NameNode at inf-v2.archive.com/10.10.10.83:8022 and marking that fencing is necessary
2016-06-30 10:48:35,377 INFO org.apache.hadoop.ha.ActiveStandbyElector: Yielding from election

The ZKFC does this to prevent its NN from being considered as the new active: it triggers a new active election via ZK that will choose another NameNode, while first excluding itself from the election (because, according to this ZKFC, its local NameNode is not reachable per its periodic health check).
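
If you ever want to look at the election state directly, the coordination data lives in ZooKeeper under /hadoop-ha/<nameservice> (the ActiveBreadCrumb znode shows up in your log further below). A rough way to inspect it, assuming the CDH-packaged zookeeper-client and the nameservice1 name taken from your logs:

# Connect to ZooKeeper and inspect the HA election znodes
# (replace the host; nameservice1 is taken from your logs).
zookeeper-client -server zk-host.example.com:2181
ls /hadoop-ha/nameservice1
get /hadoop-ha/nameservice1/ActiveBreadCrumb
# ActiveStandbyElectorLock, if present, is the ephemeral lock znode held by
# the ZKFC of whichever NameNode currently holds the active role.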

At this point the remaining member in the election (i.e. the other NameNode's ZKFC) naturally gets elected:

2016-06-30 10:48:41,313 INFO org.apache.hadoop.ha.HealthMonitor: Entering state SERVICE_HEALTHY
2016-06-30 10:48:41,313 INFO org.apache.hadoop.ha.ZKFailoverController: Local service NameNode at inf-v1.archive.com/10.10.10.82:8022 entered state: SERVICE_HEALTHY
2016-06-30 10:48:42,550 INFO org.apache.hadoop.ha.ActiveStandbyElector: Writing znode /hadoop-ha/nameservice1/ActiveBreadCrumb to indicate that the local node is the most recent active...
2016-06-30 10:48:42,564 INFO org.apache.hadoop.ha.ZKFailoverController: Trying to make NameNode at inf-v1.archive.com/10.10.10.82:8022 active...
2016-06-30 10:48:43,473 INFO org.apache.hadoop.ha.ZKFailoverController: Successfully transitioned NameNode at inf-v1.archive.com/10.10.10.82:8022 to active state

While your case is not exactly a failover but more of a restart, I hope the above still helps in understanding things better.
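
To answer the earlier question about keywords: the messages quoted above are the ones worth searching for in the Failover Controller (ZKFC) role logs. Something along these lines should surface them; the log path is an assumption for a CM-managed cluster:

# Pull state-change and election events out of the ZKFC role logs
# (log path is an assumption; adjust to your setup).
grep -E "SERVICE_HEALTHY|SERVICE_NOT_RESPONDING|Quitting master election|Trying to make|Successfully transitioned" /var/log/hadoop-hdfs/*FAILOVERCONTROLLER*.log.out*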

P.S. Note that from the HDFS architecture point of view, there's no primary NameNode host. Either NameNode can become the active; there is no fixed designation. So each full restart of HDFS may bring up a different NameNode as the active, and it's incorrect to assume that the same NameNode must always be the active one.
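
If you just want to know which NameNode holds the active role at any given moment, haadmin can report it; nn1/nn2 below are again placeholders for your configured NameNode IDs:

# Report the current HA state (active/standby) of each NameNode
# (nn1/nn2 are placeholder IDs from dfs.ha.namenodes.<nameservice>).
hdfs haadmin -getServiceState nn1
hdfs haadmin -getServiceState nn2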

If you're seeing unnecessary failovers caused by NN crashes, CM does report an "unexpected exits" warning on the specific NN role.