Hadoop NameNodes in HA crash every time after a lot of postponedMisreplicatedBlocks


We have a cluster NOT managed by Cloudera Manager (I wish I could change that, but that is a different problem). HDFS has one nameservice in HA. Both NameNodes crash periodically after a long run (80K+) of log messages like this one:

2018-02-13 03:16:50,843 INFO org.apache.hadoop.hdfs.server.blockmanagement.BlockManager: Rescan of postponedMisreplicatedBlocks completed in 56 msecs. 8749610 blocks are left. 1 blocks are removed.

Then it throws a thread dump:

 

2018-02-13 03:16:44,732 INFO org.apache.hadoop.hdfs.server.blockmanagement.BlockManager: Rescan of postponedMisreplicatedBlocks completed in 47 msecs. 8749611 blocks are left. 0 blocks are removed.
2018-02-13 03:16:47,787 INFO org.apache.hadoop.hdfs.server.blockmanagement.BlockManager: Rescan of postponedMisreplicatedBlocks completed in 54 msecs. 8749611 blocks are left. 0 blocks are removed.
2018-02-13 03:16:50,843 INFO org.apache.hadoop.hdfs.server.blockmanagement.BlockManager: Rescan of postponedMisreplicatedBlocks completed in 56 msecs. 8749610 blocks are left. 1 blocks are removed.
2018-02-13 03:16:51,512 INFO org.apache.hadoop.http.HttpServer2: Process Thread Dump: jsp requested
232 active threads
Thread 1143 (802274748@qtp-1636050357-2):
  State: RUNNABLE
  Blocked count: 10
  Waited count: 10
  Stack:
    sun.management.ThreadImpl.getThreadInfo1(Native Method)
    sun.management.ThreadImpl.getThreadInfo(ThreadImpl.java:178)
    sun.management.ThreadImpl.getThreadInfo(ThreadImpl.java:139)
    org.apache.hadoop.util.ReflectionUtils.printThreadInfo(ReflectionUtils.java:165)
    org.apache.hadoop.util.ReflectionUtils.logThreadInfo(ReflectionUtils.java:219)
    org.apache.hadoop.http.HttpServer2$StackServlet.doGet(HttpServer2.java:1164)
    javax.servlet.http.HttpServlet.service(HttpServlet.java:707)
    javax.servlet.http.HttpServlet.service(HttpServlet.java:820)
    org.mortbay.jetty.servlet.ServletHolder.handle(ServletHolder.java:511)
    org.mortbay.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1221)
    org.apache.hadoop.security.AuthenticationWithProxyUserFilter.doFilter(AuthenticationWithProxyUserFilter.java:96)
    org.apache.hadoop.security.authentication.server.AuthenticationFilter.doFilter(AuthenticationFilter.java:574)
    org.mortbay.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1212)
    org.apache.hadoop.http.HttpServer2$QuotingInputFilter.doFilter(HttpServer2.java:1296)
    org.mortbay.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1212)
    org.apache.hadoop.http.NoCacheFilter.doFilter(NoCacheFilter.java:45)
    org.mortbay.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1212)
    org.apache.hadoop.http.NoCacheFilter.doFilter(NoCacheFilter.java:45)
    org.mortbay.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1212)
    org.mortbay.jetty.servlet.ServletHandler.handle(ServletHandler.java:399)
Thread 1126 (RMI TCP Connection(56)-127.0.0.1):
  State: RUNNABLE
  Blocked count: 0
  Waited count: 1
  Stack:
    java.net.SocketInputStream.socketRead0(Native Method)
    java.net.SocketInputStream.socketRead(SocketInputStream.java:116)
    java.net.SocketInputStream.read(SocketInputStream.java:170)
    java.net.SocketInputStream.read(SocketInputStream.java:141)
    java.io.BufferedInputStream.fill(BufferedInputStream.java:246)
    java.io.BufferedInputStream.read(BufferedInputStream.java:265)
    java.io.FilterInputStream.read(FilterInputStream.java:83)
    sun.rmi.transport.tcp.TCPTransport.handleMessages(TCPTransport.java:550)
    sun.rmi.transport.tcp.TCPTransport$ConnectionHandler.run0(TCPTransport.java:826)
    sun.rmi.transport.tcp.TCPTransport$ConnectionHandler.lambda$run$0(TCPTransport.java:683)
    sun.rmi.transport.tcp.TCPTransport$ConnectionHandler$$Lambda$10/626277472.run(Unknown Source)
    java.security.AccessController.doPrivileged(Native Method)
    sun.rmi.transport.tcp.TCPTransport$ConnectionHandler.run(TCPTransport.java:682) ...

The same sequence of events happens on both NameNodes in our HA setup, and the nameservice goes down. Can anyone help me understand what is going on?
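For what it's worth, here is a minimal log-parsing sketch (Python) to check whether the postponedMisreplicatedBlocks backlog is actually draining between rescans. It only assumes the log format quoted above; the script itself and its invocation are made up for illustration:

import re
import sys

# Matches the BlockManager rescan lines quoted above, e.g.
# "... Rescan of postponedMisreplicatedBlocks completed in 56 msecs.
#  8749610 blocks are left. 1 blocks are removed."
PATTERN = re.compile(
    r"^(?P<ts>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2},\d{3}).*"
    r"Rescan of postponedMisreplicatedBlocks completed in (?P<ms>\d+) msecs\. "
    r"(?P<left>\d+) blocks are left\. (?P<removed>\d+) blocks are removed\."
)

def scan(path):
    # Print one row per rescan so the trend in the backlog is easy to see.
    with open(path, errors="replace") as f:
        for line in f:
            m = PATTERN.match(line)
            if m:
                print(f"{m['ts']}  left={m['left']:>10}  removed={m['removed']:>6}  took={m['ms']} ms")

if __name__ == "__main__":
    for log_file in sys.argv[1:]:   # pass one or more NameNode log files
        scan(log_file)

Going by the excerpt above, the backlog barely moves (8.7M blocks left, 0-1 removed per rescan), so a trend like this over a longer window would at least show whether it ever drains, and how it lines up with the crash times.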

2 REPLIES

Mentor
What CDH version is this, and could you attach/pastebin the full stack-trace dump that the log produces?

I'd also look out for a FATAL message in the logs. A self-abort in the NameNode should always carry one.


It is CDH 5.11.2. I have nearly 2 GB of rolled-up logs and not a single FATAL message in them. Is there a way to force these messages? The way I know it has crashed is that 'service hadoop-hdfs-namenode status' reports FAILED and I have to restart the NameNode manually, after which it works as if nothing was wrong.
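Since there is no FATAL line to find, one way to narrow things down is to scan the rolled-up logs for common shutdown markers and look at the last line each incarnation of the process managed to write. A minimal sketch in Python; the marker list and the file path in the example invocation are assumptions about typical Hadoop logging, not something taken from this cluster:

import glob
import os
import sys
from collections import deque

# Markers that typically show up when a NameNode stops on purpose.
# This list is an assumption, not confirmed from these logs.
MARKERS = ("FATAL", "SHUTDOWN_MSG", "Exiting with status", "OutOfMemoryError")

def scan(path, tail_lines=20):
    # Collect every marker hit plus the last few lines the process wrote.
    tail = deque(maxlen=tail_lines)
    hits = []
    with open(path, errors="replace") as f:
        for lineno, line in enumerate(f, 1):
            line = line.rstrip("\n")
            tail.append(line)
            if any(marker in line for marker in MARKERS):
                hits.append((lineno, line))
    return hits, list(tail)

if __name__ == "__main__":
    # Hypothetical invocation:
    #   python scan_nn_logs.py "/var/log/hadoop-hdfs/hadoop-hdfs-namenode-*.log*"
    for pattern in sys.argv[1:]:
        for path in sorted(glob.glob(pattern), key=os.path.getmtime):
            hits, tail = scan(path)
            print(f"== {path}: {len(hits)} marker line(s)")
            for lineno, line in hits:
                print(f"   line {lineno}: {line}")
            if tail:
                print(f"   last line written: {tail[-1]}")

If the last line before a restart just stops mid-stream with no shutdown marker at all, the JVM most likely died abruptly (for example, killed by the OS), which would also explain why there is no FATAL message to be found.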