
Hadoop NameNodes in HA crash every time after a lot of postponedMisreplicatedBlocks


New Contributor

We have a cluster that is NOT managed by Cloudera Manager (I wish I could change that, but that is a different problem). HDFS has one nameservice in HA. Both NameNodes crash periodically after a long run (80K+) of log lines like this one:

 

 

2018-02-13 03:16:50,843 INFO org.apache.hadoop.hdfs.server.blockmanagement.BlockManager: Rescan of postponedMisreplicatedBlocks completed in 56 msecs. 8749610 blocks are left. 1 blocks are removed.

 

 

Then it logs a thread dump:

 

2018-02-13 03:16:44,732 INFO org.apache.hadoop.hdfs.server.blockmanagement.BlockManager: Rescan of postponedMisreplicatedBlocks completed in 47 msecs. 8749611 blocks are left. 0 blocks are removed.
2018-02-13 03:16:47,787 INFO org.apache.hadoop.hdfs.server.blockmanagement.BlockManager: Rescan of postponedMisreplicatedBlocks completed in 54 msecs. 8749611 blocks are left. 0 blocks are removed.
2018-02-13 03:16:50,843 INFO org.apache.hadoop.hdfs.server.blockmanagement.BlockManager: Rescan of postponedMisreplicatedBlocks completed in 56 msecs. 8749610 blocks are left. 1 blocks are removed.
2018-02-13 03:16:51,512 INFO org.apache.hadoop.http.HttpServer2: Process Thread Dump: jsp requested
232 active threads
Thread 1143 (802274748@qtp-1636050357-2):
  State: RUNNABLE
  Blocked count: 10
  Waited count: 10
  Stack:
    sun.management.ThreadImpl.getThreadInfo1(Native Method)
    sun.management.ThreadImpl.getThreadInfo(ThreadImpl.java:178)
    sun.management.ThreadImpl.getThreadInfo(ThreadImpl.java:139)
    org.apache.hadoop.util.ReflectionUtils.printThreadInfo(ReflectionUtils.java:165)
    org.apache.hadoop.util.ReflectionUtils.logThreadInfo(ReflectionUtils.java:219)
    org.apache.hadoop.http.HttpServer2$StackServlet.doGet(HttpServer2.java:1164)
    javax.servlet.http.HttpServlet.service(HttpServlet.java:707)
    javax.servlet.http.HttpServlet.service(HttpServlet.java:820)
    org.mortbay.jetty.servlet.ServletHolder.handle(ServletHolder.java:511)
    org.mortbay.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1221)
    org.apache.hadoop.security.AuthenticationWithProxyUserFilter.doFilter(AuthenticationWithProxyUserFilter.java:96)
    org.apache.hadoop.security.authentication.server.AuthenticationFilter.doFilter(AuthenticationFilter.java:574)
    org.mortbay.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1212)
    org.apache.hadoop.http.HttpServer2$QuotingInputFilter.doFilter(HttpServer2.java:1296)
    org.mortbay.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1212)
    org.apache.hadoop.http.NoCacheFilter.doFilter(NoCacheFilter.java:45)
    org.mortbay.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1212)
    org.apache.hadoop.http.NoCacheFilter.doFilter(NoCacheFilter.java:45)
    org.mortbay.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1212)
    org.mortbay.jetty.servlet.ServletHandler.handle(ServletHandler.java:399)
Thread 1126 (RMI TCP Connection(56)-127.0.0.1):
  State: RUNNABLE
  Blocked count: 0
  Waited count: 1
  Stack:
    java.net.SocketInputStream.socketRead0(Native Method)
    java.net.SocketInputStream.socketRead(SocketInputStream.java:116)
    java.net.SocketInputStream.read(SocketInputStream.java:170)
    java.net.SocketInputStream.read(SocketInputStream.java:141)
    java.io.BufferedInputStream.fill(BufferedInputStream.java:246)
    java.io.BufferedInputStream.read(BufferedInputStream.java:265)
    java.io.FilterInputStream.read(FilterInputStream.java:83)
    sun.rmi.transport.tcp.TCPTransport.handleMessages(TCPTransport.java:550)
    sun.rmi.transport.tcp.TCPTransport$ConnectionHandler.run0(TCPTransport.java:826)
    sun.rmi.transport.tcp.TCPTransport$ConnectionHandler.lambda$run$0(TCPTransport.java:683)
    sun.rmi.transport.tcp.TCPTransport$ConnectionHandler$$Lambda$10/626277472.run(Unknown Source)
    java.security.AccessController.doPrivileged(Native Method)
    sun.rmi.transport.tcp.TCPTransport$ConnectionHandler.run(TCPTransport.java:682) ...

The same sequence of events happens on both NameNodes in our HA setup, and the nameservice goes down. Can anyone help me understand what is going on?
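To see whether the postponed block count is actually draining before a crash, the rescan lines quoted above could be tallied with a short script. This is only a sketch: the regular expression is written against the message format shown in this post, it assumes each log entry sits on a single line, and the function name parse_rescans is made up for illustration.

```python
import re

# Pattern written against the BlockManager rescan lines quoted in this thread.
RESCAN_RE = re.compile(
    r"^(?P<ts>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2},\d{3}) INFO "
    r"org\.apache\.hadoop\.hdfs\.server\.blockmanagement\.BlockManager: "
    r"Rescan of postponedMisreplicatedBlocks completed in (?P<msecs>\d+) msecs\. "
    r"(?P<left>\d+) blocks are left\. (?P<removed>\d+) blocks are removed\."
)


def parse_rescans(lines):
    """Return (timestamp, msecs, blocks_left, blocks_removed) per rescan line."""
    rows = []
    for line in lines:
        m = RESCAN_RE.match(line.strip())
        if m:
            rows.append(
                (m["ts"], int(m["msecs"]), int(m["left"]), int(m["removed"]))
            )
    return rows


if __name__ == "__main__":
    # Two of the lines from the post, used as sample input.
    prefix = "INFO org.apache.hadoop.hdfs.server.blockmanagement.BlockManager: "
    sample = [
        "2018-02-13 03:16:47,787 " + prefix
        + "Rescan of postponedMisreplicatedBlocks completed in 54 msecs. "
        + "8749611 blocks are left. 0 blocks are removed.",
        "2018-02-13 03:16:50,843 " + prefix
        + "Rescan of postponedMisreplicatedBlocks completed in 56 msecs. "
        + "8749610 blocks are left. 1 blocks are removed.",
    ]
    for ts, msecs, left, removed in parse_rescans(sample):
        print(ts, msecs, left, removed)
```

Feeding it an open log file instead of the sample list gives the full series, which makes it easier to see whether the "blocks are left" figure is flat, growing, or shrinking in the hours before each crash.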

2 REPLIES

Re: Hadoop NameNodes in HA crash every time after a lot of postponedMisreplicatedBlocks

Master Guru
Which CDH version are you running, and could you attach or pastebin the full stack trace dump that the log produces?

I'd also look out for a FATAL message in the logs. A NameNode self-abort should always log one.
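One way to search rolled logs for FATAL entries, including gzip-compressed ones, is a small script along these lines. It is a rough sketch under assumptions: the glob pattern below is a guess at a log location and should be replaced with the real one, the function names are made up, and it only matches lines whose level field is FATAL in the timestamp-then-level layout seen in the excerpts above.

```python
import glob
import gzip
import re

# Matches a log header whose level is FATAL, using the
# "YYYY-MM-DD HH:MM:SS,mmm LEVEL" layout from the log excerpts in this thread.
FATAL_RE = re.compile(r"^\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2},\d{3} FATAL ")


def fatal_lines(lines):
    """Return the lines (newline stripped) that are logged at FATAL level."""
    return [line.rstrip("\n") for line in lines if FATAL_RE.match(line)]


def scan_logs(pattern):
    """Scan every file matching the glob pattern; return (path, line) pairs."""
    hits = []
    for path in sorted(glob.glob(pattern)):
        # Rolled logs are sometimes gzip-compressed; pick the opener by suffix.
        opener = gzip.open if path.endswith(".gz") else open
        with opener(path, "rt", errors="replace") as fh:
            hits.extend((path, line) for line in fatal_lines(fh))
    return hits


if __name__ == "__main__":
    # ASSUMPTION: hypothetical log location; adjust to where the NameNode logs.
    for path, line in scan_logs("/var/log/hadoop-hdfs/*namenode*"):
        print(path + ": " + line)
```

Anchoring the match at the start of the line avoids false hits from the word FATAL appearing inside a message body or a stack frame.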

Re: Hadoop NameNodes in HA crash every time after a lot of postponedMisreplicatedBlocks

New Contributor

It is CDH 5.11.2. I have nearly 2 GB of rolled logs, and there is not a single FATAL message in them. Is there a way to force these messages? The only way I can tell that it has crashed is that service hadoop-hdfs-namenode status reports FAILED and I have to restart the NameNode manually, after which it works as if nothing was wrong.