Created on 02-12-2018 09:53 PM - edited 09-16-2022 05:51 AM
We have a cluster that is NOT managed by Cloudera Manager (I wish I could change that, but that is a different problem). HDFS has one nameservice in HA. Both NameNodes crash periodically after logging a long run (80K+) of messages like this:
2018-02-13 03:16:50,843 INFO org.apache.hadoop.hdfs.server.blockmanagement.BlockManager: Rescan of postponedMisreplicatedBlocks completed in 56 msecs. 8749610 blocks are left. 1 blocks are removed.
Then it throws a thread dump:
2018-02-13 03:16:44,732 INFO org.apache.hadoop.hdfs.server.blockmanagement.BlockManager: Rescan of postponedMisreplicatedBlocks completed in 47 msecs. 8749611 blocks are left. 0 blocks are removed.
2018-02-13 03:16:47,787 INFO org.apache.hadoop.hdfs.server.blockmanagement.BlockManager: Rescan of postponedMisreplicatedBlocks completed in 54 msecs. 8749611 blocks are left. 0 blocks are removed.
2018-02-13 03:16:50,843 INFO org.apache.hadoop.hdfs.server.blockmanagement.BlockManager: Rescan of postponedMisreplicatedBlocks completed in 56 msecs. 8749610 blocks are left. 1 blocks are removed.
2018-02-13 03:16:51,512 INFO org.apache.hadoop.http.HttpServer2: Process Thread Dump: jsp requested
232 active threads
Thread 1143 (802274748@qtp-1636050357-2):
  State: RUNNABLE
  Blocked count: 10
  Waited count: 10
  Stack:
    sun.management.ThreadImpl.getThreadInfo1(Native Method)
    sun.management.ThreadImpl.getThreadInfo(ThreadImpl.java:178)
    sun.management.ThreadImpl.getThreadInfo(ThreadImpl.java:139)
    org.apache.hadoop.util.ReflectionUtils.printThreadInfo(ReflectionUtils.java:165)
    org.apache.hadoop.util.ReflectionUtils.logThreadInfo(ReflectionUtils.java:219)
    org.apache.hadoop.http.HttpServer2$StackServlet.doGet(HttpServer2.java:1164)
    javax.servlet.http.HttpServlet.service(HttpServlet.java:707)
    javax.servlet.http.HttpServlet.service(HttpServlet.java:820)
    org.mortbay.jetty.servlet.ServletHolder.handle(ServletHolder.java:511)
    org.mortbay.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1221)
    org.apache.hadoop.security.AuthenticationWithProxyUserFilter.doFilter(AuthenticationWithProxyUserFilter.java:96)
    org.apache.hadoop.security.authentication.server.AuthenticationFilter.doFilter(AuthenticationFilter.java:574)
    org.mortbay.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1212)
    org.apache.hadoop.http.HttpServer2$QuotingInputFilter.doFilter(HttpServer2.java:1296)
    org.mortbay.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1212)
    org.apache.hadoop.http.NoCacheFilter.doFilter(NoCacheFilter.java:45)
    org.mortbay.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1212)
    org.apache.hadoop.http.NoCacheFilter.doFilter(NoCacheFilter.java:45)
    org.mortbay.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1212)
    org.mortbay.jetty.servlet.ServletHandler.handle(ServletHandler.java:399)
Thread 1126 (RMI TCP Connection(56)-127.0.0.1):
  State: RUNNABLE
  Blocked count: 0
  Waited count: 1
  Stack:
    java.net.SocketInputStream.socketRead0(Native Method)
    java.net.SocketInputStream.socketRead(SocketInputStream.java:116)
    java.net.SocketInputStream.read(SocketInputStream.java:170)
    java.net.SocketInputStream.read(SocketInputStream.java:141)
    java.io.BufferedInputStream.fill(BufferedInputStream.java:246)
    java.io.BufferedInputStream.read(BufferedInputStream.java:265)
    java.io.FilterInputStream.read(FilterInputStream.java:83)
    sun.rmi.transport.tcp.TCPTransport.handleMessages(TCPTransport.java:550)
    sun.rmi.transport.tcp.TCPTransport$ConnectionHandler.run0(TCPTransport.java:826)
    sun.rmi.transport.tcp.TCPTransport$ConnectionHandler.lambda$run$0(TCPTransport.java:683)
    sun.rmi.transport.tcp.TCPTransport$ConnectionHandler$$Lambda$10/626277472.run(Unknown Source)
    java.security.AccessController.doPrivileged(Native Method)
    sun.rmi.transport.tcp.TCPTransport$ConnectionHandler.run(TCPTransport.java:682)
...
The same sequence of events happens on both NameNodes in our HA setup, and the nameservice goes down. Can anyone help me understand what is going on?
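In case it helps anyone look at the same pattern, here is a rough Python sketch for pulling the numbers out of the rescan lines quoted above, so the "blocks are left" count can be tracked over time. The regex is written against the log format shown in this post only, and the function name parse_rescans is just something made up for illustration; it is not part of Hadoop.

```python
import re

# Matches the BlockManager rescan lines as they appear in the log excerpt above.
RESCAN_RE = re.compile(
    r"^(?P<ts>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2},\d{3}) INFO "
    r"org\.apache\.hadoop\.hdfs\.server\.blockmanagement\.BlockManager: "
    r"Rescan of postponedMisreplicatedBlocks completed in (?P<msecs>\d+) msecs\. "
    r"(?P<left>\d+) blocks are left\. (?P<removed>\d+) blocks are removed\."
)


def parse_rescans(lines):
    """Return (timestamp, msecs, blocks_left, blocks_removed) for each rescan line."""
    results = []
    for line in lines:
        match = RESCAN_RE.match(line.strip())
        if match:
            results.append(
                (
                    match.group("ts"),
                    int(match.group("msecs")),
                    int(match.group("left")),
                    int(match.group("removed")),
                )
            )
    return results
```

Feeding it the lines of a NameNode log (for example, by iterating over an open file) gives a list that can be checked to see whether the postponed count is shrinking, flat, or growing before each failure.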
Created 02-12-2018 10:13 PM
Created 02-12-2018 11:04 PM
It is CDH 5.11.2. I have nearly 2 GB of rolled-up logs and there is not a single FATAL message in them. Is there a way to force these messages? The only way I can tell that the NameNode has crashed is that service hadoop-hdfs-namenode status reports FAILED and I have to restart the NameNode manually, after which it works as if nothing was wrong.
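With that much log volume, one way to double-check the "no FATAL" observation is to tally entries by level and note the last timestamped entry in each file. The following is only a sketch that assumes the timestamp-then-level layout seen in the excerpts in this thread; summarize_levels is a hypothetical helper name, not a Hadoop tool.

```python
import re
from collections import Counter

# Assumes each log entry starts with "YYYY-MM-DD HH:MM:SS,mmm LEVEL ",
# as in the NameNode excerpts quoted in this thread.
LEVEL_RE = re.compile(
    r"^\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2},\d{3} "
    r"(TRACE|DEBUG|INFO|WARN|ERROR|FATAL) "
)


def summarize_levels(lines):
    """Count entries per log level and return the last timestamped entry seen."""
    counts = Counter()
    last_entry = None
    for line in lines:
        match = LEVEL_RE.match(line)
        if match:
            counts[match.group(1)] += 1
            last_entry = line.rstrip("\n")
    return counts, last_entry
```

Running it over each rolled log file in turn shows whether any WARN, ERROR, or FATAL entries exist at all, and what the final entry was before the file ends.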