
Name nodes crash due to journal timeouts

New Contributor

We are experiencing an issue with the following configuration combination:

  • HA HDFS
  • Kerberos
  • TLSv1 (HDFS data transfer protection and RPC protection both set to privacy; see the sketch below)
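
For concreteness, those two protection settings correspond to the standard Hadoop properties shown in this sketch (the split across core-site.xml and hdfs-site.xml is the usual layout):

<!-- core-site.xml -->
<property>
  <name>hadoop.rpc.protection</name>
  <value>privacy</value>
</property>

<!-- hdfs-site.xml -->
<property>
  <name>dfs.data.transfer.protection</name>
  <value>privacy</value>
</property>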

In this situation, the NameNodes eventually shut down due to journal timeouts such as the following (there are numerous examples in our logs):

2016-11-07 11:41:58,556 WARN org.apache.hadoop.hdfs.qjournal.client.QuorumJournalManager: Waited 119125 ms (timeout=120000 ms) for a response for getJournalState(). No responses yet.
2016-11-07 11:41:59,433 FATAL org.apache.hadoop.hdfs.server.namenode.FSEditLog: Error: recoverUnfinalizedSegments failed for required journal (JournalAndStream(mgr=QJM to [10.17.49.75:8485, 10.17.49.76:8485, 10.17.49.77:8485], stream=null))
java.io.IOException: Timed out waiting 120000ms for a quorum of nodes to respond.
	at org.apache.hadoop.hdfs.qjournal.client.AsyncLoggerSet.waitForWriteQuorum(AsyncLoggerSet.java:137)
	at org.apache.hadoop.hdfs.qjournal.client.QuorumJournalManager.createNewUniqueEpoch(QuorumJournalManager.java:183)
	at org.apache.hadoop.hdfs.qjournal.client.QuorumJournalManager.recoverUnfinalizedSegments(QuorumJournalManager.java:441)
	at org.apache.hadoop.hdfs.server.namenode.JournalSet$8.apply(JournalSet.java:624)

Around the same time on the journal nodes, we see:

2016-11-07 11:43:30,370 INFO org.apache.hadoop.ipc.Server: IPC Server handler 0 on 8485 caught an exception
java.lang.NullPointerException
        at com.sun.security.sasl.gsskerb.GssKrb5Base.wrap(GssKrb5Base.java:103)
        at org.apache.hadoop.ipc.Server.wrapWithSasl(Server.java:2436)
        at org.apache.hadoop.ipc.Server.setupResponse(Server.java:2392)
        at org.apache.hadoop.ipc.Server.access$2500(Server.java:134)
        at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2131)
2016-11-07 11:43:30,370 INFO org.apache.hadoop.ipc.Server: IPC Server handler 4 on 8485 caught an exception

I don't know if these are related. We've tried increasing the journal-related timeouts, but that just seems to shift the problem around. We are running 5.8.2 on an 8-node test cluster. The three JournalNode processes run on different machines than the two NameNodes. Any pointers on how to debug this would be appreciated.
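
For context, the journal-related timeouts referred to above are the QJM client settings in hdfs-site.xml. A minimal sketch, assuming the stock dfs.qjournal.* property names (the timeout=120000 ms in the WARN above matches the default for dfs.qjournal.get-journal-state.timeout.ms; the values below are purely illustrative, not recommendations):

<!-- hdfs-site.xml on the NameNodes: QJM client timeouts -->
<property>
  <name>dfs.qjournal.get-journal-state.timeout.ms</name>
  <value>240000</value> <!-- default 120000 -->
</property>
<property>
  <name>dfs.qjournal.new-epoch.timeout.ms</name>
  <value>240000</value> <!-- default 120000 -->
</property>
<property>
  <name>dfs.qjournal.write-txns.timeout.ms</name>
  <value>60000</value> <!-- default 20000 -->
</property>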

1 ACCEPTED SOLUTION

New Contributor
Eventually I discovered two things:

* The NPE in GssKrb5Base seems to be just a sign that the connection was closed. I suspect it is a consequence of the NameNode shutting down without closing the connection.
* The disk the journal edits were stored on was simply too busy at certain times. Rearranging the disk layout resolved the problem: ultimately, I needed to put the YARN ResourceManager on a different disk than the NameNode and JournalNode processes. (One way to confirm that kind of contention is sketched below.)
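
One hedged way to confirm that kind of contention, assuming the sysstat tools are installed on the JournalNode hosts:

# Extended per-device I/O statistics, refreshed every 5 seconds.
# Sustained %util near 100 or a climbing await on the disk holding
# the JournalNode edits directory during the timeout windows points
# at I/O contention rather than a network or Kerberos problem.
iostat -dx 5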


2 REPLIES

New Contributor
This implies that the security context for the SASL wrap was null. I am turning up logging on com.sun.security.sasl to see if it illuminates anything.
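
For anyone wanting to do the same, here is a sketch of one way to raise that logging. It assumes the JDK SASL classes log through java.util.logging under the logger name javax.security.sasl; the file below would be passed to the JournalNode JVMs with -Djava.util.logging.config.file=<path> (adding -Dsun.security.krb5.debug=true prints Kerberos detail to stdout as well):

# jul-sasl.properties (java.util.logging configuration; the logger
# name for the JDK SASL/GSSAPI implementation is an assumption here)
handlers = java.util.logging.ConsoleHandler
java.util.logging.ConsoleHandler.level = FINEST
javax.security.sasl.level = FINEST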
