<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>Re: Name nodes crash due to journal timeouts in Archives of Support Questions (Read Only)</title>
    <link>https://community.cloudera.com/t5/Archives-of-Support-Questions/Name-nodes-crash-due-to-journal-timeouts/m-p/47386#M45463</link>
    <description>Eventually I discovered two things:&lt;BR /&gt;&lt;BR /&gt;* The NPE in GssKrb5Base seems to be just a sign that the connection was closed. I suspect it is a consequence of the NN shutting down without closing the connection.&lt;BR /&gt;* Essentially, the disk the journals were on was simply too busy at certain times. Rearranging the disk storage resolved the problem. Ultimately, I needed to put the YARN resource manager on a different disk from the name node and journal processes.</description>
    <pubDate>Tue, 15 Nov 2016 15:06:07 GMT</pubDate>
    <dc:creator>martinserrano</dc:creator>
    <dc:date>2016-11-15T15:06:07Z</dc:date>
    <item>
      <title>Name nodes crash due to journal timeouts</title>
      <link>https://community.cloudera.com/t5/Archives-of-Support-Questions/Name-nodes-crash-due-to-journal-timeouts/m-p/47107#M45461</link>
      <description>&lt;P&gt;We are experiencing an issue with the following configuration combination:&lt;/P&gt;
&lt;UL&gt;
&lt;LI&gt;HA HDFS&lt;/LI&gt;
&lt;LI&gt;Kerberos&lt;/LI&gt;
&lt;LI&gt;TLS (HDFS DataNode data transfer protection and RPC protection set to privacy)&lt;/LI&gt;
&lt;/UL&gt;
&lt;P&gt;In this situation, the name nodes eventually shut down due to journal timeouts such as the following (there are numerous examples in our logs):&lt;/P&gt;
&lt;PRE&gt;2016-11-07 11:41:58,556 WARN org.apache.hadoop.hdfs.qjournal.client.QuorumJournalManager: Waited 119125 ms (timeout=120000 ms) for a response for getJournalState(). No responses yet.
2016-11-07 11:41:59,433 FATAL org.apache.hadoop.hdfs.server.namenode.FSEditLog: Error: recoverUnfinalizedSegments failed for required journal (JournalAndStream(mgr=QJM to [10.17.49.75:8485, 10.17.49.76:8485, 10.17.49.77:8485], stream=null))
java.io.IOException: Timed out waiting 120000ms for a quorum of nodes to respond.
	at org.apache.hadoop.hdfs.qjournal.client.AsyncLoggerSet.waitForWriteQuorum(AsyncLoggerSet.java:137)
	at org.apache.hadoop.hdfs.qjournal.client.QuorumJournalManager.createNewUniqueEpoch(QuorumJournalManager.java:183)
	at org.apache.hadoop.hdfs.qjournal.client.QuorumJournalManager.recoverUnfinalizedSegments(QuorumJournalManager.java:441)
	at org.apache.hadoop.hdfs.server.namenode.JournalSet$8.apply(JournalSet.java:624)&lt;/PRE&gt;
&lt;P&gt;Around the same time on the journal nodes, we see:&lt;/P&gt;
&lt;PRE&gt;2016-11-07 11:43:30,370 INFO org.apache.hadoop.ipc.Server: IPC Server handler 0 on 8485 caught an exception
java.lang.NullPointerException
        at com.sun.security.sasl.gsskerb.GssKrb5Base.wrap(GssKrb5Base.java:103)
        at org.apache.hadoop.ipc.Server.wrapWithSasl(Server.java:2436)
        at org.apache.hadoop.ipc.Server.setupResponse(Server.java:2392)
        at org.apache.hadoop.ipc.Server.access$2500(Server.java:134)
        at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2131)
2016-11-07 11:43:30,370 INFO org.apache.hadoop.ipc.Server: IPC Server handler 4 on 8485 caught an exception&lt;/PRE&gt;
&lt;P&gt;I don't know if these are related. We've tried increasing the journal-related timeouts, but that just seems to shift the problem around. We are running CDH 5.8.2 on an 8-node test cluster. The three journal processes run on different machines from the two name nodes. Any pointers on how to debug this would be appreciated.&lt;/P&gt;
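&lt;P&gt;For anyone hitting the same thing, the journal timeouts in question are the QJM client settings in hdfs-site.xml; the list below is just a sketch with illustrative values (in milliseconds), not a recommendation:&lt;/P&gt;
&lt;PRE&gt;dfs.qjournal.start-segment.timeout.ms        = 120000
dfs.qjournal.select-input-streams.timeout.ms = 120000
dfs.qjournal.get-journal-state.timeout.ms    = 120000
dfs.qjournal.new-epoch.timeout.ms            = 120000
dfs.qjournal.prepare-recovery.timeout.ms     = 120000
dfs.qjournal.accept-recovery.timeout.ms      = 120000
dfs.qjournal.finalize-segment.timeout.ms     = 120000
dfs.qjournal.write-txns.timeout.ms           = 120000&lt;/PRE&gt;</description>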
      <pubDate>Mon, 07 Nov 2016 21:04:46 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Archives-of-Support-Questions/Name-nodes-crash-due-to-journal-timeouts/m-p/47107#M45461</guid>
      <dc:creator>martinserrano</dc:creator>
      <dc:date>2016-11-07T21:04:46Z</dc:date>
    </item>
    <item>
      <title>Re: Name nodes crash due to journal timeouts</title>
      <link>https://community.cloudera.com/t5/Archives-of-Support-Questions/Name-nodes-crash-due-to-journal-timeouts/m-p/47117#M45462</link>
      <description>From the stack trace, this implies that the security context for the SASL setup is null. I am turning up logging on com.sun.security.sasl to see if it illuminates anything.
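&lt;BR /&gt;&lt;BR /&gt;In case it helps anyone else, the JDK SASL classes log through java.util.logging under the "javax.security.sasl" logger, so my plan is roughly the following (file paths are just examples):&lt;BR /&gt;
&lt;PRE&gt;# logging.properties (example)
javax.security.sasl.level = FINEST
handlers = java.util.logging.ConsoleHandler
java.util.logging.ConsoleHandler.level = FINEST

# then point the journal node / name node JVMs at it, e.g.:
-Djava.util.logging.config.file=/path/to/logging.properties&lt;/PRE&gt;</description>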
      <pubDate>Mon, 07 Nov 2016 20:49:15 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Archives-of-Support-Questions/Name-nodes-crash-due-to-journal-timeouts/m-p/47117#M45462</guid>
      <dc:creator>martinserrano</dc:creator>
      <dc:date>2016-11-07T20:49:15Z</dc:date>
    </item>
    <item>
      <title>Re: Name nodes crash due to journal timeouts</title>
      <link>https://community.cloudera.com/t5/Archives-of-Support-Questions/Name-nodes-crash-due-to-journal-timeouts/m-p/47386#M45463</link>
      <description>Eventually I discovered two things:&lt;BR /&gt;&lt;BR /&gt;* The NPE in GssKrb5Base seems to be just a sign that the connection was closed. I suspect it is a consequence of the NN shutting down without closing the connection.&lt;BR /&gt;* Essentially, the disk the journals were on was simply too busy at certain times. Rearranging the disk storage resolved the problem. Ultimately, I needed to put the YARN resource manager on a different disk from the name node and journal processes.
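&lt;BR /&gt;&lt;BR /&gt;Roughly, the kind of separation I mean looks like the sketch below. The property names are the standard HDFS/YARN ones, the mount points are just examples, and which YARN directories matter will depend on which roles share the host; the point is that the NameNode metadata and JournalNode edits sit on a device that nothing YARN-related writes to:&lt;BR /&gt;
&lt;PRE&gt;# hdfs-site.xml (dedicated disk for NN metadata and JN edits; paths are examples)
dfs.namenode.name.dir     = /data/0/dfs/nn
dfs.journalnode.edits.dir = /data/0/dfs/jn

# yarn-site.xml (YARN scratch and log dirs kept on a different device; paths are examples)
yarn.nodemanager.local-dirs = /data/1/yarn/local
yarn.nodemanager.log-dirs   = /data/1/yarn/logs&lt;/PRE&gt;</description>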
      <pubDate>Tue, 15 Nov 2016 15:06:07 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Archives-of-Support-Questions/Name-nodes-crash-due-to-journal-timeouts/m-p/47386#M45463</guid>
      <dc:creator>martinserrano</dc:creator>
      <dc:date>2016-11-15T15:06:07Z</dc:date>
    </item>
  </channel>
</rss>

