Eventually I discovered two things:
* The NPE in GssKrb5Base seems to be just a sign that the connection was closed. I suspect it is a consequence of the NN shutting down without closing a connection.
* The disk holding the journals was simply too busy at certain times. Adjusting the disk storage resolved the problem: ultimately, I needed to have the YARN resource manager on a different disk than the name node and journal processes.
We are experiencing an issue with the following configuration combination:
TLS1 (hdfs data node protection and rpc protection set to privacy)
In this situation, the name nodes eventually shut down due to journal timeouts such as the following (there are numerous examples in our logs):
2016-11-07 11:41:58,556 WARN org.apache.hadoop.hdfs.qjournal.client.QuorumJournalManager: Waited 119125 ms (timeout=120000 ms) for a response for getJournalState(). No responses yet.
2016-11-07 11:41:59,433 FATAL org.apache.hadoop.hdfs.server.namenode.FSEditLog: Error: recoverUnfinalizedSegments failed for required journal (JournalAndStream(mgr=QJM to [10.17.49.75:8485, 10.17.49.76:8485, 10.17.49.77:8485], stream=null))
java.io.IOException: Timed out waiting 120000ms for a quorum of nodes to respond.
Around the same time on the journal nodes, we see:
2016-11-07 11:43:30,370 INFO org.apache.hadoop.ipc.Server: IPC Server handler 0 on 8485 caught an exception
2016-11-07 11:43:30,370 INFO org.apache.hadoop.ipc.Server: IPC Server handler 4 on 8485 caught an exception
I don't know if these are related. We've tried increasing the journal-related timeouts, but that just seems to shift the problem around. We are running 5.8.2 on an 8-node test cluster. The 3 journal processes are running on different machines than the 2 name nodes. Any pointers on how to debug this would be appreciated.
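One way to check whether the timeouts cluster at particular times (for example, to compare them against disk activity on the journal hosts) is to pull the wait durations out of the name node log. The following is only a minimal sketch, assuming the log lines look like the QuorumJournalManager warning quoted above; the function names `parse_journal_waits` and `waits_per_minute` are made up for illustration and are not part of Hadoop.

```python
import re
from datetime import datetime

# Assumed format: the QuorumJournalManager "Waited N ms (timeout=M ms)" warning
# exactly as quoted in this thread. Other Hadoop versions may log differently.
WAIT_RE = re.compile(
    r"^(?P<ts>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}),\d{3} WARN "
    r"\S*QuorumJournalManager: Waited (?P<waited>\d+) ms "
    r"\(timeout=(?P<timeout>\d+) ms\) for a response for (?P<call>\w+)"
)


def parse_journal_waits(lines):
    """Return (timestamp, call, waited_ms, timeout_ms) for each matching log line."""
    waits = []
    for line in lines:
        m = WAIT_RE.match(line)
        if m:
            ts = datetime.strptime(m.group("ts"), "%Y-%m-%d %H:%M:%S")
            waits.append(
                (ts, m.group("call"), int(m.group("waited")), int(m.group("timeout")))
            )
    return waits


def waits_per_minute(waits):
    """Count warnings per minute so that busy periods stand out."""
    counts = {}
    for ts, _call, _waited, _timeout in waits:
        minute = ts.replace(second=0)
        counts[minute] = counts.get(minute, 0) + 1
    return counts
```

Passing the lines of a name node log file to `parse_journal_waits` and then to `waits_per_minute` gives a per-minute count of slow journal calls, which can be lined up by timestamp against whatever disk or I/O metrics are collected on the journal machines.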