Member since: 11-07-2016
Posts: 3
Kudos Received: 0
Solutions: 1
My Accepted Solutions
| Title | Views | Posted |
|---|---|---|
| | 7177 | 11-15-2016 07:06 AM |
11-15-2016 07:06 AM
Eventually I discovered two things:
* The NPE in GssKrb5Base seems to be just a sign that the connection was closed; I suspect it is a consequence of the NameNode shutting down without closing the connection.
* The disk the journals were on was simply too busy at certain times. Rearranging the disk storage resolved the problem: ultimately, I needed to have the YARN ResourceManager on a different disk than the NameNode and journal processes.
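For anyone hitting the same thing, the directories involved are the standard NameNode and JournalNode ones; a minimal hdfs-site.xml sketch of the idea, where /data/quiet-disk is a purely hypothetical mount point for a disk that YARN is not also writing to:

```xml
<!-- hdfs-site.xml on the NameNode / JournalNode hosts -->
<!-- /data/quiet-disk is a hypothetical mount point; the point is only that
     these directories live on a disk that YARN is not also writing to -->
<property>
  <name>dfs.namenode.name.dir</name>
  <value>/data/quiet-disk/dfs/nn</value>
</property>
<property>
  <name>dfs.journalnode.edits.dir</name>
  <value>/data/quiet-disk/dfs/jn</value>
</property>
```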
11-07-2016 12:49 PM
I see that this implies that the security context for the SASL setup is null. I am turning up logging on com.sun.security.sasl to see if it illuminates anything.
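For reference, a minimal sketch of how that logging can be turned up, assuming the JDK SASL provider (com.sun.security.sasl.*) logs through java.util.logging under the javax.security.sasl logger:

```properties
# jul-debug.properties (file name is arbitrary)
handlers = java.util.logging.ConsoleHandler
java.util.logging.ConsoleHandler.level = FINEST
# the JDK SASL/GSSAPI mechanism classes log under this logger
javax.security.sasl.level = FINEST
```

The file is then passed to the JournalNode JVM with -Djava.util.logging.config.file=/path/to/jul-debug.properties in its Java options.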
11-07-2016 09:32 AM
We are experiencing an issue with the following configuration combination:
HA HDFS
Kerberos
TLS (HDFS data transfer protection and RPC protection set to privacy; see the config sketch below)
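For completeness, that combination corresponds to settings along these lines (standard property names; the snippet is illustrative, not a verbatim copy of our configs):

```xml
<!-- core-site.xml -->
<property>
  <name>hadoop.rpc.protection</name>
  <value>privacy</value> <!-- SASL auth-conf on Hadoop RPC -->
</property>

<!-- hdfs-site.xml -->
<property>
  <name>dfs.data.transfer.protection</name>
  <value>privacy</value> <!-- SASL protection on the DataNode data transfer protocol -->
</property>
```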
In this situation, the name nodes eventually shut down due to journal timeouts such as the following (there are numerous examples in our logs):
2016-11-07 11:41:58,556 WARN org.apache.hadoop.hdfs.qjournal.client.QuorumJournalManager: Waited 119125 ms (timeout=120000 ms) for a response for getJournalState(). No responses yet.
2016-11-07 11:41:59,433 FATAL org.apache.hadoop.hdfs.server.namenode.FSEditLog: Error: recoverUnfinalizedSegments failed for required journal (JournalAndStream(mgr=QJM to [10.17.49.75:8485, 10.17.49.76:8485, 10.17.49.77:8485], stream=null))
java.io.IOException: Timed out waiting 120000ms for a quorum of nodes to respond.
at org.apache.hadoop.hdfs.qjournal.client.AsyncLoggerSet.waitForWriteQuorum(AsyncLoggerSet.java:137)
at org.apache.hadoop.hdfs.qjournal.client.QuorumJournalManager.createNewUniqueEpoch(QuorumJournalManager.java:183)
at org.apache.hadoop.hdfs.qjournal.client.QuorumJournalManager.recoverUnfinalizedSegments(QuorumJournalManager.java:441)
at org.apache.hadoop.hdfs.server.namenode.JournalSet$8.apply(JournalSet.java:624)
Around the same time on the journal nodes, we see:
2016-11-07 11:43:30,370 INFO org.apache.hadoop.ipc.Server: IPC Server handler 0 on 8485 caught an exception
java.lang.NullPointerException
at com.sun.security.sasl.gsskerb.GssKrb5Base.wrap(GssKrb5Base.java:103)
at org.apache.hadoop.ipc.Server.wrapWithSasl(Server.java:2436)
at org.apache.hadoop.ipc.Server.setupResponse(Server.java:2392)
at org.apache.hadoop.ipc.Server.access$2500(Server.java:134)
at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2131)
2016-11-07 11:43:30,370 INFO org.apache.hadoop.ipc.Server: IPC Server handler 4 on 8485 caught an exception
I don't know if these are related. We've tried increasing the journal-related timeouts (see the sketch at the end of this post), but that just seems to shift the problem around. We are running 5.8.2 on an 8-node test cluster. The three journal processes run on different machines than the two name nodes. Any pointers on how to debug this would be appreciated.
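For reference, these are the kinds of QJM client timeouts we have been raising; the property names are the standard dfs.qjournal ones, and the values shown are only illustrative, not our actual settings:

```xml
<!-- hdfs-site.xml on the NameNodes (QJM client side) -->
<property>
  <name>dfs.qjournal.get-journal-state.timeout.ms</name>
  <value>120000</value> <!-- the timeout reported in the getJournalState() warning above -->
</property>
<property>
  <name>dfs.qjournal.write-txns.timeout.ms</name>
  <value>20000</value>
</property>
```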