Created on 11-07-2016 09:32 AM - last edited on 11-07-2016 01:04 PM by cjervis
We are experiencing an issue with the following configuration combination:
In this situation, the NameNodes eventually shut down due to journal timeouts such as the following (there are numerous examples in our logs):
2016-11-07 11:41:58,556 WARN org.apache.hadoop.hdfs.qjournal.client.QuorumJournalManager: Waited 119125 ms (timeout=120000 ms) for a response for getJournalState(). No responses yet.
2016-11-07 11:41:59,433 FATAL org.apache.hadoop.hdfs.server.namenode.FSEditLog: Error: recoverUnfinalizedSegments failed for required journal (JournalAndStream(mgr=QJM to [10.17.49.75:8485, 10.17.49.76:8485, 10.17.49.77:8485], stream=null))
java.io.IOException: Timed out waiting 120000ms for a quorum of nodes to respond.
    at org.apache.hadoop.hdfs.qjournal.client.AsyncLoggerSet.waitForWriteQuorum(AsyncLoggerSet.java:137)
    at org.apache.hadoop.hdfs.qjournal.client.QuorumJournalManager.createNewUniqueEpoch(QuorumJournalManager.java:183)
    at org.apache.hadoop.hdfs.qjournal.client.QuorumJournalManager.recoverUnfinalizedSegments(QuorumJournalManager.java:441)
    at org.apache.hadoop.hdfs.server.namenode.JournalSet$8.apply(JournalSet.java:624)
Around the same time on the journal nodes, we see:
2016-11-07 11:43:30,370 INFO org.apache.hadoop.ipc.Server: IPC Server handler 0 on 8485 caught an exception
java.lang.NullPointerException
    at com.sun.security.sasl.gsskerb.GssKrb5Base.wrap(GssKrb5Base.java:103)
    at org.apache.hadoop.ipc.Server.wrapWithSasl(Server.java:2436)
    at org.apache.hadoop.ipc.Server.setupResponse(Server.java:2392)
    at org.apache.hadoop.ipc.Server.access$2500(Server.java:134)
    at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2131)
2016-11-07 11:43:30,370 INFO org.apache.hadoop.ipc.Server: IPC Server handler 4 on 8485 caught an exception
I don't know whether these are related. We've tried increasing the journal-related timeouts, but that just seems to shift the problem around. We are running 5.8.2 on an 8-node test cluster. The three JournalNode processes run on different machines from the two NameNodes. Any pointers on how to debug this would be appreciated.
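One way to check whether the two symptoms line up in time is to pull the timestamps out of both logs and pair them. Below is a minimal Python sketch of that idea, not a definitive tool. It assumes the logs have one entry per line in the format quoted above; the function names (qjm_waits, ipc_exceptions, nearby) and the 300-second default window are made up for illustration.

```python
import re
from datetime import datetime

# Matches the QuorumJournalManager wait warning quoted in the post.
WAIT_RE = re.compile(
    r"(?P<ts>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}),\d{3} WARN .*?"
    r"Waited (?P<waited>\d+) ms \(timeout=(?P<timeout>\d+) ms\) "
    r"for a response for (?P<call>\w+)\(\)"
)

# Matches the JournalNode IPC handler exception line quoted in the post.
IPC_RE = re.compile(
    r"(?P<ts>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}),\d{3} INFO "
    r"org\.apache\.hadoop\.ipc\.Server: "
    r"IPC Server handler \d+ on \d+ caught an exception"
)


def parse_ts(text):
    """Parse the second-resolution part of a log timestamp."""
    return datetime.strptime(text, "%Y-%m-%d %H:%M:%S")


def qjm_waits(lines):
    """Return (timestamp, call, waited_ms, timeout_ms) for each wait warning."""
    found = []
    for line in lines:
        m = WAIT_RE.search(line)
        if m:
            found.append((parse_ts(m.group("ts")), m.group("call"),
                          int(m.group("waited")), int(m.group("timeout"))))
    return found


def ipc_exceptions(lines):
    """Return the timestamp of each IPC handler exception line."""
    found = []
    for line in lines:
        m = IPC_RE.search(line)
        if m:
            found.append(parse_ts(m.group("ts")))
    return found


def nearby(waits, exceptions, window_s=300):
    """Pair each wait warning with the exceptions within window_s seconds of it."""
    pairs = []
    for wait in waits:
        close = [e for e in exceptions
                 if abs((e - wait[0]).total_seconds()) <= window_s]
        pairs.append((wait, close))
    return pairs
```

To use it, read the NameNode log lines into qjm_waits and the JournalNode log lines into ipc_exceptions, then pass both results to nearby. Many warnings with no exception close by would suggest the two are unrelated; consistent pairing would only be a hint worth following up, not proof of cause.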
Created 11-15-2016 07:06 AM
Created 11-07-2016 12:49 PM