Created on 11-07-2016 09:32 AM - last edited on 11-07-2016 01:04 PM by cjervis
We are experiencing an issue with the following configuration combination:
In this situation, the name nodes are eventually shutting down due to journal timeouts such as (there are numerous examples in our logs):
2016-11-07 11:41:58,556 WARN org.apache.hadoop.hdfs.qjournal.client.QuorumJournalManager: Waited 119125 ms (timeout=120000 ms) for a response for getJournalState(). No responses yet.
2016-11-07 11:41:59,433 FATAL org.apache.hadoop.hdfs.server.namenode.FSEditLog: Error: recoverUnfinalizedSegments failed for required journal (JournalAndStream(mgr=QJM to [10.17.49.75:8485, 10.17.49.76:8485, 10.17.49.77:8485], stream=null))
java.io.IOException: Timed out waiting 120000ms for a quorum of nodes to respond.
        at org.apache.hadoop.hdfs.qjournal.client.AsyncLoggerSet.waitForWriteQuorum(AsyncLoggerSet.java:137)
        at org.apache.hadoop.hdfs.qjournal.client.QuorumJournalManager.createNewUniqueEpoch(QuorumJournalManager.java:183)
        at org.apache.hadoop.hdfs.qjournal.client.QuorumJournalManager.recoverUnfinalizedSegments(QuorumJournalManager.java:441)
        at org.apache.hadoop.hdfs.server.namenode.JournalSet$8.apply(JournalSet.java:624)
Around the same time on the journal nodes, we see:
2016-11-07 11:43:30,370 INFO org.apache.hadoop.ipc.Server: IPC Server handler 0 on 8485 caught an exception
java.lang.NullPointerException
        at com.sun.security.sasl.gsskerb.GssKrb5Base.wrap(GssKrb5Base.java:103)
        at org.apache.hadoop.ipc.Server.wrapWithSasl(Server.java:2436)
        at org.apache.hadoop.ipc.Server.setupResponse(Server.java:2392)
        at org.apache.hadoop.ipc.Server.access$2500(Server.java:134)
        at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2131)
2016-11-07 11:43:30,370 INFO org.apache.hadoop.ipc.Server: IPC Server handler 4 on 8485 caught an exception
I don't know if these two errors are related. We've tried increasing the journal-related timeouts, but that just seems to shift the problem around. We are running CDH 5.8.2 on an 8-node test cluster; the three JournalNode processes run on different machines than the two NameNodes. Any pointers on how to debug this would be appreciated.
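For context, the journal-related timeouts we've been raising are the dfs.qjournal.*.timeout.ms properties in hdfs-site.xml. The values below are just illustrative of what we tried, not a recommendation:

```xml
<!-- hdfs-site.xml: QJM client timeouts on the NameNodes.
     Defaults are 120000 ms for most of these (20000 ms for write-txns);
     the getJournalState() warning above corresponds to
     dfs.qjournal.get-journal-state.timeout.ms.
     Example values only; raising them just delayed the failure for us. -->
<property>
  <name>dfs.qjournal.get-journal-state.timeout.ms</name>
  <value>240000</value>
</property>
<property>
  <name>dfs.qjournal.new-epoch.timeout.ms</name>
  <value>240000</value>
</property>
<property>
  <name>dfs.qjournal.write-txns.timeout.ms</name>
  <value>240000</value>
</property>
```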