Created 12-19-2018 12:09 PM
We are running a Hadoop cluster in an HA environment. The NameNode service failed on both the active and standby servers.
As per the logs, it failed because the QJM waited for a quorum of JournalNodes to respond and timed out without receiving a response.
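For reference, the 20000 ms in the messages matches the default QJM write timeout (dfs.qjournal.write-txns.timeout.ms). A minimal sketch to confirm the effective value on the NameNode host, assuming the hdfs CLI is on the PATH:

    # Print the effective QJM write timeout (Hadoop default: 20000 ms)
    hdfs getconf -confKey dfs.qjournal.write-txns.timeout.ms

Raising this property in hdfs-site.xml can buy headroom during JournalNode slowness, but it only masks whatever keeps the JournalNodes from responding in time.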
Logs from Active Node:
2018-12-17 20:57:01,587 WARN client.QuorumJournalManager (QuorumCall.java:waitFor(134)) - Waited 19014 ms (timeout=20000 ms) for a response for sendEdits. No responses yet.
2018-12-17 20:57:02,574 FATAL namenode.FSEditLog (JournalSet.java:mapJournalsAndReportErrors(398)) - Error: flush failed for required journal (JournalAndStream(mgr=QJM to [10.106.8.78:8485, 10.106.8.145:8485, 10.106.8.161:8485], stream=QuorumOutputStream starting at txid 62507384))
java.io.IOException: Timed out waiting 20000ms for a quorum of nodes to respond.
    at org.apache.hadoop.hdfs.qjournal.client.AsyncLoggerSet.waitForWriteQuorum(AsyncLoggerSet.java:137)
    at org.apache.hadoop.hdfs.qjournal.client.QuorumOutputStream.flushAndSync(QuorumOutputStream.java:107)
    at org.apache.hadoop.hdfs.server.namenode.EditLogOutputStream.flush(EditLogOutputStream.java:113)
    at org.apache.hadoop.hdfs.server.namenode.EditLogOutputStream.flush(EditLogOutputStream.java:107)
    at org.apache.hadoop.hdfs.server.namenode.JournalSet$JournalSetOutputStream$8.apply(JournalSet.java:533)
    at org.apache.hadoop.hdfs.server.namenode.JournalSet.mapJournalsAndReportErrors(JournalSet.java:393)
    at org.apache.hadoop.hdfs.server.namenode.JournalSet.access$100(JournalSet.java:57)
    at org.apache.hadoop.hdfs.server.namenode.JournalSet$JournalSetOutputStream.flush(JournalSet.java:529)
    at org.apache.hadoop.hdfs.server.namenode.FSEditLog.logSync(FSEditLog.java:654)
    at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.mkdirs(FSNamesystem.java:4018)
    at org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.mkdirs(NameNodeRpcServer.java:1102)
    at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.mkdirs(ClientNamenodeProtocolServerSideTranslatorPB.java:630)
    at org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java)
    at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:640)
    at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:982)
    at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2313)
    at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2309)
    at java.security.AccessController.doPrivileged(Native Method)
    at javax.security.auth.Subject.doAs(Subject.java:422)
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1724)
    at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2307)
2018-12-17 20:57:02,574 WARN client.QuorumJournalManager (QuorumOutputStream.java:abort(72)) - Aborting QuorumOutputStream starting at txid 62507384
2018-12-17 20:57:02,587 INFO util.ExitUtil (ExitUtil.java:terminate(124)) - Exiting with status 1
2018-12-17 20:57:02,592 INFO provider.AuditProviderFactory (AuditProviderFactory.java:run(501)) - ==> JVMShutdownHook.run()
2018-12-17 20:57:02,592 INFO provider.AuditProviderFactory (AuditProviderFactory.java:run(502)) - JVMShutdownHook: Signalling async audit cleanup to start.
2018-12-17 20:57:02,592 INFO provider.AuditProviderFactory (AuditProviderFactory.java:run(477)) - RangerAsyncAuditCleanup: Starting cleanup
2018-12-17 20:57:02,592 INFO provider.AuditProviderFactory (AuditProviderFactory.java:run(506)) - JVMShutdownHook: Waiting up to 30 seconds for audit cleanup to finish.
2018-12-17 20:57:02,593 INFO queue.AuditBatchQueue (AuditBatchQueue.java:runLogAudit(281)) - Caught exception in consumer thread. Shutdown might be in progress
2018-12-17 20:57:02,593 INFO queue.AuditBatchQueue (AuditBatchQueue.java:runLogAudit(332)) - Queue is not empty. Will retry. queue.size)=838857, localBatchBuffer.size()=0
2018-12-17 20:57:02,605 INFO queue.AuditBatchQueue (AuditBatchQueue.java:runLogAudit(332)) - Queue is not empty. Will retry. queue.size)=837857, localBatchBuffer.size()=0
2018-12-17 20:57:12,571 INFO queue.AuditBatchQueue (AuditBatchQueue.java:runLogAudit(332)) - Queue is not empty. Will retry. queue.size)=12859, localBatchBuffer.size()=0
2018-12-17 20:57:12,583 INFO queue.AuditBatchQueue (AuditBatchQueue.java:runLogAudit(332)) - Queue is not empty. Will retry. queue.size)=11859, localBatchBuffer.size()=0
2018-12-17 20:57:12,592 WARN util.ShutdownHookManager (ShutdownHookManager.java:run(70)) - ShutdownHook 'JVMShutdownHook' timeout, java.util.concurrent.TimeoutException
java.util.concurrent.TimeoutException
    at java.util.concurrent.FutureTask.get(FutureTask.java:205)
    at org.apache.hadoop.util.ShutdownHookManager$1.run(ShutdownHookManager.java:67)
2018-12-17 20:57:12,592 INFO provider.AuditProviderFactory (AuditProviderFactory.java:run(514)) - JVMShutdownHook: Interrupted while waiting for completion of Async executor!
java.lang.InterruptedException
    at java.util.concurrent.locks.AbstractQueuedSynchronizer.doAcquireSharedNanos(AbstractQueuedSynchronizer.java:1039)
    at java.util.concurrent.locks.AbstractQueuedSynchronizer.tryAcquireSharedNanos(AbstractQueuedSynchronizer.java:1328)
    at java.util.concurrent.Semaphore.tryAcquire(Semaphore.java:409)
    at org.apache.ranger.audit.provider.AuditProviderFactory$JVMShutdownHook.run(AuditProviderFactory.java:507)
    at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
    at java.util.concurrent.FutureTask.run(FutureTask.java:266)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
    at java.lang.Thread.run(Thread.java:745)
2018-12-17 20:57:12,592 INFO provider.AuditProviderFactory (AuditProviderFactory.java:run(516)) - JVMShutdownHook: Interrupting ranger async audit cleanup thread
2018-12-17 20:57:12,592 INFO provider.AuditProviderFactory (AuditProviderFactory.java:run(518)) - <== JVMShutdownHook.run()
2018-12-17 20:57:12,592 INFO queue.AuditAsyncQueue (AuditAsyncQueue.java:stop(106)) - Stop called. name=hdfs.async
2018-12-17 20:57:12,593 INFO queue.AuditAsyncQueue (AuditAsyncQueue.java:stop(110)) - Interrupting consumerThread. name=hdfs.async, consumer=hdfs.async.multi_dest
2018-12-17 20:57:12,593 INFO provider.AuditProviderFactory (AuditProviderFactory.java:run(481)) - RangerAsyncAuditCleanup: Done cleanup
2018-12-17 20:57:12,593 INFO provider.AuditProviderFactory (AuditProviderFactory.java:run(470)) - RangerAsyncAuditCleanup: Waiting to audit cleanup start signal
2018-12-17 20:57:12,593 INFO queue.AuditAsyncQueue (AuditAsyncQueue.java:runLogAudit(155)) - Caught exception in consumer thread. Shutdown might be in progress
2018-12-17 20:57:12,596 INFO queue.AuditAsyncQueue (AuditAsyncQueue.java:runLogAudit(171)) - Exiting polling loop. name=hdfs.async
2018-12-17 20:57:12,596 INFO queue.AuditAsyncQueue (AuditAsyncQueue.java:runLogAudit(175)) - Calling to stop consumer. name=hdfs.async, consumer.name=hdfs.async.multi_dest
2018-12-17 20:57:12,596 INFO namenode.NameNode (LogAdapter.java:info(47)) - SHUTDOWN_MSG:
Logs from Standby Node:
2018-12-17 20:55:17,191 INFO authorize.ServiceAuthorizationManager (ServiceAuthorizationManager.java:authorize(137)) - Authorization successful for yarn/prdhdpmn1.example.com@example.COM (auth:KERBEROS) for protocol=interface org.apache.hadoop.hdfs.protocol.ClientProtocol
2018-12-17 20:55:17,742 INFO provider.BaseAuditHandler (BaseAuditHandler.java:logStatus(312)) - Audit Status Log: name=hdfs.async.multi_dest.batch.solr, interval=01:00.001 minutes, events=3, succcessCount=3, totalEvents=41545284, totalSuccessCount=35647876, totalFailedCount=5897408
2018-12-17 20:55:17,743 INFO BlockStateChange (BlockManager.java:computeReplicationWorkForBlocks(1588)) - BLOCK* neededReplications = 0, pendingReplications = 0.
2018-12-17 20:55:17,780 INFO provider.BaseAuditHandler (BaseAuditHandler.java:logStatus(312)) - Audit Status Log: name=hdfs.async.multi_dest.batch.hdfs, interval=01:00.184 minutes, events=1000, deferredCount=1000, totalEvents=11155910, totalSuccessCount=58910, totalDeferredCount=11097000
2018-12-17 20:55:17,780 INFO destination.HDFSAuditDestination (HDFSAuditDestination.java:createConfiguration(310)) - Returning HDFS Filesystem Config: Configuration: core-default.xml, core-site.xml, hdfs-default.xml, hdfs-site.xml, mapred-default.xml, mapred-site.xml, yarn-default.xml, yarn-site.xml
2018-12-17 20:55:17,794 INFO destination.HDFSAuditDestination (HDFSAuditDestination.java:getLogFileStream(271)) - Checking whether log file exists. hdfPath=hdfs://prdhdpmn1.example.com:8020/ranger/audit/hdfs/20181217/hdfs_ranger_audit_prdhdpmn2.example.com.log, UGI=nn/prdhdpmn2.example.com@example.COM (auth:KERBEROS)
2018-12-17 20:55:17,803 WARN retry.RetryInvocationHandler (RetryInvocationHandler.java:handleException(217)) - Exception while invoking ClientNamenodeProtocolTranslatorPB.getFileInfo over null. Not retrying because try once and fail.
org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.ipc.StandbyException): Operation category READ is not supported in state standby
    at org.apache.hadoop.hdfs.server.namenode.ha.StandbyState.checkOperation(StandbyState.java:87)
    at org.apache.hadoop.hdfs.server.namenode.NameNode$NameNodeHAContext.checkOperation(NameNode.java:1979)
    at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.checkOperation(FSNamesystem.java:1345)
    at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getFileInfo(FSNamesystem.java:3967)
    at org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.getFileInfo(NameNodeRpcServer.java:1130)
    at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.getFileInfo(ClientNamenodeProtocolServerSideTranslatorPB.java:851)
    at org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java)
    at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:640)
    at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:982)
    at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2313)
    at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2309)
    at java.security.AccessController.doPrivileged(Native Method)
    at javax.security.auth.Subject.doAs(Subject.java:422)
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1724)
    at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2307)
    at org.apache.hadoop.ipc.Client.getRpcResponse(Client.java:1552)
    at org.apache.hadoop.ipc.Client.call(Client.java:1496)
    at org.apache.hadoop.ipc.Client.call(Client.java:1396)
    at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:233)
    at com.sun.proxy.$Proxy33.getFileInfo(Unknown Source)
    at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolTranslatorPB.getFileInfo(ClientNamenodeProtocolTranslatorPB.java:816)
    at sun.reflect.GeneratedMethodAccessor271.invoke(Unknown Source)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:498)
    at org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:278)
    at org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:194)
    at org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:176)
    at com.sun.proxy.$Proxy34.getFileInfo(Unknown Source)
    at org.apache.hadoop.hdfs.DFSClient.getFileInfo(DFSClient.java:2158)
    at org.apache.hadoop.hdfs.DistributedFileSystem$25.doCall(DistributedFileSystem.java:1423)
    at org.apache.hadoop.hdfs.DistributedFileSystem$25.doCall(DistributedFileSystem.java:1419)
    at org.apache.hadoop.fs.FileSystemLinkResolver.resolve(FileSystemLinkResolver.java:81)
    at org.apache.hadoop.hdfs.DistributedFileSystem.getFileStatus(DistributedFileSystem.java:1419)
    at org.apache.hadoop.fs.FileSystem.exists(FileSystem.java:1443)
    at org.apache.ranger.audit.destination.HDFSAuditDestination.getLogFileStream(HDFSAuditDestination.java:273)
    at org.apache.ranger.audit.destination.HDFSAuditDestination.access$000(HDFSAuditDestination.java:44)
    at org.apache.ranger.audit.destination.HDFSAuditDestination$1.run(HDFSAuditDestination.java:159)
    at org.apache.ranger.audit.destination.HDFSAuditDestination$1.run(HDFSAuditDestination.java:156)
    at java.security.AccessController.doPrivileged(Native Method)
    at javax.security.auth.Subject.doAs(Subject.java:422)
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1724)
    at org.apache.ranger.audit.destination.HDFSAuditDestination.logJSON(HDFSAuditDestination.java:170)
    at org.apache.ranger.audit.queue.AuditFileSpool.sendEvent(AuditFileSpool.java:880)
    at org.apache.ranger.audit.queue.AuditFileSpool.runLogAudit(AuditFileSpool.java:819)
    at org.apache.ranger.audit.queue.AuditFileSpool.run(AuditFileSpool.java:758)
    at java.lang.Thread.run(Thread.java:745)
2018-12-17 20:55:17,803 ERROR provider.BaseAuditHandler (BaseAuditHandler.java:logError(329)) - Error writing to log file.
org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.ipc.StandbyException): Operation category READ is not supported in state standby
    at org.apache.hadoop.hdfs.server.namenode.ha.StandbyState.checkOperation(StandbyState.java:87)
    at org.apache.hadoop.hdfs.server.namenode.NameNode$NameNodeHAContext.checkOperation(NameNode.java:1979)
    at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.checkOperation(FSNamesystem.java:1345)
    at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getFileInfo(FSNamesystem.java:3967)
    at org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.getFileInfo(NameNodeRpcServer.java:1130)
    at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.getFileInfo(ClientNamenodeProtocolServerSideTranslatorPB.java:851)
    at org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java)
    at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:640)
    at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:982)
    at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2313)
    at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2309)
    at java.security.AccessController.doPrivileged(Native Method)
    at javax.security.auth.Subject.doAs(Subject.java:422)
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1724)
    at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2307)
    at org.apache.hadoop.ipc.Client.getRpcResponse(Client.java:1552)
    at org.apache.hadoop.ipc.Client.call(Client.java:1496)
    at org.apache.hadoop.ipc.Client.call(Client.java:1396)
    at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:233)
    at com.sun.proxy.$Proxy33.getFileInfo(Unknown Source)
    at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolTranslatorPB.getFileInfo(ClientNamenodeProtocolTranslatorPB.java:816)
    at sun.reflect.GeneratedMethodAccessor271.invoke(Unknown Source)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:498)
    at org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:278)
    at org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:194)
    at org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:176)
    at com.sun.proxy.$Proxy34.getFileInfo(Unknown Source)
    at org.apache.hadoop.hdfs.DFSClient.getFileInfo(DFSClient.java:2158)
    at org.apache.hadoop.hdfs.DistributedFileSystem$25.doCall(DistributedFileSystem.java:1423)
    at org.apache.hadoop.hdfs.DistributedFileSystem$25.doCall(DistributedFileSystem.java:1419)
    at org.apache.hadoop.fs.FileSystemLinkResolver.resolve(FileSystemLinkResolver.java:81)
    at org.apache.hadoop.hdfs.DistributedFileSystem.getFileStatus(DistributedFileSystem.java:1419)
    at org.apache.hadoop.fs.FileSystem.exists(FileSystem.java:1443)
    at org.apache.ranger.audit.destination.HDFSAuditDestination.getLogFileStream(HDFSAuditDestination.java:273)
    at org.apache.ranger.audit.destination.HDFSAuditDestination.access$000(HDFSAuditDestination.java:44)
    at org.apache.ranger.audit.destination.HDFSAuditDestination$1.run(HDFSAuditDestination.java:159)
    at org.apache.ranger.audit.destination.HDFSAuditDestination$1.run(HDFSAuditDestination.java:156)
    at java.security.AccessController.doPrivileged(Native Method)
    at javax.security.auth.Subject.doAs(Subject.java:422)
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1724)
    at org.apache.ranger.audit.destination.HDFSAuditDestination.logJSON(HDFSAuditDestination.java:170)
    at org.apache.ranger.audit.queue.AuditFileSpool.sendEvent(AuditFileSpool.java:880)
    at org.apache.ranger.audit.queue.AuditFileSpool.runLogAudit(AuditFileSpool.java:819)
    at org.apache.ranger.audit.queue.AuditFileSpool.run(AuditFileSpool.java:758)
    at java.lang.Thread.run(Thread.java:745)
2018-12-17 20:55:17,803 ERROR queue.AuditFileSpool (AuditFileSpool.java:logError(710)) - Error sending logs to consumer. provider=hdfs.async.multi_dest.batch, consumer=hdfs.async.multi_dest.batch.hdfs
2018-12-17 20:55:17,804 INFO queue.AuditFileSpool (AuditFileSpool.java:runLogAudit(770)) - Destination is down. sleeping for 30000 milli seconds. indexQueue=2, queueName=hdfs.async.multi_dest.batch, consumer=hdfs.async.multi_dest.batch.hdfs
2018-12-17 20:55:20,744 INFO BlockStateChange (BlockManager.java:computeReplicationWorkForBlocks(1588)) - BLOCK* neededReplications = 0, pendingReplications = 0.
2018-12-17 20:55:23,744 INFO BlockStateChange (BlockManager.java:computeReplicationWorkForBlocks(1588)) - BLOCK* neededReplications = 0, pendingReplications = 0.
2018-12-17 20:56:26,779 INFO BlockStateChange (BlockManager.java:computeReplicationWorkForBlocks(1588)) - BLOCK* neededReplications = 0, pendingReplications = 0.
2018-12-17 20:56:29,779 INFO BlockStateChange (BlockManager.java:computeReplicationWorkForBlocks(1588)) - BLOCK* neededReplications = 0, pendingReplications = 0.
2018-12-17 20:56:30,179 ERROR client.RangerAdminRESTClient (RangerAdminRESTClient.java:getServicePoliciesIfUpdated(124)) - Error getting policies. secureMode=true, user=nn/prdhdpmn2.example.com@example.COM (auth:KERBEROS), response={"httpStatusCode":401,"statusCode":0}, serviceName=ABC_hadoop
2018-12-17 20:56:30,179 ERROR util.PolicyRefresher (PolicyRefresher.java:loadPolicyfromPolicyAdmin(255)) - PolicyRefresher(serviceName=ABC_hadoop): failed to refresh policies. Will continue to use last known version of policies (222)
java.lang.Exception: HTTP 401
    at org.apache.ranger.admin.client.RangerAdminRESTClient.getServicePoliciesIfUpdated(RangerAdminRESTClient.java:126)
    at org.apache.ranger.plugin.util.PolicyRefresher.loadPolicyfromPolicyAdmin(PolicyRefresher.java:232)
    at org.apache.ranger.plugin.util.PolicyRefresher.loadPolicy(PolicyRefresher.java:188)
    at org.apache.ranger.plugin.util.PolicyRefresher.run(PolicyRefresher.java:158)
2018-12-17 20:56:32,780 INFO BlockStateChange (BlockManager.java:computeReplicationWorkForBlocks(1588)) - BLOCK* neededReplications = 0, pendingReplications = 0.
2018-12-17 20:56:35,780 INFO BlockStateChange (BlockManager.java:computeReplicationWorkForBlocks(1588)) - BLOCK* neededReplications = 0, pendingReplications = 0.
2018-12-17 20:56:38,780 INFO BlockStateChange (BlockManager.java:computeReplicationWorkForBlocks(1588)) - BLOCK* neededReplications = 0, pendingReplications = 0.
2018-12-17 20:56:38,901 INFO ipc.Server (Server.java:saslProcess(1538)) - Auth successful for jhs/prdhdpmn1.example.com@example.COM (auth:KERBEROS)
2018-12-17 20:56:38,902 INFO authorize.ServiceAuthorizationManager (ServiceAuthorizationManager.java:authorize(137)) - Authorization successful for jhs/prdhdpmn1.example.com@example.COM (auth:KERBEROS) for protocol=interface org.apache.hadoop.hdfs.protocol.ClientProtocol
2018-12-17 20:56:41,780 INFO BlockStateChange (BlockManager.java:computeReplicationWorkForBlocks(1588)) - BLOCK* neededReplications = 0, pendingReplications = 0.
2018-12-17 20:56:42,500 INFO ipc.Server (Server.java:saslProcess(1538)) - Auth successful for ambari-qa-ABC@example.COM (auth:KERBEROS)
2018-12-17 20:56:42,501 INFO authorize.ServiceAuthorizationManager (ServiceAuthorizationManager.java:authorize(137)) - Authorization successful for ambari-qa-ABC@example.COM (auth:KERBEROS) for protocol=interface org.apache.hadoop.hdfs.protocol.ClientProtocol
2018-12-17 20:56:42,574 INFO namenode.FSEditLog (FSEditLog.java:printStatistics(716)) - Number of transactions: 2 Total time for transactions(ms): 2 Number of transactions batched in Syncs: 0 Number of syncs: 1 SyncTimes(ms): 2 5
2018-12-17 20:56:43,687 INFO ipc.Server (Server.java:saslProcess(1538)) - Auth successful for user1@example.COM (auth:KERBEROS)
2018-12-17 20:56:43,687 INFO authorize.ServiceAuthorizationManager (ServiceAuthorizationManager.java:authorize(137)) - Authorization successful for user1@example.COM (auth:KERBEROS) for protocol=interface org.apache.hadoop.hdfs.protocol.ClientProtocol
2018-12-17 20:56:44,781 INFO BlockStateChange (BlockManager.java:computeReplicationWorkForBlocks(1588)) - BLOCK* neededReplications = 0, pendingReplications = 0.
2018-12-17 20:56:47,781 INFO BlockStateChange (BlockManager.java:computeReplicationWorkForBlocks(1588)) - BLOCK* neededReplications = 0, pendingReplications = 0.
2018-12-17 20:56:48,574 INFO client.QuorumJournalManager (QuorumCall.java:waitFor(136)) - Waited 6001 ms (timeout=20000 ms) for a response for sendEdits. No responses yet.
2018-12-17 20:56:49,576 INFO client.QuorumJournalManager (QuorumCall.java:waitFor(136)) - Waited 7002 ms (timeout=20000 ms) for a response for sendEdits. No responses yet.
2018-12-17 20:57:00,587 WARN client.QuorumJournalManager (QuorumCall.java:waitFor(134)) - Waited 18013 ms (timeout=20000 ms) for a response for sendEdits. No responses yet.
2018-12-17 20:57:01,587 WARN client.QuorumJournalManager (QuorumCall.java:waitFor(134)) - Waited 19014 ms (timeout=20000 ms) for a response for sendEdits. No responses yet.
2018-12-17 20:57:02,574 FATAL namenode.FSEditLog (JournalSet.java:mapJournalsAndReportErrors(398)) - Error: flush failed for required journal (JournalAndStream(mgr=QJM to [<JN1_IP>:8485, <JN2_IP>:8485, <JN3_IP>:8485], stream=QuorumOutputStream starting at txid 62507384))
java.io.IOException: Timed out waiting 20000ms for a quorum of nodes to respond.
    at org.apache.hadoop.hdfs.qjournal.client.AsyncLoggerSet.waitForWriteQuorum(AsyncLoggerSet.java:137)
    at org.apache.hadoop.hdfs.qjournal.client.QuorumOutputStream.flushAndSync(QuorumOutputStream.java:107)
    at org.apache.hadoop.hdfs.server.namenode.EditLogOutputStream.flush(EditLogOutputStream.java:113)
    at org.apache.hadoop.hdfs.server.namenode.EditLogOutputStream.flush(EditLogOutputStream.java:107)
    at org.apache.hadoop.hdfs.server.namenode.JournalSet$JournalSetOutputStream$8.apply(JournalSet.java:533)
    at org.apache.hadoop.hdfs.server.namenode.JournalSet.mapJournalsAndReportErrors(JournalSet.java:393)
    at org.apache.hadoop.hdfs.server.namenode.JournalSet.access$100(JournalSet.java:57)
    at org.apache.hadoop.hdfs.server.namenode.JournalSet$JournalSetOutputStream.flush(JournalSet.java:529)
    at org.apache.hadoop.hdfs.server.namenode.FSEditLog.logSync(FSEditLog.java:654)
    at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.mkdirs(FSNamesystem.java:4018)
    at org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.mkdirs(NameNodeRpcServer.java:1102)
    at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.mkdirs(ClientNamenodeProtocolServerSideTranslatorPB.java:630)
    at org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java)
    at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:640)
    at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:982)
    at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2313)
    at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2309)
    at java.security.AccessController.doPrivileged(Native Method)
    at javax.security.auth.Subject.doAs(Subject.java:422)
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1724)
    at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2307)
2018-12-17 20:57:02,574 WARN client.QuorumJournalManager (QuorumOutputStream.java:abort(72)) - Aborting QuorumOutputStream starting at txid 62507384
2018-12-17 20:57:02,587 INFO util.ExitUtil (ExitUtil.java:terminate(124)) - Exiting with status 1
2018-12-17 20:57:02,592 INFO provider.AuditProviderFactory (AuditProviderFactory.java:run(501)) - ==> JVMShutdownHook.run()
2018-12-17 20:57:02,592 INFO provider.AuditProviderFactory (AuditProviderFactory.java:run(502)) - JVMShutdownHook: Signalling async audit cleanup to start.
2018-12-17 20:57:02,592 INFO provider.AuditProviderFactory (AuditProviderFactory.java:run(477)) - RangerAsyncAuditCleanup: Starting cleanup
2018-12-17 20:57:02,592 INFO provider.AuditProviderFactory (AuditProviderFactory.java:run(506)) - JVMShutdownHook: Waiting up to 30 seconds for audit cleanup to finish.
2018-12-17 20:57:02,593 INFO queue.AuditBatchQueue (AuditBatchQueue.java:runLogAudit(281)) - Caught exception in consumer thread. Shutdown might be in progress
2018-12-17 20:57:02,593 INFO queue.AuditBatchQueue (AuditBatchQueue.java:runLogAudit(332)) - Queue is not empty. Will retry. queue.size)=838857, localBatchBuffer.size()=0
2018-12-17 20:57:02,605 INFO queue.AuditBatchQueue (AuditBatchQueue.java:runLogAudit(332)) - Queue is not empty. Will retry. queue.size)=837857, localBatchBuffer.size()=0
2018-12-17 20:57:12,583 INFO queue.AuditBatchQueue (AuditBatchQueue.java:runLogAudit(332)) - Queue is not empty. Will retry. queue.size)=11859, localBatchBuffer.size()=0
2018-12-17 20:57:12,592 WARN util.ShutdownHookManager (ShutdownHookManager.java:run(70)) - ShutdownHook 'JVMShutdownHook' timeout, java.util.concurrent.TimeoutException
java.util.concurrent.TimeoutException
    at java.util.concurrent.FutureTask.get(FutureTask.java:205)
    at org.apache.hadoop.util.ShutdownHookManager$1.run(ShutdownHookManager.java:67)
2018-12-17 20:57:12,592 INFO provider.AuditProviderFactory (AuditProviderFactory.java:run(514)) - JVMShutdownHook: Interrupted while waiting for completion of Async executor!
java.lang.InterruptedException
    at java.util.concurrent.locks.AbstractQueuedSynchronizer.doAcquireSharedNanos(AbstractQueuedSynchronizer.java:1039)
    at java.util.concurrent.locks.AbstractQueuedSynchronizer.tryAcquireSharedNanos(AbstractQueuedSynchronizer.java:1328)
    at java.util.concurrent.Semaphore.tryAcquire(Semaphore.java:409)
    at org.apache.ranger.audit.provider.AuditProviderFactory$JVMShutdownHook.run(AuditProviderFactory.java:507)
    at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
    at java.util.concurrent.FutureTask.run(FutureTask.java:266)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
    at java.lang.Thread.run(Thread.java:745)
2018-12-17 20:57:12,592 INFO provider.AuditProviderFactory (AuditProviderFactory.java:run(516)) - JVMShutdownHook: Interrupting ranger async audit cleanup thread
2018-12-17 20:57:12,592 INFO provider.AuditProviderFactory (AuditProviderFactory.java:run(518)) - <== JVMShutdownHook.run()
2018-12-17 20:57:12,592 INFO queue.AuditAsyncQueue (AuditAsyncQueue.java:stop(106)) - Stop called. name=hdfs.async
2018-12-17 20:57:12,593 INFO queue.AuditAsyncQueue (AuditAsyncQueue.java:stop(110)) - Interrupting consumerThread. name=hdfs.async, consumer=hdfs.async.multi_dest
2018-12-17 20:57:12,593 INFO provider.AuditProviderFactory (AuditProviderFactory.java:run(481)) - RangerAsyncAuditCleanup: Done cleanup
2018-12-17 20:57:12,593 INFO provider.AuditProviderFactory (AuditProviderFactory.java:run(470)) - RangerAsyncAuditCleanup: Waiting to audit cleanup start signal
2018-12-17 20:57:12,593 INFO queue.AuditAsyncQueue (AuditAsyncQueue.java:runLogAudit(155)) - Caught exception in consumer thread. Shutdown might be in progress
2018-12-17 20:57:12,596 INFO queue.AuditAsyncQueue (AuditAsyncQueue.java:runLogAudit(171)) - Exiting polling loop. name=hdfs.async
2018-12-17 20:57:12,596 INFO queue.AuditAsyncQueue (AuditAsyncQueue.java:runLogAudit(175)) - Calling to stop consumer. name=hdfs.async, consumer.name=hdfs.async.multi_dest
2018-12-17 20:57:12,596 INFO namenode.NameNode (LogAdapter.java:info(47)) - SHUTDOWN_MSG:
There were no errors in the JournalNode logs.
We could not identify why the QJM failed to get a response from the JournalNodes, which caused this failure.
Is there any way to find the root cause of this failure?
Created 12-19-2018 01:34 PM
Hi,
Please check whether the connectivity between the NameNode and the JournalNodes is fine.
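For example, a quick reachability test of the JournalNode RPC port from each NameNode host (a sketch; the JN hostnames below are placeholders, and 8485 is the default JournalNode port):

    # Check that every JournalNode RPC port answers from this NameNode
    for jn in jn1.example.com jn2.example.com jn3.example.com; do
        nc -vz "$jn" 8485
    done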
Also try to check the heap usage of the JournalNodes, as they appear to be very busy.
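A minimal sketch for sampling JournalNode heap/GC pressure, assuming the JDK tools are installed and this runs on the JN host:

    # Find the JournalNode JVM and sample its GC utilization every 5 s, 12 times
    JN_PID=$(pgrep -f org.apache.hadoop.hdfs.qjournal.server.JournalNode)
    jstat -gcutil "$JN_PID" 5000 12
    # One-off view of configured vs. used heap (Java 8 tooling)
    jmap -heap "$JN_PID"

For the time window of the incident itself, the JournalNode GC log under the Hadoop log directory (if GC logging is enabled) is the usual place to look for historical heap behaviour.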
Created 12-19-2018 01:54 PM
We did not find any logs indicating a connectivity issue between the NameNode and the JournalNodes. Everything has been fine since restarting the NameNode services.
Where can I find the heap usage of the JournalNodes during this time period?
Created 09-03-2020 10:26 PM
We are facing the same issue: both NameNodes failed, on both of our clusters.
Any idea what is going wrong?
Created 09-04-2020 03:37 AM
Hello @Atradius,
thank you for reaching out to the Community with your issue of having both NameNodes down.
Do you see entries in your NameNode log from JvmPauseMonitor saying "Detected pause in JVM or host machine" with a value larger than 1000 ms? That can be an indication that your service is running out of heap. If it is the NameNode, the short-term solution is to increase the heap and restart the service. The long-term solution is to identify why you ran out of heap: for example, do you face a small-files issue? Please read article [1] about how to tackle this.
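A quick way to check for such pauses, as a sketch (the log path below is an assumption; adjust it to your distribution's layout):

    # Look for long JVM pauses reported by the NameNode's JvmPauseMonitor
    grep "Detected pause in JVM or host machine" \
        /var/log/hadoop/hdfs/hadoop-hdfs-namenode-*.log*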
Losing quorum might also be caused by a ZooKeeper service issue, when ZK itself is not in quorum. Please check the ZK logs as well.
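ZK quorum can be verified quickly with the four-letter-word commands, sketched below (assumes nc, the default client port 2181, placeholder hostnames, and that the srvr command is permitted on your ZK version):

    # A healthy ensemble shows one "Mode: leader" and the rest "Mode: follower"
    for zk in zk1.example.com zk2.example.com zk3.example.com; do
        echo "$zk: $(echo srvr | nc "$zk" 2181 | grep Mode)"
    done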
Please let us know if you need more input to progress with your investigation.
Best regards:
Ferenc
Ferenc Erdelyi, Technical Solutions Manager