Support Questions

Find answers, ask questions, and share your expertise

Namenode service failure in Active and standby server

avatar
Contributor

We are running Hadoop cluster in HA environment. The name node service got failed in active and standby servers.

As per the logs, its failed due to QJM waited for quorum and did not get response then timed out.

Logs from Active Node:

2018-12-17 20:57:01,587 WARN  client.QuorumJournalManager (QuorumCall.java:waitFor(134)) - Waited 19014 ms (timeout=20000 ms) for a response for sendEdits. No responses yet.
2018-12-17 20:57:02,574 FATAL namenode.FSEditLog (JournalSet.java:mapJournalsAndReportErrors(398)) - Error: flush failed for required journal (JournalAndStream(mgr=QJM to [10.106.8.78:8485, 10.106.8.145:8485, 10.106.8.161:8485], stream=QuorumOutputStream starting at txid 62507384))
java.io.IOException: Timed out waiting 20000ms for a quorum of nodes to respond.
        at org.apache.hadoop.hdfs.qjournal.client.AsyncLoggerSet.waitForWriteQuorum(AsyncLoggerSet.java:137)
        at org.apache.hadoop.hdfs.qjournal.client.QuorumOutputStream.flushAndSync(QuorumOutputStream.java:107)
        at org.apache.hadoop.hdfs.server.namenode.EditLogOutputStream.flush(EditLogOutputStream.java:113)
        at org.apache.hadoop.hdfs.server.namenode.EditLogOutputStream.flush(EditLogOutputStream.java:107)
        at org.apache.hadoop.hdfs.server.namenode.JournalSet$JournalSetOutputStream$8.apply(JournalSet.java:533)
        at org.apache.hadoop.hdfs.server.namenode.JournalSet.mapJournalsAndReportErrors(JournalSet.java:393)
        at org.apache.hadoop.hdfs.server.namenode.JournalSet.access$100(JournalSet.java:57)
        at org.apache.hadoop.hdfs.server.namenode.JournalSet$JournalSetOutputStream.flush(JournalSet.java:529)
        at org.apache.hadoop.hdfs.server.namenode.FSEditLog.logSync(FSEditLog.java:654)
        at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.mkdirs(FSNamesystem.java:4018)
        at org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.mkdirs(NameNodeRpcServer.java:1102)
        at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.mkdirs(ClientNamenodeProtocolServerSideTranslatorPB.java:630)
        at org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java)
        at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:640)
        at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:982)
        at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2313)
        at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2309)
        at java.security.AccessController.doPrivileged(Native Method)
        at javax.security.auth.Subject.doAs(Subject.java:422)
        at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1724)
        at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2307)
2018-12-17 20:57:02,574 WARN  client.QuorumJournalManager (QuorumOutputStream.java:abort(72)) - Aborting QuorumOutputStream starting at txid 62507384
2018-12-17 20:57:02,587 INFO  util.ExitUtil (ExitUtil.java:terminate(124)) - Exiting with status 1
2018-12-17 20:57:02,592 INFO  provider.AuditProviderFactory (AuditProviderFactory.java:run(501)) - ==> JVMShutdownHook.run()
2018-12-17 20:57:02,592 INFO  provider.AuditProviderFactory (AuditProviderFactory.java:run(502)) - JVMShutdownHook: Signalling async audit cleanup to start.
2018-12-17 20:57:02,592 INFO  provider.AuditProviderFactory (AuditProviderFactory.java:run(477)) - RangerAsyncAuditCleanup: Starting cleanup
2018-12-17 20:57:02,592 INFO  provider.AuditProviderFactory (AuditProviderFactory.java:run(506)) - JVMShutdownHook: Waiting up to 30 seconds for audit cleanup to finish.
2018-12-17 20:57:02,593 INFO  queue.AuditBatchQueue (AuditBatchQueue.java:runLogAudit(281)) - Caught exception in consumer thread. Shutdown might be in progress
2018-12-17 20:57:02,593 INFO  queue.AuditBatchQueue (AuditBatchQueue.java:runLogAudit(332)) - Queue is not empty. Will retry. queue.size)=838857, localBatchBuffer.size()=0
2018-12-17 20:57:02,605 INFO  queue.AuditBatchQueue (AuditBatchQueue.java:runLogAudit(332)) - Queue is not empty. Will retry. queue.size)=837857, localBatchBuffer.size(








2018-12-17 20:57:12,571 INFO  queue.AuditBatchQueue (AuditBatchQueue.java:runLogAudit(332)) - Queue is not empty. Will retry. queue.size)=12859, localBatchBuffer.size()=0
2018-12-17 20:57:12,583 INFO  queue.AuditBatchQueue (AuditBatchQueue.java:runLogAudit(332)) - Queue is not empty. Will retry. queue.size)=11859, localBatchBuffer.size()=0
2018-12-17 20:57:12,592 WARN  util.ShutdownHookManager (ShutdownHookManager.java:run(70)) - ShutdownHook 'JVMShutdownHook' timeout, java.util.concurrent.TimeoutException
java.util.concurrent.TimeoutException
        at java.util.concurrent.FutureTask.get(FutureTask.java:205)
        at org.apache.hadoop.util.ShutdownHookManager$1.run(ShutdownHookManager.java:67)
2018-12-17 20:57:12,592 INFO  provider.AuditProviderFactory (AuditProviderFactory.java:run(514)) - JVMShutdownHook: Interrupted while waiting for completion of Async executor!
java.lang.InterruptedException
        at java.util.concurrent.locks.AbstractQueuedSynchronizer.doAcquireSharedNanos(AbstractQueuedSynchronizer.java:1039)
        at java.util.concurrent.locks.AbstractQueuedSynchronizer.tryAcquireSharedNanos(AbstractQueuedSynchronizer.java:1328)
        at java.util.concurrent.Semaphore.tryAcquire(Semaphore.java:409)
        at org.apache.ranger.audit.provider.AuditProviderFactory$JVMShutdownHook.run(AuditProviderFactory.java:507)
        at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
        at java.util.concurrent.FutureTask.run(FutureTask.java:266)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
        at java.lang.Thread.run(Thread.java:745)
2018-12-17 20:57:12,592 INFO  provider.AuditProviderFactory (AuditProviderFactory.java:run(516)) - JVMShutdownHook: Interrupting ranger async audit cleanup thread
2018-12-17 20:57:12,592 INFO  provider.AuditProviderFactory (AuditProviderFactory.java:run(518)) - <== JVMShutdownHook.run()
2018-12-17 20:57:12,592 INFO  queue.AuditAsyncQueue (AuditAsyncQueue.java:stop(106)) - Stop called. name=hdfs.async
2018-12-17 20:57:12,593 INFO  queue.AuditAsyncQueue (AuditAsyncQueue.java:stop(110)) - Interrupting consumerThread. name=hdfs.async, consumer=hdfs.async.multi_dest
2018-12-17 20:57:12,593 INFO  provider.AuditProviderFactory (AuditProviderFactory.java:run(481)) - RangerAsyncAuditCleanup: Done cleanup
2018-12-17 20:57:12,593 INFO  provider.AuditProviderFactory (AuditProviderFactory.java:run(470)) - RangerAsyncAuditCleanup: Waiting to audit cleanup start signal
2018-12-17 20:57:12,593 INFO  queue.AuditAsyncQueue (AuditAsyncQueue.java:runLogAudit(155)) - Caught exception in consumer thread. Shutdown might be in progress
2018-12-17 20:57:12,596 INFO  queue.AuditAsyncQueue (AuditAsyncQueue.java:runLogAudit(171)) - Exiting polling loop. name=hdfs.async
2018-12-17 20:57:12,596 INFO  queue.AuditAsyncQueue (AuditAsyncQueue.java:runLogAudit(175)) - Calling to stop consumer. name=hdfs.async, consumer.name=hdfs.async.multi_dest
2018-12-17 20:57:12,596 INFO  namenode.NameNode (LogAdapter.java:info(47)) - SHUTDOWN_MSG:

Logs from StandbyNode:

2018-12-17 20:55:17,191 INFO  authorize.ServiceAuthorizationManager (ServiceAuthorizationManager.java:authorize(137)) - Authorization successful for yarn/prdhdpmn1.example.com@example.COM (auth:KERBEROS) for protocol=interface org.apache.hadoop.hdfs.protocol.ClientProtocol
2018-12-17 20:55:17,742 INFO  provider.BaseAuditHandler (BaseAuditHandler.java:logStatus(312)) - Audit Status Log: name=hdfs.async.multi_dest.batch.solr, interval=01:00.001 minutes, events=3, succcessCount=3, totalEvents=41545284, totalSuccessCount=35647876, totalFailedCount=5897408
2018-12-17 20:55:17,743 INFO  BlockStateChange (BlockManager.java:computeReplicationWorkForBlocks(1588)) - BLOCK* neededReplications = 0, pendingReplications = 0.
2018-12-17 20:55:17,780 INFO  provider.BaseAuditHandler (BaseAuditHandler.java:logStatus(312)) - Audit Status Log: name=hdfs.async.multi_dest.batch.hdfs, interval=01:00.184 minutes, events=1000, deferredCount=1000, totalEvents=11155910, totalSuccessCount=58910, totalDeferredCount=11097000
2018-12-17 20:55:17,780 INFO  destination.HDFSAuditDestination (HDFSAuditDestination.java:createConfiguration(310)) - Returning HDFS Filesystem Config: Configuration: core-default.xml, core-site.xml, hdfs-default.xml, hdfs-site.xml, mapred-default.xml, mapred-site.xml, yarn-default.xml, yarn-site.xml
2018-12-17 20:55:17,794 INFO  destination.HDFSAuditDestination (HDFSAuditDestination.java:getLogFileStream(271)) - Checking whether log file exists. hdfPath=hdfs://prdhdpmn1.example.com:8020/ranger/audit/hdfs/20181217/hdfs_ranger_audit_prdhdpmn2.example.com.log, UGI=nn/prdhdpmn2.example.com@example.COM (auth:KERBEROS)
2018-12-17 20:55:17,803 WARN  retry.RetryInvocationHandler (RetryInvocationHandler.java:handleException(217)) - Exception while invoking ClientNamenodeProtocolTranslatorPB.getFileInfo over null. Not retrying because try once and fail.
org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.ipc.StandbyException): Operation category READ is not supported in state standby
        at org.apache.hadoop.hdfs.server.namenode.ha.StandbyState.checkOperation(StandbyState.java:87)
        at org.apache.hadoop.hdfs.server.namenode.NameNode$NameNodeHAContext.checkOperation(NameNode.java:1979)
        at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.checkOperation(FSNamesystem.java:1345)
        at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getFileInfo(FSNamesystem.java:3967)
        at org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.getFileInfo(NameNodeRpcServer.java:1130)
        at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.getFileInfo(ClientNamenodeProtocolServerSideTranslatorPB.java:851)
        at org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java)
        at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:640)
        at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:982)
        at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2313)
        at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2309)
        at java.security.AccessController.doPrivileged(Native Method)
        at javax.security.auth.Subject.doAs(Subject.java:422)
        at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1724)
        at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2307)


        at org.apache.hadoop.ipc.Client.getRpcResponse(Client.java:1552)
        at org.apache.hadoop.ipc.Client.call(Client.java:1496)
        at org.apache.hadoop.ipc.Client.call(Client.java:1396)
        at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:233)
        at com.sun.proxy.$Proxy33.getFileInfo(Unknown Source)
        at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolTranslatorPB.getFileInfo(ClientNamenodeProtocolTranslatorPB.java:816)
        at sun.reflect.GeneratedMethodAccessor271.invoke(Unknown Source)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
        at java.lang.reflect.Method.invoke(Method.java:498)
        at org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:278)
        at org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:194)
        at org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:176)
        at com.sun.proxy.$Proxy34.getFileInfo(Unknown Source)
        at org.apache.hadoop.hdfs.DFSClient.getFileInfo(DFSClient.java:2158)
        at org.apache.hadoop.hdfs.DistributedFileSystem$25.doCall(DistributedFileSystem.java:1423)
        at org.apache.hadoop.hdfs.DistributedFileSystem$25.doCall(DistributedFileSystem.java:1419)
        at org.apache.hadoop.fs.FileSystemLinkResolver.resolve(FileSystemLinkResolver.java:81)
        at org.apache.hadoop.hdfs.DistributedFileSystem.getFileStatus(DistributedFileSystem.java:1419)
        at org.apache.hadoop.fs.FileSystem.exists(FileSystem.java:1443)
        at org.apache.ranger.audit.destination.HDFSAuditDestination.getLogFileStream(HDFSAuditDestination.java:273)
        at org.apache.ranger.audit.destination.HDFSAuditDestination.access$000(HDFSAuditDestination.java:44)
        at org.apache.ranger.audit.destination.HDFSAuditDestination$1.run(HDFSAuditDestination.java:159)
        at org.apache.ranger.audit.destination.HDFSAuditDestination$1.run(HDFSAuditDestination.java:156)
        at java.security.AccessController.doPrivileged(Native Method)
        at javax.security.auth.Subject.doAs(Subject.java:422)
        at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1724)
        at org.apache.ranger.audit.destination.HDFSAuditDestination.logJSON(HDFSAuditDestination.java:170)
        at org.apache.ranger.audit.queue.AuditFileSpool.sendEvent(AuditFileSpool.java:880)
        at org.apache.ranger.audit.queue.AuditFileSpool.runLogAudit(AuditFileSpool.java:819)
        at org.apache.ranger.audit.queue.AuditFileSpool.run(AuditFileSpool.java:758)
        at java.lang.Thread.run(Thread.java:745)
2018-12-17 20:55:17,803 ERROR provider.BaseAuditHandler (BaseAuditHandler.java:logError(329)) - Error writing to log file.
org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.ipc.StandbyException): Operation category READ is not supported in state standby
        at org.apache.hadoop.hdfs.server.namenode.ha.StandbyState.checkOperation(StandbyState.java:87)
        at org.apache.hadoop.hdfs.server.namenode.NameNode$NameNodeHAContext.checkOperation(NameNode.java:1979)
        at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.checkOperation(FSNamesystem.java:1345)
        at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getFileInfo(FSNamesystem.java:3967)
        at org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.getFileInfo(NameNodeRpcServer.java:1130)
        at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.getFileInfo(ClientNamenodeProtocolServerSideTranslatorPB.java:851)
        at org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java)
        at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:640)
        at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:982)
        at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2313)
        at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2309)
        at java.security.AccessController.doPrivileged(Native Method)
        at javax.security.auth.Subject.doAs(Subject.java:422)
        at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1724)
        at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2307)


        at org.apache.hadoop.ipc.Client.getRpcResponse(Client.java:1552)
        at org.apache.hadoop.ipc.Client.call(Client.java:1496)
        at org.apache.hadoop.ipc.Client.call(Client.java:1396)
        at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:233)
        at com.sun.proxy.$Proxy33.getFileInfo(Unknown Source)
        at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolTranslatorPB.getFileInfo(ClientNamenodeProtocolTranslatorPB.java:816)
        at sun.reflect.GeneratedMethodAccessor271.invoke(Unknown Source)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
        at java.lang.reflect.Method.invoke(Method.java:498)
        at org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:278)
        at org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:194)
        at org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:176)
        at com.sun.proxy.$Proxy34.getFileInfo(Unknown Source)
        at org.apache.hadoop.hdfs.DFSClient.getFileInfo(DFSClient.java:2158)
        at org.apache.hadoop.hdfs.DistributedFileSystem$25.doCall(DistributedFileSystem.java:1423)
        at org.apache.hadoop.hdfs.DistributedFileSystem$25.doCall(DistributedFileSystem.java:1419)
        at org.apache.hadoop.fs.FileSystemLinkResolver.resolve(FileSystemLinkResolver.java:81)
        at org.apache.hadoop.hdfs.DistributedFileSystem.getFileStatus(DistributedFileSystem.java:1419)
        at org.apache.hadoop.fs.FileSystem.exists(FileSystem.java:1443)
        at org.apache.ranger.audit.destination.HDFSAuditDestination.getLogFileStream(HDFSAuditDestination.java:273)
        at org.apache.ranger.audit.destination.HDFSAuditDestination.access$000(HDFSAuditDestination.java:44)
        at org.apache.ranger.audit.destination.HDFSAuditDestination$1.run(HDFSAuditDestination.java:159)
        at org.apache.ranger.audit.destination.HDFSAuditDestination$1.run(HDFSAuditDestination.java:156)
        at java.security.AccessController.doPrivileged(Native Method)
        at javax.security.auth.Subject.doAs(Subject.java:422)
        at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1724)
        at org.apache.ranger.audit.destination.HDFSAuditDestination.logJSON(HDFSAuditDestination.java:170)
        at org.apache.ranger.audit.queue.AuditFileSpool.sendEvent(AuditFileSpool.java:880)
        at org.apache.ranger.audit.queue.AuditFileSpool.runLogAudit(AuditFileSpool.java:819)
        at org.apache.ranger.audit.queue.AuditFileSpool.run(AuditFileSpool.java:758)
        at java.lang.Thread.run(Thread.java:745)
2018-12-17 20:55:17,803 ERROR queue.AuditFileSpool (AuditFileSpool.java:logError(710)) - Error sending logs to consumer. provider=hdfs.async.multi_dest.batch, consumer=hdfs.async.multi_dest.batch.hdfs
2018-12-17 20:55:17,804 INFO  queue.AuditFileSpool (AuditFileSpool.java:runLogAudit(770)) - Destination is down. sleeping for 30000 milli seconds. indexQueue=2, queueName=hdfs.async.multi_dest.batch, consumer=hdfs.async.multi_dest.batch.hdfs
2018-12-17 20:55:20,744 INFO  BlockStateChange (BlockManager.java:computeReplicationWorkForBlocks(1588)) - BLOCK* neededReplications = 0, pendingReplications = 0.
2018-12-17 20:55:23,744 INFO  BlockStateChange (BlockManager.java:computeReplicationWorkForBlocks(1588)) - BLOCK* neededReplications = 0, pendingReplications = 0.





2018-12-17 20:56:26,779 INFO  BlockStateChange (BlockManager.java:computeReplicationWorkForBlocks(1588)) - BLOCK* neededReplications = 0, pendingReplications = 0.
2018-12-17 20:56:29,779 INFO  BlockStateChange (BlockManager.java:computeReplicationWorkForBlocks(1588)) - BLOCK* neededReplications = 0, pendingReplications = 0.
2018-12-17 20:56:30,179 ERROR client.RangerAdminRESTClient (RangerAdminRESTClient.java:getServicePoliciesIfUpdated(124)) - Error getting policies. secureMode=true, user=nn/prdhdpmn2.example.com@example.COM (auth:KERBEROS), response={"httpStatusCode":401,"statusCode":0}, serviceName=ABC_hadoop
2018-12-17 20:56:30,179 ERROR util.PolicyRefresher (PolicyRefresher.java:loadPolicyfromPolicyAdmin(255)) - PolicyRefresher(serviceName=ABC_hadoop): failed to refresh policies. Will continue to use last known version of policies (222)
java.lang.Exception: HTTP 401
        at org.apache.ranger.admin.client.RangerAdminRESTClient.getServicePoliciesIfUpdated(RangerAdminRESTClient.java:126)
        at org.apache.ranger.plugin.util.PolicyRefresher.loadPolicyfromPolicyAdmin(PolicyRefresher.java:232)
        at org.apache.ranger.plugin.util.PolicyRefresher.loadPolicy(PolicyRefresher.java:188)
        at org.apache.ranger.plugin.util.PolicyRefresher.run(PolicyRefresher.java:158)
2018-12-17 20:56:32,780 INFO  BlockStateChange (BlockManager.java:computeReplicationWorkForBlocks(1588)) - BLOCK* neededReplications = 0, pendingReplications = 0.
2018-12-17 20:56:35,780 INFO  BlockStateChange (BlockManager.java:computeReplicationWorkForBlocks(1588)) - BLOCK* neededReplications = 0, pendingReplications = 0.
2018-12-17 20:56:38,780 INFO  BlockStateChange (BlockManager.java:computeReplicationWorkForBlocks(1588)) - BLOCK* neededReplications = 0, pendingReplications = 0.
2018-12-17 20:56:38,901 INFO  ipc.Server (Server.java:saslProcess(1538)) - Auth successful for jhs/prdhdpmn1.example.com@example.COM (auth:KERBEROS)
2018-12-17 20:56:38,902 INFO  authorize.ServiceAuthorizationManager (ServiceAuthorizationManager.java:authorize(137)) - Authorization successful for jhs/prdhdpmn1.example.com@example.COM (auth:KERBEROS) for protocol=interface org.apache.hadoop.hdfs.protocol.ClientProtocol
2018-12-17 20:56:41,780 INFO  BlockStateChange (BlockManager.java:computeReplicationWorkForBlocks(1588)) - BLOCK* neededReplications = 0, pendingReplications = 0.
2018-12-17 20:56:42,500 INFO  ipc.Server (Server.java:saslProcess(1538)) - Auth successful for ambari-qa-ABC@example.COM (auth:KERBEROS)
2018-12-17 20:56:42,501 INFO  authorize.ServiceAuthorizationManager (ServiceAuthorizationManager.java:authorize(137)) - Authorization successful for ambari-qa-ABC@example.COM (auth:KERBEROS) for protocol=interface org.apache.hadoop.hdfs.protocol.ClientProtocol
2018-12-17 20:56:42,574 INFO  namenode.FSEditLog (FSEditLog.java:printStatistics(716)) - Number of transactions: 2 Total time for transactions(ms): 2 Number of transactions batched in Syncs: 0 Number of syncs: 1 SyncTimes(ms): 2 5
2018-12-17 20:56:43,687 INFO  ipc.Server (Server.java:saslProcess(1538)) - Auth successful for user1@example.COM (auth:KERBEROS)
2018-12-17 20:56:43,687 INFO  authorize.ServiceAuthorizationManager (ServiceAuthorizationManager.java:authorize(137)) - Authorization successful for user1@example.COM (auth:KERBEROS) for protocol=interface org.apache.hadoop.hdfs.protocol.ClientProtocol
2018-12-17 20:56:44,781 INFO  BlockStateChange (BlockManager.java:computeReplicationWorkForBlocks(1588)) - BLOCK* neededReplications = 0, pendingReplications = 0.
2018-12-17 20:56:47,781 INFO  BlockStateChange (BlockManager.java:computeReplicationWorkForBlocks(1588)) - BLOCK* neededReplications = 0, pendingReplications = 0.
2018-12-17 20:56:48,574 INFO  client.QuorumJournalManager (QuorumCall.java:waitFor(136)) - Waited 6001 ms (timeout=20000 ms) for a response for sendEdits. No responses yet.
2018-12-17 20:56:49,576 INFO  client.QuorumJournalManager (QuorumCall.java:waitFor(136)) - Waited 7002 ms (timeout=20000 ms) for a response for sendEdits. No responses yet.






2018-12-17 20:57:00,587 WARN  client.QuorumJournalManager (QuorumCall.java:waitFor(134)) - Waited 18013 ms (timeout=20000 ms) for a response for sendEdits. No responses yet.
2018-12-17 20:57:01,587 WARN  client.QuorumJournalManager (QuorumCall.java:waitFor(134)) - Waited 19014 ms (timeout=20000 ms) for a response for sendEdits. No responses yet.
2018-12-17 20:57:02,574 FATAL namenode.FSEditLog (JournalSet.java:mapJournalsAndReportErrors(398)) - Error: flush failed for required journal (JournalAndStream(mgr=QJM to [<JN1_IP>:8485, <JN2_IP>:8485, <JN2_IP>:8485], stream=QuorumOutputStream starting at txid 62507384))
java.io.IOException: Timed out waiting 20000ms for a quorum of nodes to respond.
        at org.apache.hadoop.hdfs.qjournal.client.AsyncLoggerSet.waitForWriteQuorum(AsyncLoggerSet.java:137)
        at org.apache.hadoop.hdfs.qjournal.client.QuorumOutputStream.flushAndSync(QuorumOutputStream.java:107)
        at org.apache.hadoop.hdfs.server.namenode.EditLogOutputStream.flush(EditLogOutputStream.java:113)
        at org.apache.hadoop.hdfs.server.namenode.EditLogOutputStream.flush(EditLogOutputStream.java:107)
        at org.apache.hadoop.hdfs.server.namenode.JournalSet$JournalSetOutputStream$8.apply(JournalSet.java:533)
        at org.apache.hadoop.hdfs.server.namenode.JournalSet.mapJournalsAndReportErrors(JournalSet.java:393)
        at org.apache.hadoop.hdfs.server.namenode.JournalSet.access$100(JournalSet.java:57)
        at org.apache.hadoop.hdfs.server.namenode.JournalSet$JournalSetOutputStream.flush(JournalSet.java:529)
        at org.apache.hadoop.hdfs.server.namenode.FSEditLog.logSync(FSEditLog.java:654)
        at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.mkdirs(FSNamesystem.java:4018)
        at org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.mkdirs(NameNodeRpcServer.java:1102)
        at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.mkdirs(ClientNamenodeProtocolServerSideTranslatorPB.java:630)
        at org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java)
        at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:640)
        at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:982)
  at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2313)
        at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2309)
        at java.security.AccessController.doPrivileged(Native Method)
        at javax.security.auth.Subject.doAs(Subject.java:422)
        at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1724)
        at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2307)
2018-12-17 20:57:02,574 WARN  client.QuorumJournalManager (QuorumOutputStream.java:abort(72)) - Aborting QuorumOutputStream starting at txid 62507384
2018-12-17 20:57:02,587 INFO  util.ExitUtil (ExitUtil.java:terminate(124)) - Exiting with status 1
2018-12-17 20:57:02,592 INFO  provider.AuditProviderFactory (AuditProviderFactory.java:run(501)) - ==> JVMShutdownHook.run()
2018-12-17 20:57:02,592 INFO  provider.AuditProviderFactory (AuditProviderFactory.java:run(502)) - JVMShutdownHook: Signalling async audit cleanup to start.
2018-12-17 20:57:02,592 INFO  provider.AuditProviderFactory (AuditProviderFactory.java:run(477)) - RangerAsyncAuditCleanup: Starting cleanup
2018-12-17 20:57:02,592 INFO  provider.AuditProviderFactory (AuditProviderFactory.java:run(506)) - JVMShutdownHook: Waiting up to 30 seconds for audit cleanup to finish.
2018-12-17 20:57:02,593 INFO  queue.AuditBatchQueue (AuditBatchQueue.java:runLogAudit(281)) - Caught exception in consumer thread. Shutdown might be in progress
2018-12-17 20:57:02,593 INFO  queue.AuditBatchQueue (AuditBatchQueue.java:runLogAudit(332)) - Queue is not empty. Will retry. queue.size)=838857, localBatchBuffer.size()=0
2018-12-17 20:57:02,605 INFO  queue.AuditBatchQueue (AuditBatchQueue.java:runLogAudit(332)) - Queue is not empty. Will retry. queue.size)=837857, localBatchBuffer.size()=0

2018-12-17 20:57:12,583 INFO  queue.AuditBatchQueue (AuditBatchQueue.java:runLogAudit(332)) - Queue is not empty. Will retry. queue.size)=11859, localBatchBuffer.size()=0
2018-12-17 20:57:12,592 WARN  util.ShutdownHookManager (ShutdownHookManager.java:run(70)) - ShutdownHook 'JVMShutdownHook' timeout, java.util.concurrent.TimeoutException
java.util.concurrent.TimeoutException
        at java.util.concurrent.FutureTask.get(FutureTask.java:205)
        at org.apache.hadoop.util.ShutdownHookManager$1.run(ShutdownHookManager.java:67)
2018-12-17 20:57:12,592 INFO  provider.AuditProviderFactory (AuditProviderFactory.java:run(514)) - JVMShutdownHook: Interrupted while waiting for completion of Async executor!
java.lang.InterruptedException
        at java.util.concurrent.locks.AbstractQueuedSynchronizer.doAcquireSharedNanos(AbstractQueuedSynchronizer.java:1039)
        at java.util.concurrent.locks.AbstractQueuedSynchronizer.tryAcquireSharedNanos(AbstractQueuedSynchronizer.java:1328)
        at java.util.concurrent.Semaphore.tryAcquire(Semaphore.java:409)
        at org.apache.ranger.audit.provider.AuditProviderFactory$JVMShutdownHook.run(AuditProviderFactory.java:507)
        at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
        at java.util.concurrent.FutureTask.run(FutureTask.java:266)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
        at java.lang.Thread.run(Thread.java:745)
2018-12-17 20:57:12,592 INFO  provider.AuditProviderFactory (AuditProviderFactory.java:run(516)) - JVMShutdownHook: Interrupting ranger async audit cleanup thread
2018-12-17 20:57:12,592 INFO  provider.AuditProviderFactory (AuditProviderFactory.java:run(518)) - <== JVMShutdownHook.run()
2018-12-17 20:57:12,592 INFO  queue.AuditAsyncQueue (AuditAsyncQueue.java:stop(106)) - Stop called. name=hdfs.async
2018-12-17 20:57:12,593 INFO  queue.AuditAsyncQueue (AuditAsyncQueue.java:stop(110)) - Interrupting consumerThread. name=hdfs.async, consumer=hdfs.async.multi_dest
2018-12-17 20:57:12,593 INFO  provider.AuditProviderFactory (AuditProviderFactory.java:run(481)) - RangerAsyncAuditCleanup: Done cleanup
2018-12-17 20:57:12,593 INFO  provider.AuditProviderFactory (AuditProviderFactory.java:run(470)) - RangerAsyncAuditCleanup: Waiting to audit cleanup start signal
2018-12-17 20:57:12,593 INFO  queue.AuditAsyncQueue (AuditAsyncQueue.java:runLogAudit(155)) - Caught exception in consumer thread. Shutdown might be in progress
2018-12-17 20:57:12,596 INFO  queue.AuditAsyncQueue (AuditAsyncQueue.java:runLogAudit(171)) - Exiting polling loop. name=hdfs.async
2018-12-17 20:57:12,596 INFO  queue.AuditAsyncQueue (AuditAsyncQueue.java:runLogAudit(175)) - Calling to stop consumer. name=hdfs.async, consumer.name=hdfs.async.multi_dest
2018-12-17 20:57:12,596 INFO  namenode.NameNode (LogAdapter.java:info(47)) - SHUTDOWN_MSG:





There were no error in journal node logs.

We could not identify why the QJM failed to get the repose from the journal nodes which caused this failure.

Is there any way to find the root case of this failure ?

4 REPLIES 4

avatar
Explorer

Hi ,

Please check whether connectivity between Name-node and Journal-node is fine?

Try to check heap usage of Journal-node , as your Journal-nodes are way busy.

avatar
Contributor

We did not find any logs about connectivity issue between Name-node and Journal-Node. Every thing fine after restarting the name node services.

Where do I find heap usage by journal-node during this time period?

avatar
New Contributor

Even , we are facing the same issue. Both the Namenodes got failed on both the cluster. 

Any idea,what is going wrong ?

avatar
Moderator

Hello @Atradius ,

 

thank you for reaching out to the Community with your issue of having both NameNodes down.

 

Do you see in your NN log entries like JvmPauseMonitor saying "Detected pause in JVM or host machine" and a value larger than 1000ms, please? It can be an indication that your service is running out of heap. If it is the NameNode, the short-term solution is to increase the heap and restart the service. A long term solution is to identify why did you run out of heap? E.g. do you face with small files issue? Please read article [1] about how to tackle this.

 

Losing quorum might be caused by ZK service issue, when the ZK is not in quorum. Please check the ZK logs as well.

 

Please let us know if you need more input to progress with your investigation.

 

Best regards:

Ferenc

 

[1] https://blog.cloudera.com/small-files-big-foils-addressing-the-associated-metadata-and-application-c...


Ferenc Erdelyi, Technical Solutions Manager

Was your question answered? Make sure to mark the answer as the accepted solution.
If you find a reply useful, say thanks by clicking on the thumbs up button.

Learn more about the Cloudera Community: