Created 04-06-2017 08:54 AM
Hi,
In my HA cluster, both of my namenodes fail to start.
Centos 7.3
Ambari 2.4.2
HDP 2.5.3
Ambari stderr:
2017-04-06 10:49:49,039 - Getting jmx metrics from NN failed. URL: http://master02.mydomain.local:50070/jmx?qry=Hadoop:service=NameNode,name=FSNamesystem
Traceback (most recent call last):
  File "/usr/lib/python2.6/site-packages/resource_management/libraries/functions/jmx.py", line 38, in get_value_from_jmx
    _, data, _ = get_user_call_output(cmd, user=run_user, quiet=False)
  File "/usr/lib/python2.6/site-packages/resource_management/libraries/functions/get_user_call_output.py", line 61, in get_user_call_output
    raise ExecutionFailed(err_msg, code, files_output[0], files_output[1])
ExecutionFailed: Execution of 'curl -s 'http://master02.mydomain.local:50070/jmx?qry=Hadoop:service=NameNode,name=FSNamesystem' 1>/tmp/tmp0CNZmD 2>/tmp/tmpRAZgwz' returned 7.

2017-04-06 10:49:51,041 - Getting jmx metrics from NN failed. URL: http://master03.mydomain.local:50070/jmx?qry=Hadoop:service=NameNode,name=FSNamesystem
Traceback (most recent call last):
  File "/usr/lib/python2.6/site-packages/resource_management/libraries/functions/jmx.py", line 38, in get_value_from_jmx
    _, data, _ = get_user_call_output(cmd, user=run_user, quiet=False)
  File "/usr/lib/python2.6/site-packages/resource_management/libraries/functions/get_user_call_output.py", line 61, in get_user_call_output
    raise ExecutionFailed(err_msg, code, files_output[0], files_output[1])
ExecutionFailed: Execution of 'curl -s 'http://master03.mydomain.local:50070/jmx?qry=Hadoop:service=NameNode,name=FSNamesystem' 1>/tmp/tmp_hLNY7 2>/tmp/tmpoCOTt8' returned 7.

... (tries several times and then) ...

Traceback (most recent call last):
  File "/var/lib/ambari-agent/cache/common-services/HDFS/2.1.0.2.0/package/scripts/namenode.py", line 420, in <module>
    NameNode().execute()
  File "/usr/lib/python2.6/site-packages/resource_management/libraries/script/script.py", line 280, in execute
    method(env)
  File "/var/lib/ambari-agent/cache/common-services/HDFS/2.1.0.2.0/package/scripts/namenode.py", line 101, in start
    upgrade_suspended=params.upgrade_suspended, env=env)
  File "/usr/lib/python2.6/site-packages/ambari_commons/os_family_impl.py", line 89, in thunk
    return fn(*args, **kwargs)
  File "/var/lib/ambari-agent/cache/common-services/HDFS/2.1.0.2.0/package/scripts/hdfs_namenode.py", line 184, in namenode
    if is_this_namenode_active() is False:
  File "/usr/lib/python2.6/site-packages/resource_management/libraries/functions/decorator.py", line 55, in wrapper
    return function(*args, **kwargs)
  File "/var/lib/ambari-agent/cache/common-services/HDFS/2.1.0.2.0/package/scripts/hdfs_namenode.py", line 562, in is_this_namenode_active
    raise Fail(format("The NameNode {namenode_id} is not listed as Active or Standby, waiting..."))
resource_management.core.exceptions.Fail: The NameNode nn1 is not listed as Active or Standby, waiting...
Ambari stdout:
2017-04-06 10:53:20,521 - call returned (255, '17/04/06 10:53:20 INFO ipc.Client: Retrying connect to server: master03.mydomain.local/10.0.109.21:8020. Already tried 0 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=1, sleepTime=1000 MILLISECONDS)\n17/04/06 10:53:20 WARN ipc.Client: Failed to connect to server: master03.mydomain.local/10.0.109.21:8020: retries get failed due to exceeded maximum allowed retries number: 1
2017-04-06 10:53:20,522 - No active NameNode was found after 5 retries. Will return current NameNode HA states
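For what it's worth, curl exit code 7 means it could not connect at all, i.e. nothing was listening on port 50070 yet. The HA state Ambari is looking for can also be checked by hand (a minimal sketch, assuming the NameNode IDs are nn1 and nn2 as in the error above):

# run as the hdfs user on any cluster node
hdfs haadmin -getServiceState nn1
hdfs haadmin -getServiceState nn2
# each call prints "active" or "standby"; it fails while the NameNode RPC port (8020) is still down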
Namenode log:
2017-04-06 10:11:43,561 FATAL Error: recoverUnfinalizedSegments failed for required journal (JournalAndStream(mgr=QJM to [10.0.109.20:8485, 10.0.109.21:8485, 10.0.109.22:8485], stream=null))
java.lang.AssertionError: Decided to synchronize log to startTxId: 1 endTxId: 1 isInProgress: true but logger 10.0.109.20:8485 had seen txid 1865764 committed
    at org.apache.hadoop.hdfs.qjournal.client.QuorumJournalManager.recoverUnclosedSegment(QuorumJournalManager.java:336)
    at (some class at some other class at ...)
Some more logs from Namenode(s):
2017-04-05 17:13:42,894 WARN client.QuorumJournalManager (IPCLoggerChannel.java:call(388)) - Remote journal 10.0.109.22:8485 failed to write txns 1865765-1865765. Will try to write to this JN again after the next log roll.
org.apache.hadoop.ipc.RemoteException(java.io.IOException): IPC's epoch 13 is not the current writer epoch 0
    at org.apache.hadoop.hdfs.qjournal.server.Journal.checkWriteRequest(Journal.java:459)
    at org.apache.hadoop.hdfs.qjournal.server.Journal.journal(Journal.java:351)
    at org.apache.hadoop.hdfs.qjournal.server.JournalNodeRpcServer.journal(JournalNodeRpcServer.java:152)
    at org.apache.hadoop.hdfs.qjournal.protocolPB.QJournalProtocolServerSideTranslatorPB.journal(QJournalProtocolServerSideTranslatorPB.java:158)

2017-04-06 10:11:42,380 INFO ipc.Server (Server.java:logException(2401)) - IPC Server handler 72 on 8020, call org.apache.hadoop.hdfs.server.protocol.DatanodeProtocol.sendHeartbeat from 9.1.10.14:37173 Call#2322 Retry#0
org.apache.hadoop.ipc.RetriableException: NameNode still not started
    at org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.checkNNStartup(NameNodeRpcServer.java:2057)
    at org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.sendHeartbeat(NameNodeRpcServer.java:1414)
    at org.apache.hadoop.hdfs.protocolPB.DatanodeProtocolServerSideTranslatorPB.sendHeartbeat(DatanodeProtocolServerSideTranslatorPB.java:118)
    at org.apache.hadoop.hdfs.protocol.proto.DatanodeProtocolProtos$DatanodeProtocolService$2.callBlockingMethod(DatanodeProtocolProtos.java:29064)
    at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:640)
    at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:982)
    at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2313)
    at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2309)
    at java.security.AccessController.doPrivileged(Native Method)
    at javax.security.auth.Subject.doAs(Subject.java:422)
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1724)
    at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2307)

2017-04-06 10:11:42,390 INFO namenode.NameNode (NameNode.java:startCommonServices(876)) - NameNode RPC up at: bigm02.etstur.local/9.1.10.21:8020
2017-04-06 10:11:42,391 INFO namenode.FSNamesystem (FSNamesystem.java:startStandbyServices(1286)) - Starting services required for standby state
2017-04-06 10:11:42,393 INFO ha.EditLogTailer (EditLogTailer.java:<init>(117)) - Will roll logs on active node at bigm03.etstur.local/9.1.10.22:8020 every 120 seconds.
2017-04-06 10:11:42,397 INFO ha.StandbyCheckpointer (StandbyCheckpointer.java:start(129)) - Starting standby checkpoint thread...
Checkpointing active NN at http://bigm03.etstur.local:50070
Serving checkpoints at http://bigm02.etstur.local:50070
2017-04-06 10:11:43,371 INFO namenode.FSNamesystem (FSNamesystem.java:stopStandbyServices(1329)) - Stopping services started for standby state
2017-04-06 10:11:43,372 WARN ha.EditLogTailer (EditLogTailer.java:doWork(349)) - Edit log tailer interrupted
java.lang.InterruptedException: sleep interrupted
    at java.lang.Thread.sleep(Native Method)
    at org.apache.hadoop.hdfs.server.namenode.ha.EditLogTailer$EditLogTailerThread.doWork(EditLogTailer.java:347)
    at org.apache.hadoop.hdfs.server.namenode.ha.EditLogTailer$EditLogTailerThread.access$200(EditLogTailer.java:284)
    at org.apache.hadoop.hdfs.server.namenode.ha.EditLogTailer$EditLogTailerThread$1.run(EditLogTailer.java:301)
    at org.apache.hadoop.security.SecurityUtil.doAsLoginUserOrFatal(SecurityUtil.java:476)
    at org.apache.hadoop.hdfs.server.namenode.ha.EditLogTailer$EditLogTailerThread.run(EditLogTailer.java:297)
2017-04-06 10:11:43,475 INFO namenode.FSNamesystem (FSNamesystem.java:startActiveServices(1130)) - Starting services required for active state
2017-04-06 10:11:43,485 INFO client.QuorumJournalManager (QuorumJournalManager.java:recoverUnfinalizedSegments(435)) - Starting recovery process for unclosed journal segments...
2017-04-06 10:11:43,534 INFO client.QuorumJournalManager (QuorumJournalManager.java:recoverUnfinalizedSegments(437)) - Successfully started new epoch 17
2017-04-06 10:11:43,535 INFO client.QuorumJournalManager (QuorumJournalManager.java:recoverUnclosedSegment(263)) - Beginning recovery of unclosed segment starting at txid 1
2017-04-06 10:11:43,557 INFO client.QuorumJournalManager (QuorumJournalManager.java:recoverUnclosedSegment(272)) - Recovery prepare phase complete.
Responses:
9.1.10.20:8485: segmentState { startTxId: 1 endTxId: 1 isInProgress: true } lastWriterEpoch: 14 lastCommittedTxId: 1865764
9.1.10.21:8485: segmentState { startTxId: 1 endTxId: 1 isInProgress: true } lastWriterEpoch: 14 lastCommittedTxId: 1865764
2017-04-06 10:11:43,560 INFO client.QuorumJournalManager (QuorumJournalManager.java:recoverUnclosedSegment(296)) - Using longest log: 9.1.10.20:8485=segmentState { startTxId: 1 endTxId: 1 isInProgress: true } lastWriterEpoch: 14 lastCommittedTxId: 1865764
2017-04-06 10:11:43,561 FATAL namenode.FSEditLog (JournalSet.java:mapJournalsAndReportErrors(398)) - Error: recoverUnfinalizedSegments failed for required journal (JournalAndStream(mgr=QJM to [9.1.10.20:8485, 9.1.10.21:8485, 9.1.10.22:8485], stream=null))
java.lang.AssertionError: Decided to synchronize log to startTxId: 1 endTxId: 1 isInProgress: true but logger 9.1.10.20:8485 had seen txid 1865764 committed
    at org.apache.hadoop.hdfs.qjournal.client.QuorumJournalManager.recoverUnclosedSegment(QuorumJournalManager.java:336)
    at org.apache.hadoop.hdfs.qjournal.client.QuorumJournalManager.recoverUnfinalizedSegments(QuorumJournalManager.java:455)
    at org.apache.hadoop.hdfs.server.namenode.JournalSet$8.apply(JournalSet.java:624)
    at org.apache.hadoop.hdfs.server.namenode.JournalSet.mapJournalsAndReportErrors(JournalSet.java:393)
    at org.apache.hadoop.hdfs.server.namenode.JournalSet.recoverUnfinalizedSegments(JournalSet.java:621)
    at org.apache.hadoop.hdfs.server.namenode.FSEditLog.recoverUnclosedStreams(FSEditLog.java:1459)
    at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.startActiveServices(FSNamesystem.java:1139)
    at org.apache.hadoop.hdfs.server.namenode.NameNode$NameNodeHAContext.startActiveServices(NameNode.java:1915)
    at org.apache.hadoop.hdfs.server.namenode.ha.ActiveState.enterState(ActiveState.java:61)
    at org.apache.hadoop.hdfs.server.namenode.ha.HAState.setStateInternal(HAState.java:64)
    at org.apache.hadoop.hdfs.server.namenode.ha.StandbyState.setState(StandbyState.java:49)
    at org.apache.hadoop.hdfs.server.namenode.NameNode.transitionToActive(NameNode.java:1783)
    at org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.transitionToActive(NameNodeRpcServer.java:1631)
    at org.apache.hadoop.ha.protocolPB.HAServiceProtocolServerSideTranslatorPB.transitionToActive(HAServiceProtocolServerSideTranslatorPB.java:107)
    at org.apache.hadoop.ha.proto.HAServiceProtocolProtos$HAServiceProtocolService$2.callBlockingMethod(HAServiceProtocolProtos.java:4460)
    at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:640)
    at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:982)
    at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2313)
    at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2309)
    at java.security.AccessController.doPrivileged(Native Method)
    at javax.security.auth.Subject.doAs(Subject.java:422)
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1724)
    at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2307)
2017-04-06 10:11:43,562 INFO util.ExitUtil (ExitUtil.java:terminate(124)) - Exiting with status 1
2017-04-06 10:11:43,563 INFO namenode.NameNode (LogAdapter.java:info(47)) - SHUTDOWN_MSG:
/************************************************************
SHUTDOWN_MSG: Shutting down NameNode at bigm02.etstur.local/9.1.10.21
************************************************************/
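The interesting part is that the recovery segment only goes up to txid 1 while the JournalNodes report lastCommittedTxId 1865764. What each JournalNode has on disk can be inspected directly (a sketch; the journal path below is an assumption, use whatever dfs.journalnode.edits.dir plus your nameservice ID resolve to):

# on each JournalNode host
JN_DIR=/hadoop/hdfs/journal/mycluster/current   # assumed path: <dfs.journalnode.edits.dir>/<nameservice>/current
cat $JN_DIR/committed-txid        # last txid this JournalNode saw committed
cat $JN_DIR/last-writer-epoch     # epoch of the last NameNode that wrote to it
cat $JN_DIR/last-promised-epoch   # highest epoch this JournalNode has promised
ls -lt $JN_DIR | head             # edit segment files with sizes and timestamps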
And although the JournalNodes started successfully, they logged the following error, which also looks suspicious:
2017-04-05 17:15:05,653 ERROR RECEIVED SIGNAL 15: SIGTERM
The funny thing is the order of execution. There are the ZKFailoverController processes, the JournalNodes and the NameNodes. I always thought the NameNode was the last one to start, and that which of them would become active had already been decided by the ZKFailoverController. As I follow the logs, both NameNodes check each other. What is this for? Even the ZKFailoverController wants to reach the NameNodes before it has started itself.
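For what it's worth, which NameNode currently holds the active lock can be read straight out of ZooKeeper (a sketch; the zkCli.sh location and the nameservice ID mycluster are assumptions):

/usr/hdp/current/zookeeper-client/bin/zkCli.sh -server localhost:2181
# inside the ZooKeeper shell:
ls /hadoop-ha/mycluster                      # should list ActiveBreadCrumb and ActiveStandbyElectorLock
get /hadoop-ha/mycluster/ActiveBreadCrumb    # binary, but the hostname of the last active NameNode is readable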
And in case you want to know, the background of this error is as follows...
Yesterday I noticed that one of the DataNodes had failed and stopped. There were the following errors in its logs:
2017-04-05 15:50:11,168 ERROR datanode.DataNode (BPServiceActor.java:run(752)) - Initialization failed for Block pool <registering> (Datanode Uuid be2286f5-00d7-4758-b89a-45e2304cabe3) service to master02.mydomain.local/10.0.109.23:8020. Exiting.
java.io.IOException: All specified directories are failed to load.
    at org.apache.hadoop.hdfs.server.datanode.DataStorage.recoverTransitionRead(DataStorage.java:596)
    at org.apache.hadoop.hdfs.server.datanode.DataNode.initStorage(DataNode.java:1483)
    at org.apache.hadoop.hdfs.server.datanode.DataNode.initBlockPool(DataNode.java:1448)
    at org.apache.hadoop.hdfs.server.datanode.BPOfferService.verifyAndSetNamespaceInfo(BPOfferService.java:319)
    at org.apache.hadoop.hdfs.server.datanode.BPServiceActor.connectToNNAndHandshake(BPServiceActor.java:267)
    at org.apache.hadoop.hdfs.server.datanode.BPServiceActor.run(BPServiceActor.java:740)
    at java.lang.Thread.run(Thread.java:745)

2017-04-05 15:50:11,168 ERROR datanode.DataNode (BPServiceActor.java:run(752)) - Initialization failed for Block pool <registering> (Datanode Uuid be2286f5-00d7-4758-b89a-45e2304cabe3) service to master02.mydomain.local/10.0.109.23:8020. Exiting.
org.apache.hadoop.util.DiskChecker$DiskErrorException: Too many failed volumes - current valid volumes: 13, volumes configured: 14, volumes failed: 1, volume failures tolerated: 0

2017-04-05 17:15:36,968 INFO common.Storage (Storage.java:tryLock(774)) - Lock on /grid/13/hadoop/hdfs/data/in_use.lock acquired by nodename 31353@data02.mydomain.local
Despite the volume errors, I was able to browse /grid/13/.
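The "volume failures tolerated: 0" part means a single bad disk is enough to stop the whole DataNode. The relevant setting can be checked like this (a sketch; raising it means changing dfs.datanode.failed.volumes.tolerated in hdfs-site.xml, or in the HDFS configs in Ambari):

# number of failed volumes a DataNode tolerates before shutting itself down (default is 0)
hdfs getconf -confKey dfs.datanode.failed.volumes.tolerated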
So I wanted to try the answers in this Stack Overflow question:
http://stackoverflow.com/questions/22316187/datanode-not-starts-correctly
First I deleted the data folder under /grid/13/hadoop/hdfs (/grid/13/hadoop/hdfs/data) and tried to start the DataNode.
It failed again with the same errors, so I went on to format the NameNode. My cluster is new and empty, so I am fine with any solution, including formats:
./hdfs namenode -format -clusterId <myClusterId>
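In case it helps anyone, the <myClusterId> value can be read from the existing VERSION file before formatting (a sketch; the directory is an example, use your dfs.namenode.name.dir):

# on a NameNode host
grep clusterID /hadoop/hdfs/namenode/current/VERSION   # prints something like clusterID=CID-...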
After this format, one of the NameNodes failed. I tried to restart all HDFS services, and then both NameNodes failed.
Any comments appreciated.
Created 04-07-2017 11:14 AM
It seems like all the numbers are messed up now.
While I'm not sure, I think the epoch numbers are:
nn1: 21
nn2: 13
QJM: 0
I think I executed the format command at a very critical moment.
On top of that, there are conflicting txids.
The JournalNodes (QJM) say:
Client trying to move committed txid backward from 1865764 to 1
One of the NameNodes says:
2017-04-07 13:54:05,483 FATAL namenode.FSEditLog (JournalSet.java:mapJournalsAndReportErrors(398)) - Error: recoverUnfinalizedSegments failed for required journal (JournalAndStream(mgr=QJM to [10.0.109.20:8485, 10.0.109.21:8485, 10.0.109.22:8485], stream=null))
java.lang.AssertionError: Decided to synchronize log to startTxId: 1 endTxId: 1 isInProgress: true but logger 10.0.109:8485 had seen txid 1865764 committed
I need you guys' help.
Since the cluster does not have any data yet, any solution is fine, including formatting any data and/or resetting any module.
Created 04-07-2017 12:42 PM
Now, I am able to start all services.
I checked the NameNode files under the NameNode data directory (dfs.namenode.name.dir in hdfs-site.xml; it can also be found on the Configs tab if you are using Ambari).
I checked the file sizes and modification dates of the fsimage and edit log files on both NameNode hosts.
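Something like this on each NameNode host makes the comparison easy (a sketch; the directory is an example value of dfs.namenode.name.dir):

ls -lt /hadoop/hdfs/namenode/current | head -20   # fsimage_* and edits_* files with sizes and dates
cat /hadoop/hdfs/namenode/current/VERSION         # clusterID and namespaceID should match on both hosts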
As far as I know, the hdfs command runs through the HDFS client and works the same whether or not it is run on the same host as a NameNode. Even so, I had run the format command on a host where one of the NameNodes is installed.
After checking the NameNode files, I realised that the format had not affected the other NameNode.
So I ran the format command on that node as well and then started the services.
This way both NameNodes started.
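For the record, instead of running -format on the second NameNode, the usual way to bring it in line with a freshly formatted one is bootstrapStandby (a sketch, run as the hdfs user):

# on the NameNode that was NOT formatted, while the formatted one is already running
hdfs namenode -bootstrapStandby   # copies the new fsimage and cluster/namespace IDs from the other NameNode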
Created 04-11-2017 12:36 PM
Apparently my previous answer was just a workaround.
After the format it looked fine, but an alert definition in Ambari that checks for checkpoints started firing.
The alert description said that a checkpoint had not been made for a long time (it alerts after 6 hours by default, but in my case it had been more than 60 hours).
So what happened was: I formatted the NameNodes, but the QJM was never told about it. The txid on the QJM therefore still had the old value while the NameNodes had a different one. As a result, the QJM did not accept new edits from the active NameNode, and the standby NameNode couldn't get any edits from the QJM.
After some more searching on Google, I found the excellent article below:
https://nicholasmaillard.wordpress.com/2015/07/20/formatting-hdfs/
The article describes the right steps to format HDFS on an HA-enabled cluster (the quoted walkthrough begins "The initial steps are very close ...").
These steps helped me overcome the issue.
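For anyone hitting the same situation, here is the procedure in rough outline (my own paraphrase for a standard QJM setup, not the article's exact wording; the ordering of the manual service starts is the assumption here):

# 1. Stop all HDFS services except the JournalNodes (and keep ZooKeeper running)
# 2. Clear the old HA state in ZooKeeper
hdfs zkfc -formatZK
# 3. On the first NameNode: format; with QJM configured this also offers to re-format the shared edits on the JournalNodes
hdfs namenode -format
# 4. Start the first NameNode
# 5. On the second NameNode: pull the new fsimage and IDs from the first one
hdfs namenode -bootstrapStandby
# 6. Start the second NameNode, the ZKFCs and the DataNodes
#    (DataNodes that still hold blocks from the old namespace may need their data directories cleared too)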
Created 04-22-2021 11:57 PM
The solution provided works on a new environment.
If it is an existing environment, the issue can be addressed as below:
NN1 - started, but its state stays stuck in "starting" - leave it as is and go to NN2.
NN2 - start NN2 - starting it will fix the issue.
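Once both NameNodes are up, the state each one reports can be double-checked against the same JMX endpoint the Ambari scripts query (a sketch; adjust the hostnames):

curl -s 'http://master02.mydomain.local:50070/jmx?qry=Hadoop:service=NameNode,name=FSNamesystem' | grep 'tag.HAState'
curl -s 'http://master03.mydomain.local:50070/jmx?qry=Hadoop:service=NameNode,name=FSNamesystem' | grep 'tag.HAState'
# one should report "active" and the other "standby"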