Member since: 01-25-2017
Posts: 119
Kudos Received: 7
Solutions: 2
My Accepted Solutions

Title | Views | Posted |
---|---|---|
 | 14096 | 04-11-2017 12:36 PM |
 | 4327 | 01-18-2017 10:36 AM |
04-11-2017 12:36 PM
Apparently my previous answer was just a workaround. After the format everything looked fine, but an alert definition in Ambari that checks for checkpoints started firing. The alert description said that no checkpoint had been made for a long time (it alerts after 6 hours by default, but in my case it was more than 60 hours). So what happened was: I formatted the namenodes, but the QJM was not informed of it. Thus the txid on the QJM kept its old value while the namenodes had a different one. As a result the QJM did not accept new edits from the active namenode, and the standby namenode couldn't get any edits from the QJM. After some more searching on Google I found this helpful article: https://nicholasmaillard.wordpress.com/2015/07/20/formatting-hdfs/ It describes the right steps to format HDFS on an HA-enabled cluster:
"The initial steps are very close:
- Stop the HDFS service
- Start only the journal nodes (as they will need to be made aware of the formatting)
- On the first namenode (as user hdfs):
  hadoop namenode -format
  hdfs namenode -initializeSharedEdits -force (for the journal nodes)
  hdfs zkfc -formatZK -force (to force ZooKeeper to reinitialise)
  restart that first namenode
- On the second namenode:
  hdfs namenode -bootstrapStandby -force (force sync with the first namenode)
- On every datanode, clear the data directory
- Restart the HDFS service"

These steps helped me overcome the issue.
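Condensing that into the order I actually ran things, as a sketch (assuming two namenodes nn1 and nn2, commands run as the hdfs user on the hosts named in the comments):

# on nn1, after stopping HDFS and starting only the journal nodes
hadoop namenode -format
hdfs namenode -initializeSharedEdits -force   # make the journal nodes aware of the new namespace
hdfs zkfc -formatZK -force                    # force ZooKeeper to reinitialise the HA state
# restart nn1, then on nn2
hdfs namenode -bootstrapStandby -force        # force a sync with the freshly formatted nn1
# finally, clear the data directory on every datanode and restart the HDFS service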
04-07-2017 12:42 PM
Now I am able to start all services. I checked the namenode files under the namenode data directory (<dfs.namenode.name.dir> in hdfs-site.xml; it can also be found on the Configs tab if you are using Ambari). I compared the file sizes and modification dates of the fsimage and edit log files on both namenode hosts. As far as I know, the hdfs command runs through the HDFS client and works fine regardless of whether it is on the same host as a namenode or not; still, I had run the format command on the host where one of the namenodes is installed. After checking the namenode files, I realised that the format had not affected the other namenode. So I ran the command on that node as well and then started the services. This way both namenodes started.
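For anyone wanting to repeat that check, a minimal sketch (the directory below is just an example; substitute your own dfs.namenode.name.dir value):

ls -lh /hadoop/hdfs/namenode/current | grep -E 'fsimage|edits'   # run on both namenode hosts and compare sizes and dates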
04-07-2017 11:14 AM
It seems like all the numbers are messed up now. While I'm not sure, I think the epoch numbers are: nn1: 21, nn2: 13, QJM: 0. I think I executed the format command at a very critical moment, and now there are multiple txids. The journal nodes say:

Client trying to move committed txid backward from 1865764 to 1

One of the namenodes says:

2017-04-07 13:54:05,483 FATAL namenode.FSEditLog (JournalSet.java:mapJournalsAndReportErrors(398)) - Error: recoverUnfinalizedSegments failed for required journal (JournalAndStream(mgr=QJM to [10.0.109.20:8485, 10.0.109.21:8485, 10.0.109.22:8485], stream=null))
java.lang.AssertionError: Decided to synchronize log to startTxId: 1
endTxId: 1
isInProgress: true
but logger 10.0.109.20:8485 had seen txid 1865764 committed

I need you guys' help. Since the cluster does not have any data yet, any solution, including formatting any data and/or resetting any module, is fine.
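In case it helps with diagnosis: the journal nodes persist their view of the epoch in small files under the journal directory. A rough sketch, assuming an example journal directory and nameservice name:

cat /hadoop/hdfs/journal/mycluster/current/last-promised-epoch   # highest epoch this JN has promised to
cat /hadoop/hdfs/journal/mycluster/current/last-writer-epoch     # epoch of the last writer this JN accepted
ls /hadoop/hdfs/journal/mycluster/current | tail                 # the highest edits_* segment hints at the last txid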
04-06-2017 08:54 AM
Hi, in my HA cluster both of my namenodes fail to start.

CentOS 7.3, Ambari 2.4.2, HDP 2.5.3

Ambari stderr:

2017-04-06 10:49:49,039 - Getting jmx metrics from NN failed. URL: http://master02.mydomain.local:50070/jmx?qry=Hadoop:service=NameNode,name=FSNamesystem
Traceback (most recent call last):
File "/usr/lib/python2.6/site-packages/resource_management/libraries/functions/jmx.py", line 38, in get_value_from_jmx
_, data, _ = get_user_call_output(cmd, user=run_user, quiet=False)
File "/usr/lib/python2.6/site-packages/resource_management/libraries/functions/get_user_call_output.py", line 61, in get_user_call_output
raise ExecutionFailed(err_msg, code, files_output[0], files_output[1])
ExecutionFailed: Execution of 'curl -s 'http://master02.mydomain.local:50070/jmx?qry=Hadoop:service=NameNode,name=FSNamesystem' 1>/tmp/tmp0CNZmD 2>/tmp/tmpRAZgwz' returned 7.
2017-04-06 10:49:51,041 - Getting jmx metrics from NN failed. URL: http://master03.mydomain.local:50070/jmx?qry=Hadoop:service=NameNode,name=FSNamesystem
Traceback (most recent call last):
File "/usr/lib/python2.6/site-packages/resource_management/libraries/functions/jmx.py", line 38, in get_value_from_jmx
_, data, _ = get_user_call_output(cmd, user=run_user, quiet=False)
File "/usr/lib/python2.6/site-packages/resource_management/libraries/functions/get_user_call_output.py", line 61, in get_user_call_output
raise ExecutionFailed(err_msg, code, files_output[0], files_output[1])
ExecutionFailed: Execution of 'curl -s 'http://master03.mydomain.local:50070/jmx?qry=Hadoop:service=NameNode,name=FSNamesystem' 1>/tmp/tmp_hLNY7 2>/tmp/tmpoCOTt8' returned 7.
...
(tries several times and then)
...
Traceback (most recent call last):
File "/var/lib/ambari-agent/cache/common-services/HDFS/2.1.0.2.0/package/scripts/namenode.py", line 420, in <module>
NameNode().execute()
File "/usr/lib/python2.6/site-packages/resource_management/libraries/script/script.py", line 280, in execute
method(env)
File "/var/lib/ambari-agent/cache/common-services/HDFS/2.1.0.2.0/package/scripts/namenode.py", line 101, in start
upgrade_suspended=params.upgrade_suspended, env=env)
File "/usr/lib/python2.6/site-packages/ambari_commons/os_family_impl.py", line 89, in thunk
return fn(*args, **kwargs)
File "/var/lib/ambari-agent/cache/common-services/HDFS/2.1.0.2.0/package/scripts/hdfs_namenode.py", line 184, in namenode
if is_this_namenode_active() is False:
File "/usr/lib/python2.6/site-packages/resource_management/libraries/functions/decorator.py", line 55, in wrapper
return function(*args, **kwargs)
File "/var/lib/ambari-agent/cache/common-services/HDFS/2.1.0.2.0/package/scripts/hdfs_namenode.py", line 562, in is_this_namenode_active
raise Fail(format("The NameNode {namenode_id} is not listed as Active or Standby, waiting..."))
resource_management.core.exceptions.Fail: The NameNode nn1 is not listed as Active or Standby, waiting...
Ambari stdout: 2017-04-06 10:53:20,521 - call returned (255, '17/04/06 10:53:20 INFO ipc.Client: Retrying connect to server: master03.mydomain.local/10.0.109.21:8020. Already tried 0 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=1, sleepTime=1000 MILLISECONDS)\n17/04/06 10:53:20 WARN ipc.Client: Failed to connect to server: master03.mydomain.local/10.0.109.21:8020: retries get failed due to exceeded maximum allowed retries number: 1
2017-04-06 10:53:20,522 - No active NameNode was found after 5 retries. Will return current NameNode HA states
Namenode log:

2017-04-06 10:11:43,561 FATAL Error: recoverUnfinalizedSegments failed for required journal (JournalAndStream(mgr=QJM to [10.0.109.20:8485, 10.0.109.21:8485, 10.0.109.22:8485], stream=null))
java.lang.AssertionError: Decided to synchronize log to startTxId: 1 endTxId: 1 isInProgress: true
but logger 10.0.109.20:8485 had seen txid 1865764 committed
at org.apache.hadoop.hdfs.qjournal.client.QuorumJournalManager.recoverUnclosedSegment(QuorumJournalManager.java:336)
at (some class at some other class at ...)

Some more logs from the namenode(s):

2017-04-05 17:13:42,894 WARN client.QuorumJournalManager (IPCLoggerChannel.java:call(388)) - Remote journal 10.0.109.22:8485 failed to write txns 1865765-1865765. Will try to write to this JN again after the next log roll.
org.apache.hadoop.ipc.RemoteException(java.io.IOException): IPC's epoch 13 is not the current writer epoch 0
at org.apache.hadoop.hdfs.qjournal.server.Journal.checkWriteRequest(Journal.java:459)
at org.apache.hadoop.hdfs.qjournal.server.Journal.journal(Journal.java:351)
at org.apache.hadoop.hdfs.qjournal.server.JournalNodeRpcServer.journal(JournalNodeRpcServer.java:152)
at org.apache.hadoop.hdfs.qjournal.protocolPB.QJournalProtocolServerSideTranslatorPB.journal(QJournalProtocolServerSideTranslatorPB.java:158)
2017-04-06 10:11:42,380 INFO ipc.Server (Server.java:logException(2401)) - IPC Server handler 72 on 8020, call org.apache.hadoop.hdfs.server.protocol.DatanodeProtocol.sendHeartbeat from 9.1.10.14:37173 Call#2322 Retry#0
org.apache.hadoop.ipc.RetriableException: NameNode still not started
at org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.checkNNStartup(NameNodeRpcServer.java:2057)
at org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.sendHeartbeat(NameNodeRpcServer.java:1414)
at org.apache.hadoop.hdfs.protocolPB.DatanodeProtocolServerSideTranslatorPB.sendHeartbeat(DatanodeProtocolServerSideTranslatorPB.java:118)
at org.apache.hadoop.hdfs.protocol.proto.DatanodeProtocolProtos$DatanodeProtocolService$2.callBlockingMethod(DatanodeProtocolProtos.java:29064)
at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:640)
at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:982)
at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2313)
at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2309)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:422)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1724)
at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2307)
2017-04-06 10:11:42,390 INFO namenode.NameNode (NameNode.java:startCommonServices(876)) - NameNode RPC up at: bigm02.etstur.local/9.1.10.21:8020
2017-04-06 10:11:42,391 INFO namenode.FSNamesystem (FSNamesystem.java:startStandbyServices(1286)) - Starting services required for standby state
2017-04-06 10:11:42,393 INFO ha.EditLogTailer (EditLogTailer.java:<init>(117)) - Will roll logs on active node at bigm03.etstur.local/9.1.10.22:8020 every 120 seconds.
2017-04-06 10:11:42,397 INFO ha.StandbyCheckpointer (StandbyCheckpointer.java:start(129)) - Starting standby checkpoint thread...
Checkpointing active NN at http://bigm03.etstur.local:50070
Serving checkpoints at http://bigm02.etstur.local:50070
2017-04-06 10:11:43,371 INFO namenode.FSNamesystem (FSNamesystem.java:stopStandbyServices(1329)) - Stopping services started for standby state
2017-04-06 10:11:43,372 WARN ha.EditLogTailer (EditLogTailer.java:doWork(349)) - Edit log tailer interrupted
java.lang.InterruptedException: sleep interrupted
at java.lang.Thread.sleep(Native Method)
at org.apache.hadoop.hdfs.server.namenode.ha.EditLogTailer$EditLogTailerThread.doWork(EditLogTailer.java:347)
at org.apache.hadoop.hdfs.server.namenode.ha.EditLogTailer$EditLogTailerThread.access$200(EditLogTailer.java:284)
at org.apache.hadoop.hdfs.server.namenode.ha.EditLogTailer$EditLogTailerThread$1.run(EditLogTailer.java:301)
at org.apache.hadoop.security.SecurityUtil.doAsLoginUserOrFatal(SecurityUtil.java:476)
at org.apache.hadoop.hdfs.server.namenode.ha.EditLogTailer$EditLogTailerThread.run(EditLogTailer.java:297)
2017-04-06 10:11:43,475 INFO namenode.FSNamesystem (FSNamesystem.java:startActiveServices(1130)) - Starting services required for active state
2017-04-06 10:11:43,485 INFO client.QuorumJournalManager (QuorumJournalManager.java:recoverUnfinalizedSegments(435)) - Starting recovery process for unclosed journal segments...
2017-04-06 10:11:43,534 INFO client.QuorumJournalManager (QuorumJournalManager.java:recoverUnfinalizedSegments(437)) - Successfully started new epoch 17
2017-04-06 10:11:43,535 INFO client.QuorumJournalManager (QuorumJournalManager.java:recoverUnclosedSegment(263)) - Beginning recovery of unclosed segment starting at txid 1
2017-04-06 10:11:43,557 INFO client.QuorumJournalManager (QuorumJournalManager.java:recoverUnclosedSegment(272)) - Recovery prepare phase complete. Responses:
9.1.10.20:8485: segmentState { startTxId: 1 endTxId: 1 isInProgress: true } lastWriterEpoch: 14 lastCommittedTxId: 1865764
9.1.10.21:8485: segmentState { startTxId: 1 endTxId: 1 isInProgress: true } lastWriterEpoch: 14 lastCommittedTxId: 1865764
2017-04-06 10:11:43,560 INFO client.QuorumJournalManager (QuorumJournalManager.java:recoverUnclosedSegment(296)) - Using longest log: 9.1.10.20:8485=segmentState {
startTxId: 1
endTxId: 1
isInProgress: true
}
lastWriterEpoch: 14
lastCommittedTxId: 1865764
2017-04-06 10:11:43,561 FATAL namenode.FSEditLog (JournalSet.java:mapJournalsAndReportErrors(398)) - Error: recoverUnfinalizedSegments failed for required journal (JournalAndStream(mgr=QJM to [9.1.10.20:8485, 9.1.10.21:8485, 9.1.10.22:8485], stream=null))
java.lang.AssertionError: Decided to synchronize log to startTxId: 1
endTxId: 1
isInProgress: true
but logger 9.1.10.20:8485 had seen txid 1865764 committed
at org.apache.hadoop.hdfs.qjournal.client.QuorumJournalManager.recoverUnclosedSegment(QuorumJournalManager.java:336)
at org.apache.hadoop.hdfs.qjournal.client.QuorumJournalManager.recoverUnfinalizedSegments(QuorumJournalManager.java:455)
at org.apache.hadoop.hdfs.server.namenode.JournalSet$8.apply(JournalSet.java:624)
at org.apache.hadoop.hdfs.server.namenode.JournalSet.mapJournalsAndReportErrors(JournalSet.java:393)
at org.apache.hadoop.hdfs.server.namenode.JournalSet.recoverUnfinalizedSegments(JournalSet.java:621)
at org.apache.hadoop.hdfs.server.namenode.FSEditLog.recoverUnclosedStreams(FSEditLog.java:1459)
at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.startActiveServices(FSNamesystem.java:1139)
at org.apache.hadoop.hdfs.server.namenode.NameNode$NameNodeHAContext.startActiveServices(NameNode.java:1915)
at org.apache.hadoop.hdfs.server.namenode.ha.ActiveState.enterState(ActiveState.java:61)
at org.apache.hadoop.hdfs.server.namenode.ha.HAState.setStateInternal(HAState.java:64)
at org.apache.hadoop.hdfs.server.namenode.ha.StandbyState.setState(StandbyState.java:49)
at org.apache.hadoop.hdfs.server.namenode.NameNode.transitionToActive(NameNode.java:1783)
at org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.transitionToActive(NameNodeRpcServer.java:1631)
at org.apache.hadoop.ha.protocolPB.HAServiceProtocolServerSideTranslatorPB.transitionToActive(HAServiceProtocolServerSideTranslatorPB.java:107)
at org.apache.hadoop.ha.proto.HAServiceProtocolProtos$HAServiceProtocolService$2.callBlockingMethod(HAServiceProtocolProtos.java:4460)
at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:640)
at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:982)
at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2313)
at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2309)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:422)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1724)
at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2307)
2017-04-06 10:11:43,562 INFO util.ExitUtil (ExitUtil.java:terminate(124)) - Exiting with status 1
2017-04-06 10:11:43,563 INFO namenode.NameNode (LogAdapter.java:info(47)) - SHUTDOWN_MSG:
/************************************************************
SHUTDOWN_MSG: Shutting down NameNode at bigm02.etstur.local/9.1.10.21
************************************************************/
And although the journal nodes started successfully, they have the following error, which could also be considered suspicious:

2017-04-05 17:15:05,653 ERROR RECEIVED SIGNAL 15: SIGTERM

A funny thing is the order of execution. There are the ZKFailoverController processes, the journal nodes, and the namenodes. I always thought the namenode was the last one to start, and that which one would be active was already determined by the ZKFailoverController. As I follow the logs, both namenodes check each other. What is this for? Even the ZKFailoverController wants to reach the namenodes before it starts itself.

And in case you want to know, the background of this error is as follows. Yesterday I noticed that one of the datanodes had failed and stopped. There were the following errors in its logs:

2017-04-05 15:50:11,168 ERROR datanode.DataNode (BPServiceActor.java:run(752)) - Initialization failed for Block pool <registering> (Datanode Uuid be2286f5-00d7-4758-b89a-45e2304cabe3) service to master02.mydomain.local/10.0.109.23:8020. Exiting.
java.io.IOException: All specified directories are failed to load.
at org.apache.hadoop.hdfs.server.datanode.DataStorage.recoverTransitionRead(DataStorage.java:596)
at org.apache.hadoop.hdfs.server.datanode.DataNode.initStorage(DataNode.java:1483)
at org.apache.hadoop.hdfs.server.datanode.DataNode.initBlockPool(DataNode.java:1448)
at org.apache.hadoop.hdfs.server.datanode.BPOfferService.verifyAndSetNamespaceInfo(BPOfferService.java:319)
at org.apache.hadoop.hdfs.server.datanode.BPServiceActor.connectToNNAndHandshake(BPServiceActor.java:267)
at org.apache.hadoop.hdfs.server.datanode.BPServiceActor.run(BPServiceActor.java:740)
at java.lang.Thread.run(Thread.java:745)
2017-04-05 15:50:11,168 ERROR datanode.DataNode (BPServiceActor.java:run(752)) - Initialization failed for Block pool <registering> (Datanode Uuid be2286f5-00d7-4758-b89a-45e2304cabe3) service to master02.mydomain.local/10.0.109.23:8020. Exiting.
org.apache.hadoop.util.DiskChecker$DiskErrorException: Too many failed volumes - current valid volumes: 13, volumes configured: 14, volumes failed: 1, volume failures tolerated: 0
2017-04-05 17:15:36,968 INFO common.Storage (Storage.java:tryLock(774)) - Lock on /grid/13/hadoop/hdfs/data/in_use.lock acquired by nodename 31353@data02.mydomain.local

Although I was seeing volume errors, I was able to browse /grid/13/, so I wanted to try the answers in this Stack Overflow question: http://stackoverflow.com/questions/22316187/datanode-not-starts-correctly First I deleted the data folder under /grid/13/hadoop/hdfs (/grid/13/hadoop/hdfs/data) and tried to start the datanode. It failed again with the same errors, so I went on with a namenode format. My cluster is new and empty, so I am fine with any solution including formats:

./hdfs namenode -format -clusterId <myClusterId>

After this format, one of the namenodes failed. I tried to restart all HDFS services, and then both namenodes failed. Any comments appreciated.
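In case anyone wants to reproduce the format step: the clusterId to pass to -format can be read from the VERSION file of the existing metadata. A minimal sketch with example paths (use your own dfs.namenode.name.dir and data directories):

grep clusterID /hadoop/hdfs/namenode/current/VERSION       # on a namenode
grep clusterID /grid/13/hadoop/hdfs/data/current/VERSION   # on a datanode; it must match the namenode's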
Labels: Apache Hadoop
04-04-2017 07:43 AM
The answer appeared in the second comment. Thanks @Arpit Agarwal! By the way, the port number is 1022, in /etc/hadoop/conf/hdfs-site.xml:

<property>
  <name>dfs.datanode.http.address</name>
  <value>0.0.0.0:1022</value>
</property>

Yes, my environment is Kerberos-enabled. The address was already shown on the Namenode UI > Datanodes tab > "Http Address" column.
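A quick sanity check that the pages are served on that port (the host name below is just an example):

curl http://data01.mydomain.local:1022/jmx   # datanode JMX metrics, on the port from dfs.datanode.http.address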
04-03-2017 06:45 PM
Hi @Arpit Agarwal, thank you for your answer. Actually, you are right; I checked on a friend's installation and the page is there. Unfortunately that URL is not available in my environment:

This site can't be reached. 10.0.109.23 refused to connect. ERR_CONNECTION_REFUSED

I realised an abnormal situation in my environment: I have two datanode processes running (-Dproc_datanode), one started by root and the other by hdfs. It should be hdfs, so I killed the one started by root. I expected to see the page, but it wasn't there. I checked the service in Ambari, and it seems Ambari manages the process started by root: the service showed as stopped. I need to investigate more to see what is going on 😕
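For reference, the duplicate processes and their owners can be listed like this (a generic sketch):

ps -ef | grep -- '-Dproc_datanode' | grep -v grep   # one line per datanode JVM; the first column is the owning user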
04-03-2017 01:44 PM
Hi, are the JMX info pages of datanodes deprecated? I remember there used to be pages at http://<datanode>:50070/ and I see similar tips for the datanode port (50075) on this page: https://www.datadoghq.com/blog/collecting-hadoop-metrics/ On HDP 2.5.3 I can't see them. Do I remember wrong? Is it something I can enable or disable?
Labels: Apache Hadoop
03-20-2017 06:46 AM
It worked! It was the upper case! Thank you @Robert Levas!
03-16-2017 01:28 PM
A follow-up question... Reverting the AD realm name seems to be a long and risky process. Can anybody confirm, @Robert Levas? Does anybody have comments on this? The thing is, I didn't see this mentioned in any guide; I see it here for the first time. I even read in a Hortonworks blog that host names and domain names should be in lower case: https://hortonworks.com/blog/enabling-kerberos-hdp-active-directory-integration/

"DOMAIN_REALM provides a translation from a domain name or hostname to a Kerberos realm name. The tag name can be a host name, or a domain name, where domain names are indicated by a prefix of a period (.). Host names and domain names should be in lower case."
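To make the quoted passage concrete, a typical [domain_realm] section in krb5.conf looks like the sketch below (the names are placeholders): the domain names on the left are lower case, while the realm name on the right stays upper case:

[domain_realm]
  .mydomain.local = MYDOMAIN.LOCAL
  mydomain.local = MYDOMAIN.LOCAL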
03-15-2017 08:02 PM
@Robert Levas, thanks for the answer. I'll try that tomorrow morning (in 12 hours). There was a setting before starting kerberization, under the Advanced kerberos-env section: "Enable case insensitive username rules". I had checked this one, so I thought it would be enough. Isn't it? If not, does it mean we cannot connect an HDP cluster to Active Directory if the AD realm was defined with lower-case letters?