We have an Ambari cluster running Hadoop 2.6.
We have 3 master machines and 2 workers (all of them on Red Hat 7).
We noticed from the Ambari GUI that the standby NameNode on master01 is stopped,
and when we start the standby NameNode it fails as follows:
tail -f /var/log/hadoop/hdfs/hadoop-hdfs-namenode-master01.com.log

ERROR namenode.NameNode (NameNode.java:main(1774)) - Failed to start namenode.
org.apache.hadoop.hdfs.server.namenode.EditLogInputException: Error replaying edit log at offset 0. Expected transaction ID was 13361263
The recommendation is to perform the following steps in order to start the standby NameNode on the master01 machine (on master02 the NameNode is running fine):

1. su - hdfs
2. hdfs dfsadmin -safemode leave (on master02)
3. cp -rp /hadoop/hdfs/journal/hdfsha/current /hadoop/hdfs/journal/hdfsha/current.orig (on master02)
4. rm -f /hadoop/hdfs/journal/hdfsha/current/* (on master02)
5. hdfs namenode -bootstrapStandby (on master01)

Because this is a production system, we need to know whether steps 1-5 are risky, or safe because we back up the current folder first. Secondly, would the command hadoop namenode -recover be useful in our case?
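For caution on a production system, the five steps above can be collected into a small wrapper with a dry-run guard, so every command can be reviewed before anything is deleted. This is only a sketch: the hosts and the journal path are the ones quoted in this thread, and the commands must still be run on the right machines (steps 2-4 on master02, step 5 on master01).

```shell
#!/bin/sh
# Hypothetical wrapper around steps 1-5 above; hosts and paths are taken
# from this thread and must be adapted. DRY_RUN=1 (the default) only prints
# each command so the procedure can be reviewed before anything is deleted.
DRY_RUN="${DRY_RUN:-1}"
JOURNAL_DIR=/hadoop/hdfs/journal/hdfsha/current

run() {
  if [ "$DRY_RUN" = "1" ]; then
    echo "WOULD RUN: $*"
  else
    "$@"
  fi
}

# Steps 2-4, on master02 (the healthy NameNode), as the hdfs user:
run hdfs dfsadmin -safemode leave
run cp -rp "$JOURNAL_DIR" "$JOURNAL_DIR.orig"   # backup before deleting
run sh -c "rm -f $JOURNAL_DIR/*"

# Step 5, on master01 (the failed standby), as the hdfs user:
run hdfs namenode -bootstrapStandby
```

Running it as-is only prints the commands; setting DRY_RUN=0 executes them. The journal backup in step 3 is what the procedure relies on for safety, since current.orig can be copied back if the bootstrap fails.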
I think we can try this:
1). First check on master02 (which is currently Active) that safe mode is OFF (and that it is running fine without any issues/errors). It would also be good to do a "saveNamespace".
[root@test1 ~]# su - hdfs
[hdfs@test1 ~]$ hdfs dfsadmin -safemode get
Safe mode is OFF in blueprint.example.com/220.127.116.11:8020
Safe mode is OFF in blueprint1.example.com/18.104.22.168:8020
2). If the Active NameNode is not in Safe Mode, then we can simply run the bootstrap command on the Standby NameNode (master01) and then try starting it.
[root@test2 ~]# su - hdfs
[hdfs@test2 ~]$ hdfs namenode -bootstrapStandby
When we run it on the active node (now it is master03) we get this:

master03 ~]$ hdfs dfsadmin -safemode get
safemode: Call From master03.fg.com/10.14.28.18 to master01.fg.com:8020 failed on connection exception: java.net.ConnectException: Connection refused; For more details see: http://wiki.apache.org/hadoop/ConnectionRefused
And when I run hdfs namenode -bootstrapStandby (on the standby NameNode) we get this question: Re-format filesystem in Storage Directory /var/hadoop/hdfs/namenode ? (Y or N). Can I type Y to all the questions?
Before that we should do "saveNamespace" on the Working Active NameNode.
# hdfs dfsadmin -fs hdfs://<hostname>:8020 -saveNamespace (on good NameNode )
And then yes, we need to enter 'Y' when asked.
# hdfs namenode -bootstrapStandby (on the standby NameNode)
Re-format filesystem in Storage Directory /var/hadoop/hdfs/namenode ? (Y or N) Y
-saveNamespace => Save current namespace into storage directories and reset edits log. Requires safe mode.
-bootstrapStandby => Allows the standby NameNode’s storage directories to be bootstrapped by copying the latest namespace snapshot from the active NameNode.
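Putting the two descriptions above together, the order would be: enter safe mode, saveNamespace, leave safe mode on the active, then bootstrap the standby. The sketch below only prints the commands rather than running them, and master02 as the active host is an assumption carried over from the earlier comments, so substitute whichever host really holds the active NameNode.

```shell
#!/bin/sh
# Sketch of the checkpoint-then-bootstrap order described above.
# ACTIVE_NN is an assumption -- substitute the host that is actually active.
ACTIVE_NN="${ACTIVE_NN:-master02}"
FS="hdfs://${ACTIVE_NN}:8020"

plan() { echo "$@"; }   # print instead of execute

# On the active NameNode, as hdfs (saveNamespace requires safe mode):
plan hdfs dfsadmin -fs "$FS" -safemode enter
plan hdfs dfsadmin -fs "$FS" -saveNamespace
plan hdfs dfsadmin -fs "$FS" -safemode leave

# On the standby NameNode, as hdfs:
plan hdfs namenode -bootstrapStandby
```

Remembering to leave safe mode again after the saveNamespace matters on a production cluster, since writes are blocked while safe mode is on.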
Hi Jay - on the active NameNode we do:

hdfs dfsadmin -fs hdfs://master02:8020 -saveNamespace
saveNamespace: Safe mode should be turned ON in order to create namespace image

(so in the next step we turn it ON)
hdfs dfsadmin -safemode enter
safemode: Call From master02.pp.com/10.14.38.18 to master01.pp.com:8020 failed on connection exception: java.net.ConnectException: Connection refused; For more details see: http://wiki.apache.org/hadoop/ConnectionRefused

(why do we get this?)
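One likely explanation (my reading, not confirmed in this thread): in an HA setup, hdfs dfsadmin -safemode without an explicit -fs contacts every NameNode configured for the nameservice, including the stopped one on master01, so that call reports Connection refused even when the active NameNode is fine. Targeting only the live NameNode with -fs avoids the dead host. The sketch below just assembles those commands as strings; master03.pp.com is taken from this thread and must be adjusted.

```shell
#!/bin/sh
# Build the dfsadmin commands against only the live NameNode so the stopped
# standby on master01 is never contacted. Host is from this thread; adjust.
ACTIVE="master03.pp.com"
FS_URI="hdfs://${ACTIVE}:8020"
CMDS=$(for sub in "-safemode enter" "-saveNamespace" "-safemode leave"; do
  echo "hdfs dfsadmin -fs $FS_URI $sub"
done)
echo "$CMDS"
```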
I am really confused between master01, master02, and master03.
I was thinking master02 --> is currently Active.
master01 --> was Standby.
But one of your previous comments says "active node (now it is master03)":
when we run it on the active node (now it is master03) we get this:

master03 ~]$ hdfs dfsadmin -safemode get
safemode: Call From master03.fg.com/10.14.28.18 to master01.fg.com:8020 failed on connection exception:
How can NameNode HA involve 3 NameNodes (masters)?
Hi Jay, yes, now the active NameNode is master03 and not master02. Let me summarize: master01 is the standby and master03 is the active.
I got the current scenario:
master03 => Active NN
master01 => StandBy NN
But the strange thing I see in your output is that it still shows "master02" and "master01", which I was not expecting. How come "master02" is listed there?
hdfs dfsadmin -safemode enter
safemode: Call From master02.pp.com/10.14.38.18 to master01.pp.com:8020 failed
Are you sure that you are running the command from the correct host (not from master02, as currently it does not have any NameNode)?
At this point I am not sure why it shows "master02.pp.com/10.14.38.18".
But it would be good to take a look at your "core-site.xml" and "hdfs-site.xml" to see if they have any incorrect entries.
Yes, I am sure that I ran it on master03 (the active), and the standby is master01; master02 does not have a NameNode. (I just changed the IPs and domain names because I am not allowed to print the real info.)
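Following the suggestion to inspect core-site.xml and hdfs-site.xml, a quick way to see which hosts the client config still points at is to pull out the NameNode RPC addresses. The hdfs-site.xml sample below is fabricated for illustration (the hdfsha nameservice name comes from the journal path earlier in this thread); on the real cluster, point CONF at /etc/hadoop/conf/hdfs-site.xml and skip the heredoc.

```shell
#!/bin/sh
# Sanity-check which hosts the HA client config points at. The sample
# hdfs-site.xml below is fabricated for illustration; on the real cluster
# set CONF=/etc/hadoop/conf/hdfs-site.xml and do not overwrite it.
CONF="${CONF:-/tmp/hdfs-site.xml.sample}"
cat > "$CONF" <<'EOF'
<configuration>
  <property>
    <name>dfs.namenode.rpc-address.hdfsha.nn1</name>
    <value>master01.pp.com:8020</value>
  </property>
  <property>
    <name>dfs.namenode.rpc-address.hdfsha.nn2</name>
    <value>master02.pp.com:8020</value>
  </property>
</configuration>
EOF
# A stale entry like nn2 here would explain why the error output above
# still mentions master02 even though no NameNode runs there anymore.
NN_HOSTS=$(grep -A1 'rpc-address' "$CONF" | grep '<value>')
echo "$NN_HOSTS"
```

On a live cluster the same values can also be read with hdfs getconf -confKey dfs.namenode.rpc-address.&lt;nameservice&gt;.&lt;nnid&gt;. A leftover master02 entry would explain the unexpected hostnames in the errors.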