Created on 10-24-2017 02:58 PM - edited 08-17-2019 06:15 PM
We have an Ambari cluster running Hadoop 2.6.
We have 3 master machines and 2 workers (all of them on Red Hat 7).
We noticed that the standby NameNode on master01 is stopped (according to the Ambari GUI),
and when we start the standby NameNode it fails as follows:
tail -f /var/log/hadoop/hdfs/hadoop-hdfs-namenode-master01.com.log
ERROR namenode.NameNode (NameNode.java:main(1774)) - Failed to start namenode.
org.apache.hadoop.hdfs.server.namenode.EditLogInputException: Error replaying edit log at offset 0. Expected transaction ID was 13361263
The recommendation is to run the following steps in order to start the standby NameNode on the master01 machine (on master02 the NameNode is running OK):
1. su - hdfs
2. hdfs dfsadmin -safemode leave ( on master02 )
3. cp -rp /hadoop/hdfs/journal/hdfsha/current /hadoop/hdfs/journal/hdfsha/current.orig ( on master02 )
4. rm -f /hadoop/hdfs/journal/hdfsha/current/* ( on master02 )
5. hdfs namenode -bootstrapStandby ( on master01 )
Because this is a production system, we need to know: are steps 1-5 risky, or are they safe because we back up the current folder first?
Second, would the following command be useful in our case?
hadoop namenode -recover
Created 10-25-2017 11:20 AM
@uri ben-ari
I think we can try this:
1). First check on master02 (which is currently Active) that safe mode is OFF (and that it is running fine without any issues/errors). It would also be good to do a "saveNamespace".
[root@test1 ~]# su - hdfs
[hdfs@test1 ~]$ hdfs dfsadmin -safemode get
Safe mode is OFF in blueprint.example.com/172.36.130.138:8020
Safe mode is OFF in blueprint1.example.com/172.36.130.139:8020
2). If the Active NameNode is not in safe mode, then we can simply run the bootstrap command on the Standby NameNode (master01) and then try starting it.
[root@test2 ~]# su - hdfs
[hdfs@test2 ~]$ hdfs namenode -bootstrapStandby
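The safe mode check in step 1 can be made scriptable. This is only a minimal sketch: the function name all_safemode_off is made up for illustration, and it assumes the output looks like the "Safe mode is OFF in ..." lines shown above. Any other line (for example a connection exception) is treated as "not OK".

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of Hadoop): reads the output of
# `hdfs dfsadmin -safemode get` from stdin and returns 0 only if there is
# at least one line and every non-empty line starts with "Safe mode is OFF".
all_safemode_off() {
  local line
  local seen=0
  while IFS= read -r line || [ -n "$line" ]; do
    [ -z "$line" ] && continue
    seen=1
    case "$line" in
      "Safe mode is OFF"*) ;;   # this NameNode reports safe mode OFF
      *) return 1 ;;            # ON, or an error line: do not proceed
    esac
  done
  [ "$seen" -eq 1 ]
}
```

Possible usage as the hdfs user: hdfs dfsadmin -safemode get 2>&1 | all_safemode_off && echo "safe mode is OFF everywhere"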
Created 10-25-2017 11:49 AM
When we run it on the active node (now it is master03) we get this:
master03 ~]$ hdfs dfsadmin -safemode get
safemode: Call From master03.fg.com/10.14.28.18 to master01.fg.com:8020 failed on connection exception: java.net.ConnectException: Connection refused; For more details see: http://wiki.apache.org/hadoop/ConnectionRefused
Created 10-25-2017 11:54 AM
And when I run hdfs namenode -bootstrapStandby (on the standby NameNode) we get this question:
Re-format filesystem in Storage Directory /var/hadoop/hdfs/namenode ? (Y or N)
Can I type Y on all the questions?
Created 10-26-2017 07:20 PM
Hi Jay, it was a typo: I wrote master02 instead of master03.
Created 10-26-2017 01:55 PM
Before that we should do "saveNamespace" on the Working Active NameNode.
# hdfs dfsadmin -fs hdfs://<hostname>:8020 -saveNamespace (on good NameNode )
And then yes, we need to enter 'Y' when asked.
# hdfs namenode -bootstrapStandby ( on the standby name node )
Storage Directory /var/hadoop/hdfs/namenode ? (Y or N) Y
Reference :
https://community.hortonworks.com/content/supportkb/48989/how-to-bootstrap-standby-namenode.html
Created 10-26-2017 01:57 PM
@uri ben-ari
-saveNamespace => Save current namespace into storage directories and reset edits log. Requires safe mode.
-bootstrapStandby => Allows the standby NameNode’s storage directories to be bootstrapped by copying the latest namespace snapshot from the active NameNode.
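Since -saveNamespace requires safe mode, one possible order on the active NameNode is: enter safe mode, save the namespace, leave safe mode. The sketch below only prints the commands by default. The helper names run and checkpoint_active, the DRY_RUN variable and the default port 8020 are assumptions for illustration, and the host argument must be your own active NameNode. Keep in mind that the namespace is read-only while safe mode is ON.

```shell
#!/usr/bin/env bash
# Hypothetical wrapper: print the command when DRY_RUN=1 (the default),
# execute it otherwise.
run() {
  if [ "${DRY_RUN:-1}" = "1" ]; then
    echo "+ $*"
  else
    "$@"
  fi
}

# Hypothetical helper: checkpoint the namespace on one specific NameNode.
# $1 = active NameNode host, $2 = RPC port (assumed 8020 if omitted).
checkpoint_active() {
  local fs="hdfs://$1:${2:-8020}"
  local rc=0
  # saveNamespace is refused unless safe mode is ON
  run hdfs dfsadmin -fs "$fs" -safemode enter || return 1
  run hdfs dfsadmin -fs "$fs" -saveNamespace || rc=$?
  # always try to leave safe mode again, even if saveNamespace failed
  run hdfs dfsadmin -fs "$fs" -safemode leave || rc=$?
  return "$rc"
}
```

Running checkpoint_active master03 prints the three commands for review. With DRY_RUN=0 they are really executed (as the hdfs user). After that, hdfs namenode -bootstrapStandby would be run on the standby NameNode.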
Created 10-26-2017 02:59 PM
Hi Jay, on the active NameNode we ran:
hdfs dfsadmin -fs hdfs://master02:8020 -saveNamespace
saveNamespace: Safe mode should be turned ON in order to create namespace image
(so in the next step we turn it ON)
Created 10-26-2017 03:30 PM
hdfs dfsadmin -safemode enter
safemode: Call From master02.pp.com/10.14.38.18 to master01.pp.com:8020 failed on connection exception: java.net.ConnectException: Connection refused; For more details see: http://wiki.apache.org/hadoop/ConnectionRefused
(Why do we get this?)
Created 10-26-2017 04:27 PM
I am really confused between master01, master02, master03.
I was thinking master02 --> is currently Active.
master01 --> Was Standby.
But one of your previous comments says "active node ( now it is master03":
when we run it on the active node ( now it is master03 we get this -
master03 ~]$ hdfs dfsadmin -safemode get
safemode: Call From master03.fg.com/10.14.28.18 to master01.fg.com:8020 failed on connection exception:
How can NameNode HA involve 3 NameNodes (masters)?
Created 10-26-2017 05:39 PM
Hi Jay, yes, the active NameNode is now master03 and not master02. To summarize: master01 is the standby and master03 is the active.
Created 10-26-2017 05:57 PM
Hi Jay, if you want I can open another question and summarize all the details so it will be clearer.
Created 10-26-2017 06:03 PM
I got the current scenario:
master03 => Active NN
master01 => StandBy NN
But the strange thing I see in your output is that it still shows "master02" & "master01", which I was not expecting. How come "master02" is listed there?
hdfs dfsadmin -safemode enter
safemode: Call From master02.pp.com/10.14.38.18 to master01.pp.com:8020 failed
Are you sure that you are running the command from the correct host (not from master02, which currently does not have any NameNode)?
Created 10-26-2017 06:49 PM
At this point I am not sure why it shows "master02.pp.com/10.14.38.18".
But it would be good to take a look at your "core-site.xml" and "hdfs-site.xml" to see if they have any incorrect entries.
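One quick way to cross-check which hosts the client configuration treats as NameNodes is to compare the output of hdfs getconf -namenodes with the host you expect. A minimal sketch, assuming that command prints the host names separated by whitespace; the function name is_configured_namenode is hypothetical and the host names are the masked ones from this thread.

```shell
#!/usr/bin/env bash
# Hypothetical helper: returns 0 if host $1 appears in the whitespace-separated
# list $2 (for example the output of `hdfs getconf -namenodes`).
is_configured_namenode() {
  local host="$1"
  local nn
  # $2 is intentionally unquoted so the shell splits it into host names
  for nn in $2; do
    [ "$nn" = "$host" ] && return 0
  done
  return 1
}
```

Possible usage: is_configured_namenode master02.pp.com "$(hdfs getconf -namenodes)" && echo "master02 is still configured as a NameNode"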
Created 10-26-2017 06:28 PM
Yes, I am sure that I am running it on master03 (the active) and the standby is master01; master02 does not have a NameNode. (I just changed the IPs and domain because I am not allowed to post the real info.)
Created 10-26-2017 06:47 PM
Jay, let me know if you need more info.
Created 10-26-2017 07:22 PM
Hi Jay, this was a typo: I wrote master02 instead of master03.