
How to recover the standby NameNode in an Ambari cluster


We have an Ambari cluster running Hadoop 2.6.

We have 3 master machines and 2 workers (all of them on Red Hat 7).

We noticed that the standby NameNode on master01 is stopped (from the Ambari GUI),

and when we start the standby NameNode, it fails as follows:

tail -f /var/log/hadoop/hdfs/hadoop-hdfs-namenode-master01.com.log
ERROR namenode.NameNode (NameNode.java:main(1774)) - Failed to start namenode.
org.apache.hadoop.hdfs.server.namenode.EditLogInputException: Error replaying edit log at offset 0. Expected transaction ID was 13361263
The recommendation is to perform the following steps in order to start the standby NameNode on the master01 machine (on master02 the NameNode is running OK); an annotated sketch of the same sequence follows the list:

1. su - hdfs
2. hdfs dfsadmin -safemode leave ( on master02 )
3. cp -rp /hadoop/hdfs/journal/hdfsha/current /hadoop/hdfs/journal/hdfsha/current.orig ( on master02 )
4. rm -f /hadoop/hdfs/journal/hdfsha/current/* ( on master02 )
5. hdfs namenode -bootstrapStandby ( on master01 )
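For clarity, here is the same recommendation as one annotated sequence. Paths and hostnames are exactly those from the list above; this is only a sketch of those steps, not a verified procedure, and the current.orig copy is the only backup of the edits being removed:

# on master02 (the NameNode that is still running OK)
[root@master02 ~]# su - hdfs
[hdfs@master02 ~]$ hdfs dfsadmin -safemode leave        # make sure the running NameNode is not in safe mode
[hdfs@master02 ~]$ cp -rp /hadoop/hdfs/journal/hdfsha/current /hadoop/hdfs/journal/hdfsha/current.orig    # back up the JournalNode edits before touching them
[hdfs@master02 ~]$ rm -f /hadoop/hdfs/journal/hdfsha/current/*    # remove the edits that the standby fails to replay
# on master01 (the broken standby)
[root@master01 ~]# su - hdfs
[hdfs@master01 ~]$ hdfs namenode -bootstrapStandby      # re-seed the standby's namespace from the active NameNode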


Because this is a production system, we need to know whether steps 1-5 are risky, or whether they are safe given that we back up the current folder.


Second:

In which cases can we use the following command, and is it useful in our case?

hadoop namenode -recover 
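For context, a rough sketch of how NameNode recovery mode is usually run: the NameNode on that host must be stopped first, the command is interactive, and it may skip or discard corrupt edit-log transactions, so it can lose recent changes and is generally a last resort compared to bootstrapStandby. Hostnames here follow this thread:

[root@master01 ~]# su - hdfs
[hdfs@master01 ~]$ hdfs namenode -recover     # same as the deprecated "hadoop namenode -recover"; answer the interactive prompts
# afterwards, start the NameNode again from Ambari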


Michael-Bronson
16 REPLIES

Super Mentor

@uri ben-ari
I think we can try this:

1). First, check on master02 (which is currently Active) that safe mode is OFF (and that it is running fine without any issues/errors). It will also be good to do a "saveNamespace".

[root@test1 ~]# su - hdfs 

[hdfs@test1 ~]$ hdfs dfsadmin -safemode get
Safe mode is OFF in blueprint.example.com/172.36.130.138:8020
Safe mode is OFF in blueprint1.example.com/172.36.130.139:8020
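A hedged sketch of the "saveNamespace" step mentioned in 1): saveNamespace requires safe mode (see the -saveNamespace description further down in this thread), so it is usually wrapped like this; test1 is just the example host from above:

[hdfs@test1 ~]$ hdfs dfsadmin -safemode enter     # saveNamespace only works while in safe mode
[hdfs@test1 ~]$ hdfs dfsadmin -saveNamespace      # write a fresh fsimage checkpoint
[hdfs@test1 ~]$ hdfs dfsadmin -safemode leave     # return to normal operation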



2). If the Active NameNode is not in safe mode, then we can simply run the bootstrap command on the Standby NameNode (master01) and then try starting it.

[root@test2 ~]# su - hdfs
[hdfs@test2 ~]$ hdfs namenode -bootstrapStandby
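Once the bootstrap finishes and the standby NameNode is started again (normally from the Ambari UI), the HA state can be checked. A hedged sketch, assuming the usual NameNode IDs nn1/nn2 (the real IDs are whatever dfs.ha.namenodes.<nameservice> lists):

[hdfs@test2 ~]$ hdfs haadmin -getServiceState nn1     # expect one NameNode to report "active"
[hdfs@test2 ~]$ hdfs haadmin -getServiceState nn2     # and the other to report "standby"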

.

When we run it on the active node (which is now master03), we get this:

master03 ~]$ hdfs dfsadmin -safemode get
safemode: Call From master03.fg.com/10.14.28.18 to master01.fg.com:8020 failed on connection exception: java.net.ConnectException: Connection refused; For more details see: http://wiki.apache.org/hadoop/ConnectionRefused
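This looks like it happens because, with HA configured, "hdfs dfsadmin -safemode get" queries every configured NameNode (as in the two-line output shown earlier), and the NameNode on master01 is down. A hedged sketch of querying only the running NameNode by passing -fs, using the hostname from the error message:

[hdfs@master03 ~]$ hdfs dfsadmin -fs hdfs://master03.fg.com:8020 -safemode get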

Michael-Bronson

And when I run hdfs namenode -bootstrapStandby (on the standby NameNode), we get this question: "Re-format filesystem in Storage Directory /var/hadoop/hdfs/namenode ? (Y or N)". Can I type Y for all the questions?

Michael-Bronson

Hi Jay, that was a typo; I wrote master02 instead of master03.

Michael-Bronson

Super Mentor

@uri ben-ari

Before that, we should do a "saveNamespace" on the working Active NameNode.

# hdfs dfsadmin -fs hdfs://<hostname>:8020 -saveNamespace         (on good NameNode ) 

.

And then yes, we need to enter 'Y' when asked.

# hdfs namenode -bootstrapStandby ( on the standby name node )
Re-format filesystem in Storage Directory /var/hadoop/hdfs/namenode ? (Y or N)   Y

.

Reference :

https://community.hortonworks.com/content/supportkb/48989/how-to-bootstrap-standby-namenode.html

Super Mentor

@uri ben-ari
-saveNamespace => Save current namespace into storage directories and reset edits log. Requires safe mode.
-bootstrapStandby => Allows the standby NameNode’s storage directories to be bootstrapped by copying the latest namespace snapshot from the active NameNode.

Hi Jay, on the active NameNode we run:

hdfs dfsadmin -fs hdfs://master02:8020 -saveNamespace
saveNamespace: Safe mode should be turned ON in order to create namespace image

(so in the next step we turn it ON)

Michael-Bronson

hdfs dfsadmin -safemode enter
safemode: Call From master02.pp.com/10.14.38.18 to master01.pp.com:8020 failed on connection exception: java.net.ConnectException: Connection refused; For more details see: http://wiki.apache.org/hadoop/ConnectionRefused

(Why do we get this?)
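As before, a plain "-safemode enter" tries to reach every configured NameNode, including the one that is down, which is where the connection-refused comes from. A hedged sketch of running the whole saveNamespace sequence against only the active NameNode (master03 per the later comments; the hostname is assumed from this thread):

[hdfs@master03 ~]$ hdfs dfsadmin -fs hdfs://master03.pp.com:8020 -safemode enter
[hdfs@master03 ~]$ hdfs dfsadmin -fs hdfs://master03.pp.com:8020 -saveNamespace
[hdfs@master03 ~]$ hdfs dfsadmin -fs hdfs://master03.pp.com:8020 -safemode leave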

Michael-Bronson

Super Mentor

@uri ben-ari

I am really confused between master01, master02, master03.

I was thinking master02 --> is currently Active.

master01 --> Was Standby.

But one of your previous comments says the active node is now master03:

when we run it on the active node (now it is master03) we get this -
master03 ~]$ hdfs dfsadmin -safemode get
safemode: Call From master03.fg.com/10.14.28.18 to master01.fg.com:8020 failed on connection exception:

.

How can NameNode HA involve 3 NameNodes (masters)?

Hi Jay, yes, the active NameNode is now master03 and not master02. To summarize: master01 is the standby and master03 is the active.

Michael-Bronson

Hi Jay, if you want, I can open another question and summarize all the details so it will be clearer.

Michael-Bronson

Super Mentor

@uri ben-ari

I got the current scenario:

master03 => Active NN

master01 => StandBy NN

But the strange thing I see in your output is that it still shows "master02" and "master01", which I was not expecting. How come "master02" is listed there?

hdfs dfsadmin -safemode enter 
safemode: Call From master02.pp.com/10.14.38.18 to master01.pp.com:8020 failed

.

Are you sure that you are running the command from the correct hosts (not from master02, as it currently does not have any NameNode)?
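A hedged way to double-check, from the host you are actually on, which machines the client configuration thinks are NameNodes:

[hdfs@master03 ~]$ hdfs getconf -namenodes     # prints the NameNode hostnames taken from the local hdfs-site.xml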

Super Mentor

@uri ben-ari

At this point I am not sure why it shows "master02.pp.com/10.14.38.18".

But it would be good to take a look at your "core-site.xml" and "hdfs-site.xml" to see if they have any incorrect entries.
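A hedged sketch of the HA-related keys worth checking (the nameservice name "hdfsha" is taken from the journal path earlier in this thread, and nn1/nn2 are just the usual NameNode IDs; substitute whatever dfs.ha.namenodes actually returns):

[hdfs@master03 ~]$ hdfs getconf -confKey fs.defaultFS
[hdfs@master03 ~]$ hdfs getconf -confKey dfs.nameservices
[hdfs@master03 ~]$ hdfs getconf -confKey dfs.ha.namenodes.hdfsha
[hdfs@master03 ~]$ hdfs getconf -confKey dfs.namenode.rpc-address.hdfsha.nn1
[hdfs@master03 ~]$ hdfs getconf -confKey dfs.namenode.rpc-address.hdfsha.nn2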

Yes, I am sure that I am running it on master03 (the active), and the standby is master01; master02 does not have a NameNode. (I just changed the IPs and domain names because I am not allowed to post the real info.)

Michael-Bronson

Jay, let me know if you need more info.

Michael-Bronson

Hi Jay, that was a typo; I wrote master02 instead of master03.

Michael-Bronson