
Process to Start StandBy NameNode

Explorer

Hi everyone,

 

My system has 2 DataNodes, 2 NameNodes, 3 JournalNodes, and 3 ZooKeeper services.

 

I have configured the NameNode cluster OK. When browsing the admin page at namenode:50070, I can see one NameNode with status (active) and one NameNode with status (standby). => OK

 

When I stop the active NameNode, the standby one becomes active. => OK

 

But the problem is: how do I start the NameNode that I stopped again?

 

I do the following:

 

sudo -u hdfs hdfs namenode -bootstrapStandby -force

/etc/init.d/hadoop-hdfs-namenode start

With the above process, sometimes the NameNode starts OK in standby mode, but sometimes it starts in active mode and then I have 2 active nodes (split brain!).

 

So what have I done wrong? What is the right process to start a NameNode that has been stopped?

 

Thank you.

1 ACCEPTED SOLUTION

Mentor
The fencing config requirement still exists, and you could configure a valid fencer if you wish to, but with Journal Nodes involved you can simply use the following as your fencer, as the QJMs fence the NameNodes by crashing them due to a single elected writer model:

<property>
  <name>dfs.ha.fencing.methods</name>
  <value>shell(/bin/true)</value>
</property>


8 REPLIES

Mentor

An HA HDFS installation requires you to run Failover Controllers on each of
the NameNodes, along with a ZooKeeper service. These controllers take care
of transitioning the NameNodes such that only one is active and the other
becomes standby.

It appears that you're using a CDH package-based (non-CM) installation
here, so please follow the guide starting at
https://www.cloudera.com/documentation/enterprise/5-14-x/topics/cdh_hag_hdfs_ha_intro.html#topic_2_1...,
following the instructions under the 'Command-Line' sections instead of the
Cloudera Manager ones.
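
For a package-based setup, the failover controller portion of that guide boils down to roughly the following. This is a sketch only - the package name and init scripts assume CDH packaging, and the linked guide is the authoritative reference:

# Sketch, assuming CDH packages and init scripts
# Install the failover controller package on both NameNode hosts
# (yum shown; use apt-get on Debian/Ubuntu systems)
sudo yum install hadoop-hdfs-zkfc

# Initialize the HA state znode in ZooKeeper - run once, from one NameNode host
sudo -u hdfs hdfs zkfc -formatZK

# Start the failover controller on each NameNode host
sudo service hadoop-hdfs-zkfc start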

 


@phaothu wrote:

But the problem is: how do I start the NameNode that I stopped again?

 

I do the following:

 

sudo -u hdfs hdfs namenode -bootstrapStandby -force

/etc/init.d/hadoop-hdfs-namenode start

With the above process, sometimes the NameNode starts OK in standby mode, but sometimes it starts in active mode and then I have 2 active nodes (split brain!).

 

So what have I done wrong? What is the right process to start a NameNode that has been stopped?

 

 


Just start it up. The bootstrap command should only be run for a fresh, new NameNode, not on every restart of a previously running NameNode.

 

It's worth noting that Standby and Active are just states of the very same NameNode. The standby NameNode is not a special daemon; it's just a state of the NameNode.
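
Put differently, a routine restart of an already-bootstrapped NameNode on a package-based install is just the service start. A sketch, assuming the CDH init scripts used elsewhere in this thread:

# No -bootstrapStandby on a normal restart
sudo service hadoop-hdfs-namenode start

# The failover controller on that host must also be running so the
# node gets transitioned to the correct state
sudo service hadoop-hdfs-zkfc start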

Champion

@phaothu

 

To do this via CM:

 

Log in as admin to CM -> HDFS -> Instances -> 'Federation and high availability' button -> Action -> Manual Failover
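
(For a non-CM installation, the rough command-line equivalent of that manual failover is hdfs haadmin. A sketch, using the node1/node2 NameNode IDs from the config posted later in this thread:)

# Manually fail over from node1 (current active) to node2
sudo -u hdfs hdfs haadmin -failover node1 node2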

Explorer

Thanks @saranvisa, but I am not using CM; I just installed via the command line.

Mentor
@phaothu,

> My system has 2 DataNodes, 2 NameNodes, 3 JournalNodes, and 3 ZooKeeper services

To repeat, you need to run the ZKFailoverController daemons in addition to this setup. Please see the guide linked in my previous post and follow it entirely for the command-line setup.

Running just ZK will not grant you an HDFS HA solution - you are missing a crucial daemon that interfaces between ZK and HDFS.
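
A quick way to check that the failover controllers are actually up and registered (a sketch; zookeeper-client is assumed to be the CDH wrapper around zkCli.sh):

# On each NameNode host, jps should list a DFSZKFailoverController process
sudo jps

# In ZooKeeper, the HA parent znode should exist once the ZKFCs have registered
zookeeper-client -server node1:2181 ls /hadoop-ha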

Explorer

Dear @Harsh J ,

 

 

Does 'Automatic Failover Configuration' require 'Fencing Configuration' to be set? Are these two independent sections, or do I need both in order to configure automatic failover?

 

Because I got this error:

 

You must configure a fencing method before using automatic failover.
org.apache.hadoop.ha.BadFencingConfigurationException: No fencer configured for NameNode at node1/x.x.x.x:8020
	at org.apache.hadoop.hdfs.tools.NNHAServiceTarget.checkFencingConfigured(NNHAServiceTarget.java:132)
	at org.apache.hadoop.ha.ZKFailoverController.doRun(ZKFailoverController.java:225)
	at org.apache.hadoop.ha.ZKFailoverController.access$000(ZKFailoverController.java:60)
	at org.apache.hadoop.ha.ZKFailoverController$1.run(ZKFailoverController.java:171)
	at org.apache.hadoop.ha.ZKFailoverController$1.run(ZKFailoverController.java:167)
	at org.apache.hadoop.security.SecurityUtil.doAsLoginUserOrFatal(SecurityUtil.java:444)
	at org.apache.hadoop.ha.ZKFailoverController.run(ZKFailoverController.java:167)
	at org.apache.hadoop.hdfs.tools.DFSZKFailoverController.main(DFSZKFailoverController.java:192)
2018-09-10 15:56:53,262 INFO org.apache.zookeeper.ZooKeeper: Session: 0x365c2b22a1e0000 closed
2018-09-10 15:56:53,262 INFO org.apache.zookeeper.ClientCnxn: EventThread shut down

 

If I need both of them, then:

 

<property>
  <name>dfs.ha.fencing.methods</name>
  <value>shell(/path/to/my/script.sh --nameservice=$target_nameserviceid $target_host:$target_port)</value>
</property>

What should

/path/to/my/script.sh

be? I am not clear about the contents of this script. Please explain, and maybe give me an example.

 

Thank you.

 

Mentor
The fencing config requirement still exists, and you could configure a valid fencer if you wish to, but with Journal Nodes involved you can simply use the following as your fencer, as the QJMs fence the NameNodes by crashing them due to a single elected writer model:

<property>
  <name>dfs.ha.fencing.methods</name>
  <value>shell(/bin/true)</value>
</property>
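
(To apply this, the property goes into hdfs-site.xml on both NameNode hosts, and the failover controllers have to be started or restarted so they pick it up. A sketch, assuming the CDH init scripts used elsewhere in this thread:)

# On each NameNode host, after editing hdfs-site.xml
sudo service hadoop-hdfs-zkfc restart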

Explorer

@Harsh J Yes, with

 

<property>
  <name>dfs.ha.fencing.methods</name>
  <value>shell(/bin/true)</value>
</property>

it is working perfectly now.

 

Thank you very much.

Explorer

Yes, as @Harsh J said, I am using a CDH package-based (non-CM) installation. I will show more of my config.

I have 3 nodes: node1, node2, node3

 

ZooKeeper on all 3 nodes, with the same config:

 

maxClientCnxns=50
tickTime=2000
initLimit=10
syncLimit=5
dataDir=/var/lib/zookeeper
clientPort=2181
dataLogDir=/var/lib/zookeeper
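
(Note: for a 3-node ensemble, zoo.cfg usually also needs server entries like the ones below, plus a matching myid file in dataDir on each host. This is a generic sketch using the hostnames from this thread, not part of the poster's config:)

# Typical additions for a 3-node ZooKeeper ensemble (assumed, not from the post)
server.1=node1:2888:3888
server.2=node2:2888:3888
server.3=node3:2888:3888
# and on each host, /var/lib/zookeeper/myid containing 1, 2 or 3 respectively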

 

 

hdfs-site.xml on all 3 nodes:

 

<property>
  <name>dfs.nameservices</name>
  <value>mycluster</value>
</property>

<property>
  <name>dfs.ha.namenodes.mycluster</name>
  <value>node1,node2</value>
</property>

<property>
  <name>dfs.namenode.rpc-address.mycluster.node1</name>
  <value>node1:8020</value>
</property>

<property>
  <name>dfs.namenode.rpc-address.mycluster.node2</name>
  <value>node2:8020</value>
</property>


<property>
  <name>dfs.namenode.http-address.mycluster.node1</name>
  <value>node1:50070</value>
</property>
<property>
  <name>dfs.namenode.http-address.mycluster.node2</name>
  <value>node2:50070</value>
</property>


<property>
  <name>dfs.namenode.shared.edits.dir</name>
  <value>qjournal://node1:8485;node2:8485;node3:8485/mycluster</value>
</property>

<property>
  <name>dfs.client.failover.proxy.provider.mycluster</name>
  <value>org.apache.hadoop.hdfs.server.namenode.ha.ConfiguredFailoverProxyProvider</value>
</property>

<property>
  <name>dfs.ha.automatic-failover.enabled</name>
  <value>true</value>
</property>

<property>
  <name>dfs.journalnode.edits.dir</name>
  <value>/namenode/dfs/jn</value>
</property>
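
<!-- Note: as the accepted solution points out, this hdfs-site.xml is missing a
     dfs.ha.fencing.methods entry, which is why the ZKFC refused to start and the
     restart produced two active NameNodes. The minimal value confirmed to work
     in this thread is: -->
<property>
  <name>dfs.ha.fencing.methods</name>
  <value>shell(/bin/true)</value>
</property>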

core-site.xml on all 3 nodes:

 

 

<property>
  <name>fs.defaultFS</name>
  <value>hdfs://mycluster</value>
</property>

<property>
  <name>ha.zookeeper.quorum</name>
  <value>node1:2181,node2:2181,node3:2181</value>
</property>

 

Above is the config related to the NameNode cluster. I have doubts about the ZooKeeper config; is it enough?

 

Services installed on each node:

Node 1: 

hadoop-hdfs-journalnode  hadoop-hdfs-namenode    hadoop-hdfs-zkfc  zookeeper-server

Node 2:

hadoop-hdfs-journalnode  hadoop-hdfs-namenode    hadoop-hdfs-zkfc  zookeeper-server

Node 3: 

hadoop-hdfs-journalnode  zookeeper-server

 

The first initialization is OK:

Node 1 active, Node 2 standby

 

Stop the NameNode service on Node 1 => Node 2 becomes active => OK

 

But when I start the NameNode service on Node 1 again:

 

Node 1 is active and Node 2 is active too => fail
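
(A quick way to confirm the states from the command line - a sketch assuming the node1/node2 service IDs from the config above:)

# Each command prints "active" or "standby" for the given NameNode ID
sudo -u hdfs hdfs haadmin -getServiceState node1
sudo -u hdfs hdfs haadmin -getServiceState node2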