Support Questions

Find answers, ask questions, and share your expertise

Question on hdfs automatic failover

avatar
Master Guru

Hey Guys,

Consider below scenario:

1. I have NN HA configured on my cluster

2. I have configured ssh fencing

3. My active NN went down and automated failover did not work

4. I had to failover manually using -forcemanual flag.

What fencing method we can use so that in case of power failure/physical server crash/OS reboot there would be automated failover? is it possible ?

1 ACCEPTED SOLUTION

avatar
Super Guru

In this case you need to configure two fencing methods and your last method should give success always so that automatic failover can happen successfully.

Please refer below link.

https://www.packtpub.com/books/content/setting-namenode-ha

View solution in original post

7 REPLIES 7

avatar
Master Mentor
@Kuldeep Kulkarni

https://hadoop.apache.org/docs/r2.7.1/hadoop-project-dist/hadoop-hdfs/HDFSHighAvailabilityWithQJM.ht...

You can try this

shell - run an arbitrary shell command to fence the Active NameNode

The shell fencing method runs an arbitrary shell command. It may be configured like so:

    <property>
      <name>dfs.ha.fencing.methods</name>
      <value>shell(/path/to/my/script.sh arg1 arg2 ...)</value>
    </property>

The string between ‘(’ and ‘)’ is passed directly to a bash shell and may not include any closing parentheses.

The shell command will be run with an environment set up to contain all of the current Hadoop configuration variables, with the ‘_’ character replacing any ‘.’ characters in the configuration keys. The configuration used has already had any namenode-specific configurations promoted to their generic forms – for example dfs_namenode_rpc-address will contain the RPC address of the target node, even though the configuration may specify that variable asdfs.namenode.rpc-address.ns1.nn1.

You can write your own custom scripts. Also, you can check with the OS vendor like "https://access.redhat.com/documentation/en-US/Red_Hat_Enterprise_Linux/5/html/Cluster_Suite_Overview/s2-fencing-overview-CSO.html"

avatar
Master Guru

@Neeraj Sabharwal - Thank you!

avatar
Super Guru

In this case you need to configure two fencing methods and your last method should give success always so that automatic failover can happen successfully.

Please refer below link.

https://www.packtpub.com/books/content/setting-namenode-ha

avatar
Master Guru

avatar
Master Guru

avatar
Master Guru

One little extra comment: You do not need any fencing method for the failover. The QJM and zookeeper quorums make sure only the active namenode can write to the fsimage. However it is possible that a zombie active namenode might still give outdated read-only requests to connected clients.

That's where fencing comes in. However if configured the active namenode will wait for the fencing method to return success. So you need to be sure that your method does not block (by configuring a timeout in your ssh action for example ) and that in the end it returns success. I.e. either use a script that returns success in any case or have multiple non-blocking methods that end with one that returns true in any case

avatar
Master Guru

@Benjamin Leonhardi - Thank you!