Created 06-22-2016 11:38 PM
While exploring HDFS HA with ZKFC, I noticed that 'dfs.ha.fencing.methods' is configured as 'shell(/bin/true)'. Could anyone explain the purpose of this setting? As a bonus, it would help to outline the high-level failover flow and where this setting is applied. Thanks.
Created 06-23-2016 11:35 PM
Hi @Xiaobing Zhou, I think the requirement to have shell(/bin/true), which is essentially a no-op fencer, can be eliminated. There is no technical reason to require the no-op fencer.
The code that instantiates a fencer is in NodeFencer.java:
public static NodeFencer create(Configuration conf, String confKey)
    throws BadFencingConfigurationException {
  String confStr = conf.get(confKey);
  if (confStr == null) {
    return null;
  }
  return new NodeFencer(conf, confStr);
}
A potential improvement is to instantiate a dummy (no-op) fencer if dfs.ha.fencing.methods is undefined, i.e. the confStr == null case above.
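To make the suggestion above concrete, here is a minimal, self-contained sketch of what "fall back to a dummy fencer when the key is undefined" could look like. It is not Hadoop code: the class NoOpFencerSketch, the Fencer interface, the NO_OP constant, and the parse helper are hypothetical names, and a plain Map stands in for Hadoop's Configuration.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch only; these are not Hadoop's NodeFencer classes.
final class NoOpFencerSketch {

    // Stand-in for a fencer: returns true if the target counts as fenced.
    interface Fencer {
        boolean fence(String target);
    }

    // Dummy fencer that always reports success, like shell(/bin/true).
    static final Fencer NO_OP = target -> true;

    // Mirrors the shape of create(): a Map replaces Hadoop's Configuration.
    static Fencer create(Map<String, String> conf, String confKey) {
        String confStr = conf.get(confKey);
        if (confStr == null) {
            // Proposed change: return a no-op fencer instead of null.
            return NO_OP;
        }
        return parse(confStr);
    }

    // Placeholder parser (assumption): only the literal shell(/bin/true)
    // is treated as succeeding; real parsing is far richer than this.
    static Fencer parse(String confStr) {
        boolean alwaysTrue = confStr.trim().equals("shell(/bin/true)");
        return target -> alwaysTrue;
    }

    public static void main(String[] args) {
        Map<String, String> conf = new HashMap<>(); // key deliberately unset
        Fencer fencer = create(conf, "dfs.ha.fencing.methods");
        System.out.println("fenced=" + fencer.fence("nn1.example.com"));
    }
}
```

With this shape, callers never see a null fencer, so an unset key behaves the same as an explicit shell(/bin/true). Whether a silent default of "no fencing" is desirable is a separate design question.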
Created 06-23-2016 12:55 AM
There are two built-in fencing methods: shell and sshfence. Your example uses shell fencing. Because /bin/true always exits with status 0, this fencing step always reports success, so failover can proceed whenever there is a problem with the current active NN. For sshfence, you need to set up passwordless SSH between the NameNode hosts, from active to standby and vice versa.
Please read more about fencing at the link below (see dfs.ha.fencing.methods).
Created 06-23-2016 01:46 PM
Here is why we need an always-true fencing method as a second option.
This is a workaround for cases where the machine hosting the primary NameNode goes down: the ssh method fails and, with no other method configured, no failover is performed. We want to avoid this, so the second option is to fail over anyway, even without fencing, which, as already mentioned, is safe with our setup. To achieve this, we specify two fencing methods, which ZKFC tries in order: if the first one fails, the second one is tried. In our case, the second one always returns success, so failover is initiated even if the server running the primary NameNode is not reachable via ssh.
We have tested this approach and it worked fine, especially when the primary NN host was down due to a major hardware or power failure. Ref. https://www.packtpub.com/books/content/setting-namenode-ha
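The ordering described in this reply (try each configured method in turn and stop at the first one that succeeds) can be sketched roughly as follows. This is a simplified model, not Hadoop's implementation: the class FencingChainSketch, the lambda stand-ins for sshfence and shell(/bin/true), and the host name nn1.example.com are all assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Predicate;

// Simplified, hypothetical model of an ordered fencing chain.
final class FencingChainSketch {

    // Tries each method in insertion order and stops at the first success.
    // Names of attempted methods are appended to 'attempted'.
    static boolean fence(Map<String, Predicate<String>> methods,
                         String targetHost,
                         List<String> attempted) {
        for (Map.Entry<String, Predicate<String>> e : methods.entrySet()) {
            attempted.add(e.getKey());
            try {
                if (e.getValue().test(targetHost)) {
                    return true; // fenced: failover may proceed
                }
            } catch (RuntimeException ex) {
                // A method that throws counts as failed; try the next one.
            }
        }
        return false; // every method failed: no failover
    }

    public static void main(String[] args) {
        // LinkedHashMap keeps the configured order of the methods.
        Map<String, Predicate<String>> methods = new LinkedHashMap<>();
        // Stand-in for sshfence when the host is powered off: always fails.
        methods.put("sshfence", host -> false);
        // Stand-in for shell(/bin/true): always succeeds.
        methods.put("shell(/bin/true)", host -> true);

        List<String> attempted = new ArrayList<>();
        boolean fenced = fence(methods, "nn1.example.com", attempted);
        System.out.println("fenced=" + fenced + " attempted=" + attempted);
    }
}
```

If the always-true entry is removed from the map, fence returns false in this scenario, which corresponds to the "no failover will be performed" case the reply is working around.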