What is the purpose of shell(/bin/true) in the HDFS HA fencer?

While exploring HDFS HA with ZKFC, I noticed that 'dfs.ha.fencing.methods' is configured as 'shell(/bin/true)'. Could anyone explain the purpose of this setting? As a bonus, it would help to outline the high-level failover flow and where this setting is applied. Thanks.

1 ACCEPTED SOLUTION

Hi @Xiaobing Zhou, I think the requirement to have shell(/bin/true), which is essentially a no-op fencer, can be eliminated. There is no technical reason to require the no-op fencer.

The code that instantiates a fencer is in NodeFencer.java:

public static NodeFencer create(Configuration conf, String confKey)
    throws BadFencingConfigurationException {
  String confStr = conf.get(confKey);
  if (confStr == null) {
    return null;
  }
  return new NodeFencer(conf, confStr);
}

A potential improvement is to instantiate a dummy fencer if dfs.ha.fencing.methods is undefined, i.e. in the confStr == null case above.
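To illustrate the suggested improvement, here is a minimal, self-contained sketch. It is not Hadoop code: FencerFactorySketch, the Fencer interface, and NO_OP are hypothetical stand-ins, and a plain Map replaces Hadoop's Configuration. It only shows the idea of falling back to a no-op fencer instead of returning null.

```java
import java.util.Map;

// Simplified stand-in for the idea discussed above; NOT the real NodeFencer.
final class FencerFactorySketch {

    /** Hypothetical minimal fencer abstraction: returns true if fencing succeeded. */
    interface Fencer {
        boolean fence(String target);
    }

    /** No-op fencer: reports success without doing anything, like shell(/bin/true). */
    static final Fencer NO_OP = target -> true;

    /**
     * Same shape as the create method quoted above, but falls back to the
     * no-op fencer instead of returning null when the key is not set.
     */
    static Fencer create(Map<String, String> conf, String confKey) {
        String confStr = conf.get(confKey);
        if (confStr == null) {
            return NO_OP; // proposed behaviour: a dummy fencer rather than null
        }
        // Placeholder for real parsing: this sketch only recognises the
        // no-op shell command and rejects everything else.
        if (confStr.trim().equals("shell(/bin/true)")) {
            return NO_OP;
        }
        throw new IllegalArgumentException("Unsupported in this sketch: " + confStr);
    }

    public static void main(String[] args) {
        Map<String, String> emptyConf = Map.of();
        Fencer fencer = create(emptyConf, "dfs.ha.fencing.methods");
        System.out.println(fencer.fence("nn1"));
    }
}
```

With such a fallback, leaving dfs.ha.fencing.methods unset would behave the same as configuring shell(/bin/true) explicitly. Whether that default is desirable depends on the shared edits storage in use.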

3 REPLIES

@Xiaobing Zhou

There are two fencing methods: shell and sshfence. Your example uses shell fencing. The /bin/true command always exits successfully, so the fencing step always reports success and failover can proceed if there is an issue with the current active NN. For sshfence, you need to set up passwordless SSH from the active NN to the standby and vice versa.

Please read more about fencing at the link below (refer to dfs.ha.fencing.methods):

https://hadoop.apache.org/docs/r2.7.1/hadoop-project-dist/hadoop-hdfs/HDFSHighAvailabilityWithQJM.ht...

@Xiaobing Zhou

Here is why we need always-true fencing as a second option.

This is a workaround for the case where the machine hosting the primary NameNode goes down: the ssh method then fails and no failover is performed. We want to avoid this, so the second option is to fail over anyway, even without fencing, which is safe with our setup. To achieve this, we specify two fencing methods, which ZKFC tries in order: if the first one fails, the second one is tried. In our case, the second one always returns success, so failover is initiated even if the server running the primary NameNode is not reachable via ssh.

We have tested this approach and it worked fine, especially when the primary NN host was down due to a major hardware or power failure. Ref. https://www.packtpub.com/books/content/setting-namenode-ha
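The ordering described above can be modelled with a short, self-contained sketch. This is a simulation only, not ZKFC or Hadoop code: FencingOrderSketch and its Method class are hypothetical names, and the two lambdas merely stand in for an ssh attempt against an unreachable host and for /bin/true.

```java
import java.util.List;
import java.util.function.Predicate;

// Toy model of "try each fencing method in order, stop at the first success".
final class FencingOrderSketch {

    /** Hypothetical named fencing method; attempt returns true on success. */
    static final class Method {
        final String name;
        final Predicate<String> attempt;

        Method(String name, Predicate<String> attempt) {
            this.name = name;
            this.attempt = attempt;
        }
    }

    /**
     * Tries the methods in list order. Returns the name of the first method
     * that succeeds, or null if every method fails (no failover in that case).
     */
    static String fence(List<Method> methods, String targetHost) {
        for (Method m : methods) {
            if (m.attempt.test(targetHost)) {
                return m.name;
            }
        }
        return null;
    }

    public static void main(String[] args) {
        // Simulates ssh fencing against a host that is powered off.
        Method sshUnreachable = new Method("sshfence", host -> false);
        // Simulates shell(/bin/true): always reports success.
        Method alwaysTrue = new Method("shell(/bin/true)", host -> true);

        // ssh only: fencing fails, so failover would not happen.
        System.out.println(fence(List.of(sshUnreachable), "nn1"));
        // ssh first, then the always-true fallback: fencing "succeeds".
        System.out.println(fence(List.of(sshUnreachable, alwaysTrue), "nn1"));
    }
}
```

In this model the first call yields null because the only method fails, while the second falls through to the always-true method, which is the behaviour the two-method configuration relies on.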
