Thanks for the reply.
We are having issues with the failover controllers of our ResourceManager and NameNode: they are too sensitive. Our ZooKeeper cluster sometimes hangs for several seconds under high load. We would like to make the failover controllers more tolerant of transient ZooKeeper failures, for example by increasing the number of retries and the retry interval. Do you have any advice?