We have BI tools whose configuration file has the primary NameNode and secondary NameNode hardcoded. We are seeing a lot of "read operation in standby not allowed" errors. I guess this occurs when the client first checks which NameNode is active and can throw the error if it happens to contact the standby. But we are seeing this error more often than usual, even though jobs are running fine.
I would like to know if this is a cluster-side issue with the NameNode failing over too many times, or a BI tool configuration issue because we are hardcoding the NN and SNN instead of using the service name.
That is definitely an outcome of the hardcoding; your BI tool should use the SERVICE NAME. But I am wondering, because "nn and snn" isn't really HA in the real sense. Can you confirm you have Active/Standby NameNodes, 3 ZooKeepers, etc.?
Did you configure HA as in the example below?
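For reference, an HA setup defines a nameservice in hdfs-site.xml so clients never address a specific host. A minimal sketch, where the nameservice name `mycluster` and the hostnames are placeholders for your own values:

```xml
<!-- hdfs-site.xml: HA nameservice (names/hosts below are placeholders) -->
<property>
  <name>dfs.nameservices</name>
  <value>mycluster</value>
</property>
<property>
  <name>dfs.ha.namenodes.mycluster</name>
  <value>nn1,nn2</value>
</property>
<property>
  <name>dfs.namenode.rpc-address.mycluster.nn1</name>
  <value>namenode1.example.com:8020</value>
</property>
<property>
  <name>dfs.namenode.rpc-address.mycluster.nn2</name>
  <value>namenode2.example.com:8020</value>
</property>
<!-- lets clients discover which NameNode is currently active -->
<property>
  <name>dfs.client.failover.proxy.provider.mycluster</name>
  <value>org.apache.hadoop.hdfs.server.namenode.ha.ConfiguredFailoverProxyProvider</value>
</property>
<property>
  <name>dfs.ha.automatic-failover.enabled</name>
  <value>true</value>
</property>
```

Clients then connect to `hdfs://mycluster` and the failover proxy provider resolves the active NameNode, so no host is hardcoded.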
Could you also share what BI tool and HDP version you are running?
@Geoffrey Shelton Okot
The cluster has HA enabled, with Active and Standby NameNodes. The tool we use is AtScale; its configuration asks for a Primary NN and a Secondary NN and doesn't have an option for a service name.
HDP version 2.6.3
It seems AtScale doesn't have a provision for configuring a nameservice, so your next best option is Apache Knox, a general-purpose proxy and load balancer. It supports HA for YARN, HBase, and HDFS, so if one backend server fails it begins forwarding requests to the next server in its list; in your case, from the primary to the standby, whose failover is managed by ZooKeeper. However, Knox will only forward requests to one server at a time. Native load balancing in Knox is being worked on in KNOX-843.
To perform load balancing today, it is necessary to put a load balancer between Knox and the backend services for WebHDFS in order to truly balance the load on the back end. The difficult part is configuring the backend service to accept HTTP Kerberos authentication from a server other than its own.
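The failover behavior described above is configured via Knox's HaProvider in the gateway topology. A sketch of a WebHDFS HA topology, with placeholder hostnames and the default HDP 2.6 WebHDFS port:

```xml
<!-- Knox topology: HaProvider fails WEBHDFS requests over between the two
     NameNode URLs listed in the service section (hosts are placeholders) -->
<topology>
  <gateway>
    <provider>
      <role>ha</role>
      <name>HaProvider</name>
      <enabled>true</enabled>
      <param>
        <name>WEBHDFS</name>
        <value>maxFailoverAttempts=3;failoverSleep=1000;enabled=true</value>
      </param>
    </provider>
  </gateway>
  <service>
    <role>WEBHDFS</role>
    <url>http://namenode1.example.com:50070/webhdfs</url>
    <url>http://namenode2.example.com:50070/webhdfs</url>
  </service>
</topology>
```

With this in place, a request that hits the standby (or a dead NameNode) is retried against the other URL up to `maxFailoverAttempts` times.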
To achieve Knox HA you will need at least 2 Knox instances, but I would look at your expected load and make sure the servers that remain up during a failure can handle it. For example, if you have 2 instances of Knox and one fails, can the one left active handle the load? If not, you may need 3 or more Knox instances.
You will have to point your AtScale connection to the Knox gateway.
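Concretely, instead of a NameNode host the tool would use the Knox gateway URL. Gateway host, port, and topology name below are placeholders for your deployment:

```
Direct to a NameNode (current setup, fails when that node is standby):
  http://namenode1.example.com:50070/webhdfs/v1/tmp?op=LISTSTATUS

Through Knox (topology "default"; Knox routes to the active NameNode):
  https://knoxhost.example.com:8443/gateway/default/webhdfs/v1/tmp?op=LISTSTATUS
```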