Created on 03-15-2016 05:05 AM - edited 08-19-2019 02:54 AM
The critical errors showed up after enabling NameNode HA and adding a new NameNode instance on node1 (node1:50070), as shown in the picture below. The cluster has 5 hosts and is managed with Ambari.
The problem is that the NameNode UI cannot connect. It started after I enabled NameNode HA from the active NameNode and tried to add a standby on node1 (node1:50070). During the manual steps of the NameNode HA wizard I did not realize that I was on the command line of the physical host node1 instead of the physical host namenode, and I ran the dfsadmin safemode and saveNamespace commands, the initialize step, and the other manual steps there. During the process I got a message telling me to format the NameNode, which I did, and I went on to complete all the steps of the wizard. The critical errors now shown in the alerts section are:
a) The standby NameNode (node1:50070) starts, but the active NameNode (namenode:50070) does not start, and the NameNode web UI does not open. If I start the NameNode on node1 from the service menu, it becomes the standby and the NameNode on namenode stops, and vice versa. The NameNode web UI dropdown shows one as standby and the other as just the hostname.
b) In the MapReduce2 service, the History Server process does not start. I have gone through the YARN log aggregation configuration, but it did not help much.
What can I do about these? All of the above are critical errors that appeared after enabling NameNode HA. Until then the cluster ran fine.
Please let me know the solutions and share your expertise; I am trying too. How can I assign the active and standby NameNode if anything goes wrong in the wizard?
(Will hdfs haadmin do the trick?)
DFSHAAdmin [-ns <nameserviceId>]
[-failover [--forcefence] [--forceactive] <serviceId> <serviceId>]
What values should be used in place of nameserviceId and serviceId in a NameNode HA configuration?
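One way to look these values up is to read them from the client configuration with `hdfs getconf`. The sketch below assumes the `hdfs` client is on the PATH, that it reads the cluster's `hdfs-site.xml`, and that a single nameservice is configured; the function name `list_ha_ids` is made up for illustration.

```shell
#!/bin/sh
# Sketch: print the nameservice ID and the NameNode service IDs.
# Assumes a single nameservice (no federation) and an hdfs client
# that reads the cluster's hdfs-site.xml.
list_ha_ids() {
    # dfs.nameservices holds the logical name of the HA nameservice
    ns=$(hdfs getconf -confKey dfs.nameservices) || return 1
    echo "nameserviceId: $ns"
    # dfs.ha.namenodes.<nameservice> holds a comma-separated list of
    # NameNode IDs; these are the serviceId values haadmin expects
    ids=$(hdfs getconf -confKey "dfs.ha.namenodes.$ns") || return 1
    echo "serviceIds: $ids"
}

# Only run the lookup when an hdfs client is actually available
if command -v hdfs >/dev/null 2>&1; then
    list_ha_ids
fi
```

The nameservice ID is the value of `dfs.nameservices`, and each entry in `dfs.ha.namenodes.<nameservice>` is a service ID that can be passed to `hdfs haadmin`.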
@satish k.t - Can you please make the current standby NameNode active with the command below? At least your HDFS will be up and running, and we can troubleshoot the issues with the other NN.
hdfs haadmin -transitionToActive <service-id-of-standby-NN> --forcemanual
Once the current standby becomes active, try restarting the problematic NN and let us know how it goes.
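To confirm the result of the transition, the state of each NameNode can be queried with `hdfs haadmin -getServiceState`. This is a minimal sketch: the helper name `check_ha_state` is hypothetical, and the service IDs passed to it (for example `nn1` and `nn2`) are placeholders for whatever your cluster defines.

```shell
#!/bin/sh
# Sketch: report the HA state of each NameNode service ID given as an argument.
# A NameNode that is down or cannot be contacted is reported as "unreachable".
check_ha_state() {
    for id in "$@"; do
        # -getServiceState prints "active" or "standby" on success
        state=$(hdfs haadmin -getServiceState "$id" 2>/dev/null) || state="unreachable"
        echo "$id: $state"
    done
}
```

Calling `check_ha_state nn1 nn2` before and after restarting the problematic NN shows whether it rejoins as standby.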
Yes, thanks. Disabling NameNode HA and reverting to the secondary NameNode will do fine. But I believe the manual steps that I ran on node1 should have been run on the physical host namenode, which is the current active NameNode. I believe that is the problem and what started all the errors.
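A simple guard against this kind of mix-up is to check the hostname before running any of the manual wizard steps. The sketch below is only an illustration: `require_host` is a made-up helper, and it assumes `hostname` returns the same form of the name (short or fully qualified) that you pass in.

```shell
#!/bin/sh
# Sketch: refuse to continue unless the shell is on the expected host.
require_host() {
    expected=$1
    actual=$(hostname)
    if [ "$actual" != "$expected" ]; then
        echo "refusing to continue: on '$actual', expected '$expected'" >&2
        return 1
    fi
}

# Example use before a manual step (host name taken from this thread):
#   require_host namenode && hdfs dfsadmin -safemode enter
```

Chaining each manual command after `require_host` with `&&` means the command is skipped when the terminal is open on the wrong machine.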