Support Questions


Two NameNodes are standby after configuring HA

avatar
Rising Star

I have configured high availability in my cluster, which consists of three nodes:

hadoop-master (192.168.4.128) (NameNode)
hadoop-slave-1 (192.168.4.111) (another NameNode)
hadoop-slave-2 (192.168.4.106) (DataNode)

without formatting the NameNode (i.e., converting a non-HA-enabled cluster to be HA-enabled), as described here: https://hadoop.apache.org/docs/stable/hadoop-project-dist/hadoop-hdfs/HDFSHighAvailabilityWithQJM.ht...

But both NameNodes came up as standby, so I tried to transition one of them to active by running the following command:

 hdfs haadmin -transitionToActive mycluster --forcemanual

which produced the following output:

17/04/03 08:07:35 WARN ha.HAAdmin: Proceeding with manual HA state management even though
automatic failover is enabled for NameNode at hadoop-master/192.168.4.128:8020
17/04/03 08:07:36 WARN ha.HAAdmin: Proceeding with manual HA state management even though
automatic failover is enabled for NameNode at hadoop-slave-1/192.168.4.111:8020
Illegal argument: Unable to determine service address for namenode 'mycluster'

My core-site.xml is:

<property>
    <name>dfs.tmp.dir</name>
    <value>/opt/hadoop/data15</value>
</property>
<property>
    <name>fs.default.name</name>
    <value>hdfs://hadoop-master:8020</value>
</property>
<property>
    <name>dfs.permissions</name>
    <value>false</value>
</property>
<property>
    <name>dfs.journalnode.edits.dir</name>
    <value>/usr/local/journal/node/local/data</value>
</property>
<property>
    <name>fs.defaultFS</name>
    <value>hdfs://mycluster</value>
</property>
<property>
    <name>hadoop.tmp.dir</name>
    <value>/tmp</value>
</property>

My hdfs-site.xml is:

<property>
    <name>dfs.replication</name>
    <value>2</value>
</property>
<property>
    <name>dfs.name.dir</name>
    <value>/opt/hadoop/data16</value>
    <final>true</final>
</property>
<property>
    <name>dfs.data.dir</name>
    <value>/opt/hadoop/data17</value>
    <final>true</final>
</property>
<property>
    <name>dfs.webhdfs.enabled</name>
    <value>true</value>
</property>
<property>
    <name>dfs.namenode.secondary.http-address</name>
    <value>hadoop-slave-1:50090</value>
</property>
<property>
    <name>dfs.nameservices</name>
    <value>mycluster</value>
    <final>true</final>
</property>
<property>
    <name>dfs.ha.namenodes.mycluster</name>
    <value>hadoop-master,hadoop-slave-1</value>
    <final>true</final>
</property>
<property>
    <name>dfs.namenode.rpc-address.mycluster.hadoop-master</name>
    <value>hadoop-master:8020</value>
</property>
<property>
    <name>dfs.namenode.rpc-address.mycluster.hadoop-slave-1</name>
    <value>hadoop-slave-1:8020</value>
</property>
<property>
    <name>dfs.namenode.http-address.mycluster.hadoop-master</name>
    <value>hadoop-master:50070</value>
</property>
<property>
    <name>dfs.namenode.http-address.mycluster.hadoop-slave-1</name>
    <value>hadoop-slave-1:50070</value>
</property>
<property>
    <name>dfs.namenode.shared.edits.dir</name>
    <value>qjournal://hadoop-master:8485;hadoop-slave-2:8485;hadoop-slave-1:8485/mycluster</value>
</property>
<property>
    <name>dfs.ha.automatic-failover.enabled</name>
    <value>true</value>
</property>
<property>
    <name>ha.zookeeper.quorum</name>
    <value>hadoop-master:2181,hadoop-slave-1:2181,hadoop-slave-2:2181</value>
</property>
<property>
    <name>dfs.ha.fencing.methods</name>
    <value>sshfence</value>
</property>
<property>
    <name>dfs.ha.fencing.ssh.private-key-files</name>
    <value>root/.ssh/id_rsa</value>
</property>
<property>
    <name>dfs.ha.fencing.ssh.connect-timeout</name>
    <value>3000</value>
</property>

What should the service address value be? And what can I do to transition one of the two NameNodes to the active state?

Note: the ZooKeeper server on all three nodes is stopped.

10 REPLIES

avatar
Rising Star

You need to start the ZooKeeper server in order to bring the ZKFailoverController up. The ZKFailoverController is the component that manages the active and standby states of the NameNodes.
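
For a manually managed Apache Hadoop 2.7.x cluster (no Ambari or Cloudera Manager), a minimal sketch of that sequence, assuming the ZooKeeper and Hadoop sbin scripts are on the PATH, would be:

  # on each ZooKeeper quorum host (hadoop-master, hadoop-slave-1, hadoop-slave-2)
  zkServer.sh start

  # on each NameNode host, start the ZKFailoverController daemon
  hadoop-daemon.sh start zkfc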

avatar
Rising Star

Even though I started the ZooKeeper server, and one of the two NameNodes reports leader mode while the other NameNode and the DataNode report follower mode, I still get the same problem: both NameNodes are standby. Also, there are no log files under the log directory configured in zoo.cfg, so I can't see the ZooKeeper errors. But I think that when zkServer.sh status reports a state (follower or leader), it indicates that everything with ZooKeeper is all right, doesn't it?

avatar
Rising Star

Running ./zkCli.sh on the two NameNodes shows the same error:

Welcome to ZooKeeper!
JLine support is enabled
[zk: localhost:2181(CONNECTING) 0]
2017-04-03 09:57:34,141 [myid:] - INFO [main-SendThread(127.0.0.1:2181):ClientCnxn$SendThread@1032] - Opening socket connection to server 127.0.0.1/127.0.0.1:2181. Will not attempt to authenticate using SASL (unknown error)
2017-04-03 09:57:34,148 [myid:] - WARN [main-SendThread(127.0.0.1:2181):ClientCnxn$SendThread@1162] - Session 0x0 for server null, unexpected error, closing socket connection and attempting reconnect
java.net.ConnectException: Connection refused
    at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
    at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:744)
    at org.apache.zookeeper.ClientCnxnSocketNIO.doTransport(ClientCnxnSocketNIO.java:361)
    at org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:1141)
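
A "Connection refused" at this point usually means nothing is actually listening on port 2181 on that host. Assuming nc and net-tools are available, one way to confirm whether ZooKeeper is really serving requests is:

  # ZooKeeper replies "imok" to the ruok four-letter command when it is healthy
  echo ruok | nc localhost 2181

  # or check the server status and the listening port directly
  zkServer.sh status
  netstat -tlnp | grep 2181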

avatar

Are you using HDP, and did you enable NameNode HA using Ambari? If so, then you should have automatic failover configured. Automatic failover requires the ZooKeeper service instances and the ZooKeeper FailoverControllers to be up and running.
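
A quick way to verify those pieces (using the NameNode IDs from the configuration above) would be something like:

  # ask each NameNode for its current HA state
  hdfs haadmin -getServiceState hadoop-master
  hdfs haadmin -getServiceState hadoop-slave-1

  # confirm the failover controller process is running on each NameNode host
  jps | grep DFSZKFailoverController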

If you set up HA manually, then you may need to transition one of the NameNodes to active state manually, as described here:

https://hadoop.apache.org/docs/stable/hadoop-project-dist/hadoop-hdfs/HDFSHighAvailabilityWithQJM.ht...
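
Note that hdfs haadmin -transitionToActive expects one of the NameNode IDs defined in dfs.ha.namenodes.<nameservice> (here hadoop-master or hadoop-slave-1), not the nameservice name itself, which is why passing "mycluster" produced the "Unable to determine service address" error above. A sketch of the manual transition (the --forcemanual flag is required while automatic failover is enabled):

  hdfs haadmin -transitionToActive hadoop-master --forcemanual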

avatar
Rising Star

I am using Apache Hadoop 2.7.1, and I have followed the link you provided:

https://hadoop.apache.org/docs/stable/hadoop-project-dist/hadoop-hdfs/HDFSHighAvailabilityWithQJM.ht...

and finally tried to manually force one of the two NameNodes to become active by running

hdfs haadmin -transitionToActive hadoop-master

with the following response:

17/04/04 03:13:06 WARN ha.HAAdmin: Proceeding with manual HA state management even though
automatic failover is enabled for NameNode at hadoop-slave-1/192.168.4.111:8020
17/04/04 03:13:07 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
17/04/04 03:13:07 WARN ha.HAAdmin: Proceeding with manual HA state management even though
automatic failover is enabled for NameNode at hadoop-master/192.168.4.128:8020
Operation failed: End of File Exception between local host is: "hadoop-master/192.168.4.128"; destination host is: "hadoop-master":8020; : java.io.EOFException; For more details see: http://wiki.apache.org/hadoop/EOFException

What should I do with two standby NameNodes? Should I run a NameNode format on one of them?

avatar

OK, it looks like you have automatic failover enabled. I am not sure why you get the EOFException.

Look through your NameNode logs to see if there are any errors.
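
For an Apache Hadoop tarball install, the NameNode log usually lives under $HADOOP_HOME/logs (the exact file name depends on the user and hostname), so something like the following on each NameNode host is a reasonable starting point:

  tail -n 200 $HADOOP_HOME/logs/hadoop-*-namenode-*.log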

avatar
New Contributor

Hi All,

I had a similar issue while building a new cluster and enabling HA. Both NameNodes were in standby, with the error {1} in the NameNode log.

The fix was in Cloudera Manager: we needed to run "Initialize High Availability State in ZooKeeper" under "Federation and High Availability", and then restart the cluster.

{1}Caused by: java.net.ConnectException: Call From <NN1> to <NN2>:8022 failed on connection exception: java.net.ConnectException: Connection refused; For more details see: http://wiki.apache.org/hadoop/ConnectionRefused
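
For a cluster that is not managed by Cloudera Manager (such as the plain Apache Hadoop 2.7.1 setup in the original question), the equivalent of "Initialize High Availability State in ZooKeeper" would presumably be formatting the ZKFC state manually, which creates the /hadoop-ha znode the failover controllers rely on:

  # run once, on one NameNode host, while the failover controllers are stopped
  hdfs zkfc -formatZK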

avatar
Expert Contributor

Hello @kolli_sandeep, it seems the failover controllers are down in the cluster. Please follow the steps here [1] and start the Failover Controller roles, which will transition the NameNodes to the Active/Standby states.

You need to follow the steps below:

  1. Stop the FailoverController roles under the HDFS > Instances page.
  2. Remove the HA state from ZK. On a ZooKeeper server host, run zookeeper-client (see the sketch after this list).
    1. Execute the following to remove the configured nameservice. This example assumes the name of the nameservice is nameservice1. You can identify the nameservice from the Federation and High Availability section on the HDFS Instances tab:
      rmr /hadoop-ha/nameservice1
      (If you don't see any /hadoop-ha znode in the ZK znode list, skip this step.)
  3. After removing the HA znode in ZK, go to CM and click HDFS > Instances > Federation and High Availability > Actions.
  4. Under the Actions menu, select Actions > Initialize High Availability State in ZooKeeper.
  5. Then start the Failover Controller roles (CM > Instances > select FailoverControllers > Actions for selected > Start).
  6. Verify the NameNode state; if you don't see the active/standby states of the NameNodes, or anything fails, just restart the HDFS service.
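
As referenced in step 2, a minimal sketch of the ZooKeeper part (assuming the nameservice is named nameservice1, as in the example above):

  # on a ZooKeeper server host, open the ZooKeeper CLI
  zookeeper-client

  # inside the ZooKeeper shell: list the HA znodes, then remove the one for this nameservice
  ls /hadoop-ha
  rmr /hadoop-ha/nameservice1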

[1] https://docs.cloudera.com/documentation/enterprise/latest/topics/cdh_hag_hdfs_ha_enabling.html