Active NameNode unable to restart/start in high availability mode if the standby node is not responding?

avatar
Explorer

I have a 3-node cluster with high availability enabled for the NameNode. When I shut down one of the two machines that host a NameNode instance and then try to restart the active NameNode, the start fails with the error "Getting jmx metrics from NN failed". While debugging, I noticed that the start script sends a JMX request to each NameNode several times to get the node's state, and finally ends with the error "Python script has been killed due to timeout after waiting 1800 secs".

7 REPLIES

avatar
@Vinay Khandelwal

Restart ZooKeeper and try again.

avatar
Explorer

@Divakar Annapureddy That isn't working. I think the issue is "NameNode HA states: active_namenodes = [], standby_namenodes = [(u'nn2', 'node2:50070')], unknown_namenodes = [(u'nn1', 'node1:50070')]".

When I manually shut down node1, the NameNode states don't change, as I have seen in the start logs.

avatar

@Vinay Khandelwal - NameNode HA state is maintained by the ZKFC (ZKFailoverController) process running on each NameNode host. Can you please answer the questions below:

  • When you say you shut down one of the two machines, do you mean you shut down only the NameNode, or the entire machine?
  • Are you shutting down the ZKFC servers as well?
  • Also, are you restarting through Ambari or the command line, and which HDP version are you using?

And, if possible, can you please post the NameNode logs?

Thanks

avatar
Explorer

@Namit Maheshwari As you asked:

  • I shut down the machine manually to test the high availability of the cluster.
  • The ZKFC server on that machine goes down automatically when I shut down the machine.
  • I am trying to restart the NameNode or History Server from the Oozie web UI, and I am using HDP-2.5.

I think you are asking for the NameNode start logs:

2017-03-30 20:50:20,863 - Waiting for the NameNode to broadcast whether it is Active or Standby...
2017-03-30 20:50:20,866 - call['ambari-sudo.sh su hdfs -l -s /bin/bash -c 'curl -s '"'"'http://node1:50070/jmx?qry=Hadoop:service=NameNode,name=FSNamesystem'"'"' 1>/tmp/tmpZwMQaz 2>/tmp/tmp8u5jXA''] {'quiet': False}
2017-03-30 20:52:28,369 - call returned (7, '')
2017-03-30 20:52:28,370 - Getting jmx metrics from NN failed. URL: http://node1:50070/jmx?qry=Hadoop:service=NameNode,name=FSNamesystem
Traceback (most recent call last):
  File "/usr/lib/python2.6/site-packages/resource_management/libraries/functions/jmx.py", line 38, in get_value_from_jmx
    _, data, _ = get_user_call_output(cmd, user=run_user, quiet=False)
  File "/usr/lib/python2.6/site-packages/resource_management/libraries/functions/get_user_call_output.py", line 61, in get_user_call_output
    raise Fail(err_msg)
Fail: Execution of 'curl -s 'http://node1:50070/jmx?qry=Hadoop:service=NameNode,name=FSNamesystem' 1>/tmp/tmpZwMQaz 2>/tmp/tmp8u5jXA' returned 7. 

2017-03-30 20:52:28,371 - call['hdfs haadmin -ns NameNodeURI -getServiceState nn1'] {'logoutput': True, 'user': 'hdfs'}
17/03/30 20:52:50 INFO ipc.Client: Retrying connect to server: node1/10.10.2.81:8020. Already tried 0 time(s); maxRetries=45
17/03/30 20:53:10 INFO ipc.Client: Retrying connect to server: node1/10.10.2.81:8020. Already tried 1 time(s); maxRetries=45
17/03/30 20:53:30 INFO ipc.Client: Retrying connect to server: node1/10.10.2.81:8020. Already tried 2 time(s); maxRetries=45
.
.
17/03/30 21:07:11 INFO ipc.Client: Retrying connect to server: node1/10.10.2.81:8020. Already tried 43 time(s); maxRetries=45
17/03/30 21:07:31 INFO ipc.Client: Retrying connect to server: node1/10.10.2.81:8020. Already tried 44 time(s); maxRetries=45
Operation failed: Call From node2/10.10.2.82 to node1:8020 failed on socket timeout exception: org.apache.hadoop.net.ConnectTimeoutException: 20000 millis timeout while waiting for channel to be ready for connect. ch : java.nio.channels.SocketChannel[connection-pending remote=node1/10.10.2.81:8020]; For more details see:  http://wiki.apache.org/hadoop/SocketTimeout
2017-03-30 21:07:51,283 - call returned (255, '17/03/30 20:52:50 INFO ipc.Client: Retrying connect to server: node1/10.10.2.81:8020. Already tried 0 time(s); maxRetries=45\n17/03/30 20:53:10 INFO ipc.Client: Retrying connect to server: node1/10.10.2.81:8020. Already tried 1 time(s); maxRetries=45\n17/03/30 20:53:30 INFO ipc.Client: Retrying connect to server: node1/10.10.2.81:8020. Already tried 2 time(s); maxRetries=45\n17/03/30 20:53:50 INFO ipc.Client: Retrying connect to server: node1/10.10.2.81:8020. Already tried 3 time(s); maxRetries=45\n17/03/30 20:54:10 INFO ipc.Client: Retrying connect to server: node1/10.10.2.81:8020. Already tried 4 time(s); maxRetries=45\n17/03/30
.
.
.
21:07:11 INFO ipc.Client: Retrying connect to server: node1/10.10.2.81:8020. Already tried 43 time(s); maxRetries=45\n17/03/30 21:07:31 INFO ipc.Client: Retrying connect to server: node1/10.10.2.81:8020. Already tried 44 time(s); maxRetries=45\nOperation failed: Call From node2/10.10.2.82 to node1:8020 failed on socket timeout exception: org.apache.hadoop.net.ConnectTimeoutException: 20000 millis timeout while waiting for channel to be ready for connect. ch : java.nio.channels.SocketChannel[connection-pending remote=node1/10.10.2.81:8020]; For more details see:  http://wiki.apache.org/hadoop/SocketTimeout')
2017-03-30 21:07:51,284 - call['ambari-sudo.sh su hdfs -l -s /bin/bash -c 'curl -s '"'"'http://node2:50070/jmx?qry=Hadoop:service=NameNode,name=FSNamesystem'"'"' 1>/tmp/tmpYw7oMN 2>/tmp/tmpk0tfnh''] {'quiet': False}
2017-03-30 21:07:51,544 - call returned (0, '')
2017-03-30 21:07:51,547 - NameNode HA states: active_namenodes = [], standby_namenodes = [(u'nn2', 'node2:50070')], unknown_namenodes = [(u'nn1', 'node1:50070')]
2017-03-30 21:07:51,547 - Will retry 4 time(s), caught exception: No active NameNode was found.. Sleeping for 5 sec(s)

This sequence is repeated until the start script times out.
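The timestamps in the log above suggest why the script hits the 1800-second limit. A rough sketch of the arithmetic, using only values visible in the log (the 20000 ms connect timeout, maxRetries=45, and the curl start and end times), is below. The two numbers look like the Hadoop client settings ipc.client.connect.timeout and ipc.client.connect.max.retries.on.timeouts, but that mapping is an assumption here, not something the log confirms.

```shell
#!/usr/bin/env bash
# Values read from the start log above.
connect_timeout_s=20    # "20000 millis timeout while waiting for channel"
max_retries=45          # "maxRetries=45"
script_timeout_s=1800   # "killed due to timeout after waiting 1800 secs"

# Time spent by "hdfs haadmin -getServiceState nn1" against the dead host.
haadmin_wait_s=$((connect_timeout_s * max_retries))

# Time spent by curl against node1's JMX URL: 20:50:20 -> 20:52:28.
curl_wait_s=$(( (52 * 60 + 28) - (50 * 60 + 20) ))

# One pass of the state check, ignoring the fast call to node2.
per_attempt_s=$((curl_wait_s + haadmin_wait_s))
full_attempts=$((script_timeout_s / per_attempt_s))

echo "haadmin wait per attempt: ${haadmin_wait_s}s"
echo "curl wait per attempt:    ${curl_wait_s}s"
echo "total per attempt:        ${per_attempt_s}s"
echo "full attempts in ${script_timeout_s}s: ${full_attempts}"
```

With these numbers a single pass takes roughly 17 minutes, so only one of the retries announced by "Will retry 4 time(s)" can finish before the script is killed.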

avatar

@Vinay Khandelwal - When you shut down the machine, the NameNode and the ZKFC server on it go down with it. The other NameNode will automatically fail over to become the Active NameNode. No restart is required here.

My other question is how many DataNodes you have on your 3-node cluster. Was a DataNode also running on the host you shut down?
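One way to check whether the surviving NameNode has actually become active is to query the same JMX URL that appears in the start log, but with bounded waits so a powered-off host fails fast. This is a minimal sketch, not what Ambari itself runs: the function names are made up for illustration, and the field name "tag.HAState" is an assumption about what the FSNamesystem bean reports, so check it against your own JMX output.

```shell
#!/usr/bin/env bash
# Extract the value of "tag.HAState" from JMX JSON read on stdin.
# Prints "unknown" when the field is absent (for example, when curl
# could not connect and produced no output).
parse_ha_state() {
  local state
  state=$(sed -n 's/.*"tag\.HAState"[[:space:]]*:[[:space:]]*"\([^"]*\)".*/\1/p' | head -n 1)
  echo "${state:-unknown}"
}

# Probe one NameNode web address (host:port), giving up after a few
# seconds instead of waiting for the default TCP connect timeout.
probe_namenode() {
  local addr=$1
  curl -s --connect-timeout 5 --max-time 10 \
    "http://${addr}/jmx?qry=Hadoop:service=NameNode,name=FSNamesystem" \
    | parse_ha_state
}

# Example usage (addresses taken from the log in this thread):
# for addr in node1:50070 node2:50070; do
#   echo "${addr}: $(probe_namenode "${addr}")"
# done
```

If node2 still reports standby while node1 is unreachable, automatic failover did not happen, and the ZKFC log on node2 is the next place to look.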

avatar
Explorer

@Namit Maheshwari - Yes, by default it wouldn't require a restart. But I want to restart it manually: I have two instances of the History Server and manage them with an external monitoring service, so I need to restart at least the History Server when the node running it goes down.

I have two DataNodes in the cluster, and yes, one is running on the host that I want to shut down.

avatar

Can you try restarting the NameNode and ZKFC from the command line:

su -l hdfs -c "/usr/hdp/current/hadoop-hdfs-namenode/../hadoop/sbin/hadoop-daemon.sh stop namenode"
su -l hdfs -c "/usr/hdp/current/hadoop-hdfs-namenode/../hadoop/sbin/hadoop-daemon.sh stop zkfc"

su -l hdfs -c "/usr/hdp/current/hadoop-hdfs-namenode/../hadoop/sbin/hadoop-daemon.sh start zkfc"
su -l hdfs -c "/usr/hdp/current/hadoop-hdfs-namenode/../hadoop/sbin/hadoop-daemon.sh start namenode"

And then paste the error you get.