Created 03-30-2017 05:09 PM
I have a 3-node cluster with high availability for the NameNode. When I shut down one of the two machines hosting a NameNode instance and try to restart the active NameNode, it fails with the error "Getting jmx metrics from NN failed". While debugging I noticed that the start script makes a JMX request to each NameNode to get the node's state, retries multiple times, and finally ends with the error "python script has been killed due to timeout after waiting 1800 secs".
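For reference, the probe the start script runs looks roughly like this (a minimal sketch of what I see in the logs; get_ha_state is my own helper name, and I'm assuming the NameNode web UI is plain HTTP on port 50070, as on my cluster):

import json
import urllib2  # the Ambari agent here runs Python 2.6, per the traceback below

def get_ha_state(host, port=50070, timeout=20):
    # Same JMX query the start script issues via curl; the FSNamesystem
    # bean carries a "tag.HAState" attribute ("active" or "standby").
    url = "http://%s:%d/jmx?qry=Hadoop:service=NameNode,name=FSNamesystem" % (host, port)
    try:
        data = json.load(urllib2.urlopen(url, timeout=timeout))
        return data["beans"][0]["tag.HAState"]
    except Exception:
        return "unknown"  # unreachable -> Ambari reports the NN as unknown

for nn in ("node1", "node2"):
    print nn, get_ha_state(nn)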
Created 03-30-2017 05:37 PM
Restart ZooKeeper and give it another try.
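You can restart it from Ambari, or on each ZooKeeper host from the shell, something like the following (assuming the default HDP install path; adjust if your layout differs):

su - zookeeper -c "/usr/hdp/current/zookeeper-server/bin/zkServer.sh restart"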
Created 03-31-2017 08:10 AM
@Divakar Annapureddy it isn't working. I think the issue is "NameNode HA states: active_namenodes = [], standby_namenodes = [(u'nn2', 'node2:50070')], unknown_namenodes = [(u'nn1', 'node1:50070')]".
When I manually shut down node1, the NameNodes' states don't change, as seen in the start logs.
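The state can also be queried directly with the same haadmin command the start script runs (nn1 and nn2 are the NameNode IDs on my cluster; the call against the downed node will just retry until it times out):

hdfs haadmin -getServiceState nn1
hdfs haadmin -getServiceState nn2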
Created 03-30-2017 07:55 PM
@Vinay Khandelwal - The NameNode HA state is maintained by the ZKFC server running on the NameNode hosts. Can you please answer the questions below, and if possible post the NameNode logs?
Thanks
Created 03-31-2017 07:07 AM
@Namit Maheshwari As you asked - I think you're asking for the NameNode start logs:
2017-03-30 20:50:20,863 - Waiting for the NameNode to broadcast whether it is Active or Standby...
2017-03-30 20:50:20,866 - call['ambari-sudo.sh su hdfs -l -s /bin/bash -c 'curl -s '"'"'http://node1:50070/jmx?qry=Hadoop:service=NameNode,name=FSNamesystem'"'"' 1>/tmp/tmpZwMQaz 2>/tmp/tmp8u5jXA''] {'quiet': False}
2017-03-30 20:52:28,369 - call returned (7, '')
2017-03-30 20:52:28,370 - Getting jmx metrics from NN failed. URL: http://node1:50070/jmx?qry=Hadoop:service=NameNode,name=FSNamesystem
Traceback (most recent call last):
  File "/usr/lib/python2.6/site-packages/resource_management/libraries/functions/jmx.py", line 38, in get_value_from_jmx
    _, data, _ = get_user_call_output(cmd, user=run_user, quiet=False)
  File "/usr/lib/python2.6/site-packages/resource_management/libraries/functions/get_user_call_output.py", line 61, in get_user_call_output
    raise Fail(err_msg)
Fail: Execution of 'curl -s 'http://node1:50070/jmx?qry=Hadoop:service=NameNode,name=FSNamesystem' 1>/tmp/tmpZwMQaz 2>/tmp/tmp8u5jXA' returned 7.
2017-03-30 20:52:28,371 - call['hdfs haadmin -ns NameNodeURI -getServiceState nn1'] {'logoutput': True, 'user': 'hdfs'}
17/03/30 20:52:50 INFO ipc.Client: Retrying connect to server: node1/10.10.2.81:8020. Already tried 0 time(s); maxRetries=45
17/03/30 20:53:10 INFO ipc.Client: Retrying connect to server: node1/10.10.2.81:8020. Already tried 1 time(s); maxRetries=45
17/03/30 20:53:30 INFO ipc.Client: Retrying connect to server: node1/10.10.2.81:8020. Already tried 2 time(s); maxRetries=45
...
17/03/30 21:07:11 INFO ipc.Client: Retrying connect to server: node1/10.10.2.81:8020. Already tried 43 time(s); maxRetries=45
17/03/30 21:07:31 INFO ipc.Client: Retrying connect to server: node1/10.10.2.81:8020. Already tried 44 time(s); maxRetries=45
Operation failed: Call From node2/10.10.2.82 to node1:8020 failed on socket timeout exception: org.apache.hadoop.net.ConnectTimeoutException: 20000 millis timeout while waiting for channel to be ready for connect. ch : java.nio.channels.SocketChannel[connection-pending remote=node1/10.10.2.81:8020]; For more details see: http://wiki.apache.org/hadoop/SocketTimeout
2017-03-30 21:07:51,283 - call returned (255, '...the same ipc.Client retry output as above...')
2017-03-30 21:07:51,284 - call['ambari-sudo.sh su hdfs -l -s /bin/bash -c 'curl -s '"'"'http://node2:50070/jmx?qry=Hadoop:service=NameNode,name=FSNamesystem'"'"' 1>/tmp/tmpYw7oMN 2>/tmp/tmpk0tfnh''] {'quiet': False}
2017-03-30 21:07:51,544 - call returned (0, '')
2017-03-30 21:07:51,547 - NameNode HA states: active_namenodes = [], standby_namenodes = [(u'nn2', 'node2:50070')], unknown_namenodes = [(u'nn1', 'node1:50070')]
2017-03-30 21:07:51,547 - Will retry 4 time(s), caught exception: No active NameNode was found.. Sleeping for 5 sec(s)

This sequence keeps repeating until the start script times out.
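For what it's worth, curl's exit code 7 means it could not connect to the host at all, so the same probe can be run by hand with the command from the log:

curl -s 'http://node1:50070/jmx?qry=Hadoop:service=NameNode,name=FSNamesystem'; echo $?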
Created 03-31-2017 11:59 PM
@Vinay Khandelwal - When you shut down the machine, the NameNode along with the ZKFC server will go down. The other NameNode will automatically fail over and become the Active NameNode. There is no restart required here.
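If you want to see the election from ZooKeeper's side, you can inspect the lock znode that ZKFC creates (a quick check; replace <nameservice> with your nameservice ID, and the zkCli.sh path assumes a default HDP layout):

/usr/hdp/current/zookeeper-client/bin/zkCli.sh -server node2:2181
ls /hadoop-ha/<nameservice>
get /hadoop-ha/<nameservice>/ActiveStandbyElectorLock

The ephemeral ActiveStandbyElectorLock node is held by whichever NameNode won the election, so after the failover it should point at the surviving node.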
The other question I have for you is how many DataNodes you have on your 3-node cluster. Was there a DataNode running on the host you shut down as well?
Created 04-03-2017 06:28 AM
@Namit Maheshwari - Yeah, in the default case it wouldn't require a restart. But I want to restart it manually: I actually have two instances of the History Server and manage them using an external monitoring service, so I need to restart at least the History Server when the node running it goes down.
I have two DataNodes in the cluster, and yes, one is running on the host which I want to shut down.
Created 04-03-2017 11:48 PM
Can you try restarting the NameNode and ZKFC from the command line:
su -l hdfs -c "/usr/hdp/current/hadoop-hdfs-namenode/../hadoop/sbin/hadoop-daemon.sh stop namenode"
su -l hdfs -c "/usr/hdp/current/hadoop-hdfs-namenode/../hadoop/sbin/hadoop-daemon.sh stop zkfc"
su -l hdfs -c "/usr/hdp/current/hadoop-hdfs-namenode/../hadoop/sbin/hadoop-daemon.sh start zkfc"
su -l hdfs -c "/usr/hdp/current/hadoop-hdfs-namenode/../hadoop/sbin/hadoop-daemon.sh start namenode"
And then paste any error you get.
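If the daemons fail to come up, the NameNode and ZKFC logs should be under /var/log/hadoop/hdfs/ on a default HDP install (assuming you haven't changed the log directory), e.g.:

tail -n 100 /var/log/hadoop/hdfs/hadoop-hdfs-namenode-$(hostname).log
tail -n 100 /var/log/hadoop/hdfs/hadoop-hdfs-zkfc-$(hostname).log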