About selvanand_panne

selvanand_panne · ‎04-22-2016

@Kuldeep Kulkarni @Ravi Mutyala I am not seeing any error messages related to this issue in the NodeManager log files. (Attached: nodemanagerlogs.zip)

selvanand_panne · ‎04-22-2016

@Kuldeep Kulkarni Thanks for looking into the issue. yarn.nodemanager.address is set to the default 0.0.0.0 on both nodes, and `hostname` returns the short hostname on both nodes. I worked around the issue by hardcoding the hostname variable to the short hostname at line 66 of nodemanager_upgrade.py, and the downgrade then moved ahead and completed fine. I then tried upgrading to 2.4.0 again, and that also completed fine. I am not sure whether this workaround has any side effects, but smoke testing of the cluster after the upgrade was successful. I am still wondering why the NodeManager on node2 succeeded the first time: on node2, the output of "yarn node -list -states=RUNNING" also returned the hostnames without the FQDN, while the upgrade script was looking for the host with the FQDN.
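The hostname mismatch can be illustrated with a small Python sketch. It assumes (not confirmed as Ambari's actual logic) that the upgrade check builds the expected ID as `<fqdn>:<port>` and looks for it verbatim among the running Node-Ids. The helpers `short_name` and `nodemanager_is_running` are hypothetical; a hostname-tolerant comparison like this is one alternative to hardcoding the short hostname, not a patch for nodemanager_upgrade.py.

```python
from typing import Iterable


def short_name(host: str) -> str:
    """Return the part of a host name before the first dot (node1.domain.net -> node1)."""
    return host.split(".", 1)[0]


def nodemanager_is_running(expected_host: str, port: int, running_node_ids: Iterable[str]) -> bool:
    """Hypothetical check: True if a running Node-Id matches expected_host:port.

    Both sides are reduced to their short host name, so an FQDN on one side
    and a short name on the other still match. Ports must match exactly.
    """
    target = (short_name(expected_host), port)
    for node_id in running_node_ids:
        host, _, node_port = node_id.rpartition(":")
        if not node_port.isdigit():
            continue  # skip malformed entries instead of raising
        if (short_name(host), int(node_port)) == target:
            return True
    return False


running = ["node1:45454", "node2:45454"]  # Node-Ids as reported by the ResourceManager
print("node1.domain.net:45454" in running)                         # False: exact FQDN lookup fails
print(nodemanager_is_running("node1.domain.net", 45454, running))  # True
```

Comparing only the part before the first dot avoids editing the script's hostname variable, but it can produce a false match if two hosts in different domains share the same short name.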

selvanand_panne · ‎04-21-2016

We are upgrading HDP from 2.3.4 to 2.4.0 by following the instructions in the link below:
https://docs.hortonworks.com/HDPDocuments/Ambari-2.2.1.1/bk_upgrading_Ambari/content/_upgrade_ambari.html

All the steps in the upgrade document up to "4.2 Perform express upgrade" completed successfully. During the express upgrade, the step "Restarting NodeManager on 2 hosts" fails on one host and succeeds on the other. I tried to downgrade, but the downgrade also failed at the same step.

On host 1:

    [yarn@node1 ~]$ yarn node -list -states=RUNNING
    16/04/21 13:49:25 INFO impl.TimelineClientImpl: Timeline service address: http://node2.domain.net:8188/ws/v1/timeline/
    16/04/21 13:49:25 INFO client.RMProxy: Connecting to ResourceManager at node2.domain.net/13.111.111.11:8050
    Total Nodes:2
    Node-Id      Node-State  Node-Http-Address  Number-of-Running-Containers
    node1:45454  RUNNING     node1:8042         0
    node2:45454  RUNNING     node2:8042         0

Below is the error message I see in the error log:

    resource_management.core.exceptions.Fail: NodeManager with ID node1.domain.net:45454 was not found in the list of running NodeManagers

On host 2:

    [yarn@node2 sbin]$ yarn node -list -states=RUNNING
    16/04/21 13:49:35 INFO impl.TimelineClientImpl: Timeline service address: http://node2.domain.net:8188/ws/v1/timeline/
    16/04/21 13:49:35 INFO client.RMProxy: Connecting to ResourceManager at node2.domain.net/13.111.111.11:8050
    Total Nodes:2
    Node-Id      Node-State  Node-Http-Address  Number-of-Running-Containers
    node1:45454  RUNNING     node1:8042         0
    node2:45454  RUNNING     node2:8042         0

No errors were reported while restarting the NodeManager on this server.

The NodeManager status looks exactly the same on both nodes, but I am not sure why the restart status check fails on one node and not on the other. How can I fix this issue?

Attachments: node1-downgrade-log.txt, node2-downgrade-log.txt
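A minimal Python sketch of how the Node-Id column could be extracted from the `yarn node -list -states=RUNNING` output posted for host 1, and why an exact lookup of `node1.domain.net:45454` (the ID in the error message) fails against it. The function name `parse_running_node_ids` is hypothetical, and the sample text is copied from the host 1 output; in practice the text would come from running the command itself.

```python
from typing import List

# Output of `yarn node -list -states=RUNNING` as posted for host 1.
SAMPLE_OUTPUT = """\
16/04/21 13:49:25 INFO impl.TimelineClientImpl: Timeline service address: http://node2.domain.net:8188/ws/v1/timeline/
16/04/21 13:49:25 INFO client.RMProxy: Connecting to ResourceManager at node2.domain.net/13.111.111.11:8050
Total Nodes:2
         Node-Id       Node-State  Node-Http-Address  Number-of-Running-Containers
     node1:45454          RUNNING         node1:8042                             0
     node2:45454          RUNNING         node2:8042                             0
"""


def parse_running_node_ids(output: str) -> List[str]:
    """Hypothetical helper: return the Node-Id of every row whose Node-State is RUNNING.

    Log lines, the "Total Nodes" line and the header row are skipped because
    their second whitespace-separated field is never "RUNNING".
    """
    node_ids = []
    for line in output.splitlines():
        fields = line.split()
        if len(fields) >= 2 and fields[1] == "RUNNING":
            node_ids.append(fields[0])
    return node_ids


running = parse_running_node_ids(SAMPLE_OUTPUT)
print(running)                              # ['node1:45454', 'node2:45454']
print("node1.domain.net:45454" in running)  # False: the registered ID uses the short hostname
print("node1:45454" in running)             # True
```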

Last Visited	‎06-21-2016 09:04 PM

Member Since	‎04-21-2016 06:52 PM
Posts	3
Kudos received	2

Cloudera Community

Re: Node manager restart fails during upgrade / do...

Node manager restart fails during upgrade / downgr...