
NodeManager restart fails during upgrade/downgrade between HDP 2.3.4 and 2.4.0

New Contributor

We are upgrading HDP from 2.3.4 to 2.4.0 by following the instructions in the link below:

https://docs.hortonworks.com/HDPDocuments/Ambari-2.2.1.1/bk_upgrading_Ambari/content/_upgrade_ambari...

All the steps in the upgrade document up to “4.2 Perform express upgrade” have been completed successfully.

During the express upgrade, the step “Restarting NodeManager on 2 hosts” fails on one host and succeeds on the other. I tried to downgrade, but the downgrade also failed at the same step:

>>

On host 1:

[yarn@node1 ~]$ yarn node -list -states=RUNNING

16/04/21 13:49:25 INFO impl.TimelineClientImpl: Timeline service address: http://node2.domain.net:8188/ws/v1/timeline/

16/04/21 13:49:25 INFO client.RMProxy: Connecting to ResourceManager at node2.domain.net/13.111.111.11:8050

Total Nodes:2

Node-Id Node-State Node-Http-Address Number-of-Running-Containers

node1:45454 RUNNING node1:8042 0

node2:45454 RUNNING node2:8042 0

Below is the error message I see in the error log:

resource_management.core.exceptions.Fail: NodeManager with ID node1.domain.net:45454 was not found in the list of running NodeManagers

On host 2:

[yarn@node2 sbin]$ yarn node -list -states=RUNNING

16/04/21 13:49:35 INFO impl.TimelineClientImpl: Timeline service address: http://node2.domain.net:8188/ws/v1/timeline/

16/04/21 13:49:35 INFO client.RMProxy: Connecting to ResourceManager at node2.domain.net/13.111.111.11:8050

Total Nodes:2

Node-Id Node-State Node-Http-Address Number-of-Running-Containers

node1:45454 RUNNING node1:8042 0

node2:45454 RUNNING node2:8042 0

No errors were reported while restarting the NodeManager on this server.

<<

The NodeManager status looks exactly the same on both nodes, but I am not sure why the restart status check fails on one node and not on the other.

How can I fix this issue?

Attachments: node1-downgrade-log.txt, node2-downgrade-log.txt


8 REPLIES

Guru

Can you post the NodeManager logs for node1 and node2 as well?

Master Guru (Accepted Solution)
@selvanand panneerselvam

I checked the attached txt file and noticed that it is looking for the NodeManager FQDN with RPC port 45454; see the log line below:

node1.domain.net:45454 was not found in the list of running NodeManagers

When you run the yarn node -list -states=RUNNING command, I see the output has short hostnames without the FQDN.

Can you please check yarn.nodemanager.address?

Checking NM logs should give us a hint.
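
For illustration only, here is a minimal sketch of the kind of check the upgrade script performs (this is not the actual Ambari code; the hostname, port, and error text are taken from the output and log above): it builds an FQDN:port string and searches for it in the yarn node -list output, so short hostnames in that output make the check fail even though the NodeManager is up.

import subprocess

# Hypothetical sketch of the restart verification that produces the error above.
expected_id = 'node1.domain.net:45454'  # FQDN + NM RPC port the script looks for
yarn_output = subprocess.check_output(
    ['yarn', 'node', '-list', '-states=RUNNING']).decode()
# The real output only contains 'node1:45454' (short hostname), so this fails:
if expected_id not in yarn_output:
    raise Exception('NodeManager with ID %s was not found in the list of '
                    'running NodeManagers' % expected_id)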

New Contributor

@Kuldeep Kulkarni Thanks for looking into the issue. yarn.nodemanager.address is set to the default 0.0.0.0 on both nodes, and `hostname` returns the short hostname on both nodes. I worked around the issue by hardcoding the hostname variable to the short hostname at line# 66 of nodemanager_upgrade.py, and the downgrade then moved ahead and completed fine. I then tried upgrading to 2.4.0, and that too completed fine. I am not sure whether this workaround has any side effects, but smoke testing of the cluster after the upgrade was successful. I am still wondering how the NodeManager restart on node2 succeeded the first time, since on node2 the output of "yarn node -list -states=RUNNING" also returned hostnames without the FQDN while the upgrade script was looking for the host with its FQDN.
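
(For reference, the workaround described above amounts to a one-line hardcode along these lines; this is a rough sketch only, since the exact variable and line number depend on the Ambari version, and node1 stands in for the node's real short hostname.)

# Hypothetical illustration of the temporary hardcode in nodemanager_upgrade.py:
# the original line resolves the FQDN (e.g. node1.domain.net); replace it with
# the short name that actually appears in the yarn node -list output.
hostname = 'node1'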

Master Guru

@selvanand panneerselvam

Can you please check the NM logs on both NodeManagers and let me know if you find anything in there?


I faced the same issue in a 2.3.2 to 2.5 upgrade, where the NodeManager check failed on one node and went fine on the other nodes, and I used the same workaround. Thanks.

New Contributor

@Kuldeep Kulkarni @Ravi Mutyala

I am not seeing error messages related to this issue in the NodeManager log files. Attachment: nodemanagerlogs.zip

New Contributor

This is a bug in Ambari. You can fix it by patching the upgrade script directly. (Posting here with my solution after suffering from this myself.) Edit /var/lib/ambari-agent/cache/common-services/YARN/your_YARN_version/package/scripts/nodemanager_upgrade.py on your NodeManager hosts:

At the top of the file with the other imports (line 20?), add:

import re

After line 65, add:

hostname_short = re.findall(r'(^\w+)\.', hostname)[0]

Change line 71 to the following:

if hostname in yarn_output or nodemanager_address in yarn_output or hostname_ip in yarn_output or hostname_short in yarn_output:

The upgrade will now properly check for short hostnames when you hit "Retry".
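
Put together, the patched check amounts to something like the sketch below (a sketch only, not the exact Ambari code; variable names and line numbers vary between Ambari versions, and a fallback is added here for the case where the hostname already has no dot).

import re

def nodemanager_registered(hostname, hostname_ip, nodemanager_address, yarn_output):
    # Sketch of the patched check: accept the FQDN, the IP, the configured
    # address, or the short hostname when scanning `yarn node -list` output.
    # Added: derive the short hostname, e.g. 'node1' from 'node1.domain.net';
    # fall back to the full name if it contains no dot.
    hostname_short = re.findall(r'(^\w+)\.', hostname)[0] if '.' in hostname else hostname
    # Changed: also accept the short hostname.
    return (hostname in yarn_output or nodemanager_address in yarn_output
            or hostname_ip in yarn_output or hostname_short in yarn_output)

# With the output shown earlier in this thread:
# nodemanager_registered('node1.domain.net', '13.111.111.11', '0.0.0.0:45454',
#                        'node1:45454 RUNNING node1:8042 0')  -> True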

Explorer

Thanks Jeff, this worked and helped me upgrade from HDP 2.6.4.0 to 2.6.5.0.