
Node manager restart fails during upgrade / downgrade between 2.3.4 and 2.4.0

New Contributor

We are upgrading HDP from 2.3.4 to 2.4.0 by following the instructions in the link below:

https://docs.hortonworks.com/HDPDocuments/Ambari-2.2.1.1/bk_upgrading_Ambari/content/_upgrade_ambari...

All the steps in the upgrade document up to “4.2 Perform express upgrade” have been completed successfully.

During the express upgrade, the step “Restarting NodeManager on 2 hosts” fails on one host and succeeds on the other. I tried to downgrade, but the downgrade also failed at the same step:

>>

On host 1:

[yarn@node1 ~]$ yarn node -list -states=RUNNING

16/04/21 13:49:25 INFO impl.TimelineClientImpl: Timeline service address: http://node2.domain.net:8188/ws/v1/timeline/

16/04/21 13:49:25 INFO client.RMProxy: Connecting to ResourceManager at node2.domain.net/13.111.111.11:8050

Total Nodes:2

Node-Id Node-State Node-Http-Address Number-of-Running-Containers

node1:45454 RUNNING node1:8042 0

node2:45454 RUNNING node2:8042 0

Below is the error message I see in the error log:

resource_management.core.exceptions.Fail: NodeManager with ID node1.domain.net:45454 was not found in the list of running NodeManagers

On host 2:

[yarn@node2 sbin]$ yarn node -list -states=RUNNING

16/04/21 13:49:35 INFO impl.TimelineClientImpl: Timeline service address: http://node2.domain.net:8188/ws/v1/timeline/

16/04/21 13:49:35 INFO client.RMProxy: Connecting to ResourceManager at node2.domain.net/13.111.111.11:8050

Total Nodes:2

Node-Id Node-State Node-Http-Address Number-of-Running-Containers

node1:45454 RUNNING node1:8042 0

node2:45454 RUNNING node2:8042 0

No errors were reported while restarting the NodeManager on this server.

<<

The NodeManager status looks exactly the same on both nodes, but I am not sure why the restart status check fails on one node and not on the other.

How can I fix this issue?

node1-downgrade-log.txt
node2-downgrade-log.txt

1 ACCEPTED SOLUTION

Master Guru
@selvanand panneerselvam

I checked the attached txt file and noticed that the check is looking for the NodeManager FQDN with RPC port 45454; see the log line below:

node1.domain.net:45454 was not found in the list of running NodeManagers

When you run the yarn node -list -states=RUNNING command, I see the output has short hostnames without the FQDN.

Can you please check yarn.nodemanager.address?

Checking NM logs should give us a hint.
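For illustration, a minimal Python sketch of the mismatch (this is not the actual Ambari check, just the idea): the upgrade script looks for the FQDN-based node ID, while the yarn node -list output above only contains short hostnames.

# Hypothetical snippet, not Ambari code: simulate the membership test the check performs.
yarn_output = "node1:45454 RUNNING node1:8042 0\nnode2:45454 RUNNING node2:8042 0"
print("node1.domain.net:45454" in yarn_output)  # False -> "was not found in the list of running NodeManagers"
print("node1:45454" in yarn_output)             # True  -> what the output actually lists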


8 REPLIES

Guru

Can you post the NodeManager logs for node1 and node2 as well?


New Contributor

@Kuldeep Kulkarni Thanks for looking into the issue. yarn.nodemanager.address is set to the default 0.0.0.0 on both nodes, and `hostname` returns the short hostname on both nodes. I worked around the issue by hardcoding the hostname variable to the short hostname at line 66 of nodemanager_upgrade.py, and the downgrade then moved ahead and completed fine. I then upgraded to 2.4.0, and that too completed fine. I am not sure if this workaround has any side effects, but smoke testing of the cluster post-upgrade was successful. I am still wondering how the NodeManager restart on node2 succeeded the first time, since on node2 the output of "yarn node -list -states=RUNNING" also returned hostnames without the FQDN while the upgrade script was looking for the host with the FQDN.
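For illustration only, the hardcoded workaround looked roughly like this (the exact content of line 66 may differ between Ambari versions, so treat this as a sketch rather than the real script):

hostname = "node1"  # hardcoded short hostname instead of the FQDN the script derives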

Master Guru

@selvanand panneerselvam

Can you please check the NM logs on both NodeManagers and let me know if you find anything in there?


I faced the same issue in a 2.3.2 to 2.5 upgrade, where the NodeManager check failed on one node and went fine on the other nodes, and I used the same workaround. Thanks.

New Contributor

@Kuldeep Kulkarni @Ravi Mutyala

I am not seeing any error messages related to this issue in the NodeManager log files. nodemanagerlogs.zip

New Contributor

This is a bug in Ambari. You can fix it by patching the upgrade script directly. (Posting my solution here after suffering from this myself.) Edit /var/lib/ambari-agent/cache/common-services/YARN/your_YARN_version/package/scripts/nodemanager_upgrade.py on your NodeManager hosts:

At the top of the file with the other imports (line 20?), add:

import re

After line 65, add:

hostname_short = re.findall(r'(^\w+)\.', hostname)[0]

Change line 71 to the following:

if hostname in yarn_output or nodemanager_address in yarn_output or hostname_ip in yarn_output or hostname_short in yarn_output:

The upgrade will now properly check for short hostnames when you hit "Retry".
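For reference, here is roughly how the patched section of nodemanager_upgrade.py reads after these edits. Only the three lines quoted above come from the script; the function wrapper and everything else in this Python sketch are assumptions added to make it self-contained, not the actual Ambari source.

import re

def nodemanager_is_listed(hostname, nodemanager_address, hostname_ip, yarn_output):
    # Added after line 65: derive the short hostname from the FQDN,
    # e.g. "node1.domain.net" -> "node1" (assumes a dot-separated FQDN).
    hostname_short = re.findall(r'(^\w+)\.', hostname)[0]
    # Line 71, extended with the short-hostname check:
    if hostname in yarn_output or nodemanager_address in yarn_output or hostname_ip in yarn_output or hostname_short in yarn_output:
        return True
    return False

For example, with hostname = "node1.domain.net" and a yarn_output that lists only short names such as "node1:45454" (as in the output above), the FQDN test fails but the new hostname_short test matches, so the check passes.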

Explorer

Thanks Jeff, this worked to help me upgrade from HDP 2.6.4.0 -> 2.6.5.0