Archives of Support Questions (Read Only)

This board is archived and read-only for historical reference. Information and links may no longer be available or relevant. To ask a new question, please post a new topic on the appropriate active board.

Node manager restart fails during upgrade / downgrade between 2.3.4 and 2.4.0


We are upgrading HDP from 2.3.4 to 2.4.0 by following the instructions in the link below:

https://docs.hortonworks.com/HDPDocuments/Ambari-2.2.1.1/bk_upgrading_Ambari/content/_upgrade_ambari...

All the steps in the upgrade document up to “4.2 Perform express upgrade” have been completed successfully.

During the express upgrade, the step “Restarting NodeManager on 2 hosts” fails on one host and succeeds on the other. I tried to downgrade, but the downgrade also failed at the same step:

>>

On host 1:

[yarn@node1 ~]$ yarn node -list -states=RUNNING

16/04/21 13:49:25 INFO impl.TimelineClientImpl: Timeline service address: http://node2.domain.net:8188/ws/v1/timeline/

16/04/21 13:49:25 INFO client.RMProxy: Connecting to ResourceManager at node2.domain.net/13.111.111.11:8050

Total Nodes:2

Node-Id Node-State Node-Http-Address Number-of-Running-Containers

node1:45454 RUNNING node1:8042 0

node2:45454 RUNNING node2:8042 0

Below is the error message I see in the error log:

resource_management.core.exceptions.Fail: NodeManager with ID node1.domain.net:45454 was not found in the list of running NodeManagers

On host 2:

[yarn@node2 sbin]$ yarn node -list -states=RUNNING

16/04/21 13:49:35 INFO impl.TimelineClientImpl: Timeline service address: http://node2.domain.net:8188/ws/v1/timeline/

16/04/21 13:49:35 INFO client.RMProxy: Connecting to ResourceManager at node2.domain.net/13.111.111.11:8050

Total Nodes:2

Node-Id Node-State Node-Http-Address Number-of-Running-Containers

node1:45454 RUNNING node1:8042 0

node2:45454 RUNNING node2:8042 0

No errors were reported while restarting the NodeManager on this server.

<<

The NodeManager status looks exactly the same on both nodes, but I am not sure why the restart status check fails on one node and not the other.

How can I fix this issue?

Attachments: node1-downgrade-log.txt, node2-downgrade-log.txt

8 REPLIES

Guru

Can you post the NodeManager logs for node1 and node2 as well?

Master Guru
@selvanand panneerselvam

I checked the attached txt file and noticed that it is looking for the NM FQDN with RPC port 45454; see the log below:

node1.domain.net:45454 was not found in the list of running NodeManagers

When you run the yarn node -list -states=RUNNING command, I see the output has short hostnames without the FQDN.

Can you please check yarn.nodemanager.address?

Checking the NM logs should give us a hint.
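To illustrate why the check can fail even though both NodeManagers are RUNNING, here is a minimal sketch of the kind of substring check the upgrade script performs. The values and logic here are illustrative only, not the actual Ambari code:

# Illustrative only -- the real check lives in Ambari's nodemanager_upgrade.py.
yarn_output = """
node1:45454 RUNNING node1:8042 0
node2:45454 RUNNING node2:8042 0
"""

expected_id = "node1.domain.net:45454"  # FQDN plus RPC port the script searches for

# The FQDN-based ID is never a substring of output that lists only short
# hostnames, so the restart verification fails even though the NM is RUNNING.
print(expected_id in yarn_output)                            # False
print(expected_id.split('.')[0] + ":45454" in yarn_output)   # True: the short name matches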


@Kuldeep Kulkarni Thanks for looking into the issue. yarn.nodemanager.address is set to the default 0.0.0.0 on both nodes, and `hostname` returns the short hostname on both nodes. I worked around the issue by hardcoding the hostname variable with the short hostname at line 66 of nodemanager_upgrade.py, and the downgrade then completed fine. I then tried upgrading to 2.4.0, and that too completed fine. I am not sure if this workaround has any side effects, but smoke testing of the cluster post-upgrade was successful. I am still wondering how the NodeManager restart on node2 succeeded the first time, since on node2 as well the output of "yarn node -list -states=RUNNING" returned hostnames without the FQDN while the upgrade script was looking for the host with the FQDN.
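For reference, a hypothetical sketch of the kind of one-line workaround described above. The variable name hostname and the line number come from the post; the exact surrounding code differs between Ambari versions:

# Hypothetical sketch of the workaround in nodemanager_upgrade.py: replace the
# FQDN-based value the script searches for with the short hostname that
# `yarn node -list -states=RUNNING` actually reports.
# hostname = "node1.domain.net"    # roughly what the script resolves by default
hostname = "node1"                 # hardcoded short hostname (the workaround)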

Master Guru

@selvanand panneerselvam

Can you please check the NM logs on both NodeManagers and let me know if you find anything in there?

New Member

I faced the same issue in a 2.3.2 to 2.5 upgrade, where the NodeManager check failed on one node and went fine on the other nodes, and I used the same workaround. Thanks.


@Kuldeep Kulkarni @Ravi Mutyala

I am not seeing any error messages related to this issue in the NodeManager log files. Attachment: nodemanagerlogs.zip

New Member

This is a bug in Ambari. You can fix it by patching the upgrade script directly. (Posting here with my solution after suffering from this myself.) Edit /var/lib/ambari-agent/cache/common-services/YARN/your_YARN_version/package/scripts/nodemanager_upgrade.py on your NodeManager hosts:

At the top of the file with the other imports (line 20?), add:

import re

After line 65, add:

hostname_short = re.findall(r'(^\w+)\.', hostname)[0]

Change line 71 to the following:

if hostname in yarn_output or nodemanager_address in yarn_output or hostname_ip in yarn_output or hostname_short in yarn_output:

The upgrade will now properly check for short hostnames when you hit "Retry".
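Put together, the patched section would look roughly like the sketch below. The variable names (yarn_output, nodemanager_address, hostname_ip, hostname) are taken from the if-statement above; the values assigned here are illustrative stand-ins, since the real script builds them from the configuration and from the yarn command output:

import re  # added near the top of nodemanager_upgrade.py with the other imports

# Illustrative stand-ins for values the real script already has at this point.
hostname = "node1.domain.net"
nodemanager_address = "0.0.0.0:45454"
hostname_ip = "192.0.2.10:45454"
yarn_output = "node1:45454 RUNNING node1:8042 0\nnode2:45454 RUNNING node2:8042 0"

# Added after the existing hostname handling: derive the short hostname.
# Note the regex assumes a dotted FQDN; for a bare short name, findall()
# returns an empty list and the [0] index would raise IndexError.
hostname_short = re.findall(r'(^\w+)\.', hostname)[0]  # -> 'node1'

# Modified check: also accept the short hostname reported by yarn node -list.
if hostname in yarn_output or nodemanager_address in yarn_output \
        or hostname_ip in yarn_output or hostname_short in yarn_output:
    print("NodeManager found in the list of running NodeManagers")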

New Member

Thanks Jeff, this worked and helped me upgrade from HDP 2.6.4.0 -> 2.6.5.0.