Created 04-21-2016 07:06 PM
We are upgrading HDP from 2.3.4 to 2.4.0 by following the instructions in the link below:
All the steps in the upgrade document till “4.2 Perform express upgrade” have been completed successfully.
During the express upgrade, the step “Restarting NodeManager on 2 hosts” fails on one host and succeeds on the other. I tried to downgrade, but the downgrade also failed at the same step:
>>
On host 1:
[yarn@node1 ~]$ yarn node -list -states=RUNNING
16/04/21 13:49:25 INFO impl.TimelineClientImpl: Timeline service address: http://node2.domain.net:8188/ws/v1/timeline/
16/04/21 13:49:25 INFO client.RMProxy: Connecting to ResourceManager at node2.domain.net/13.111.111.11:8050
Total Nodes:2
Node-Id Node-State Node-Http-Address Number-of-Running-Containers
node1:45454 RUNNING node1:8042 0
node2:45454 RUNNING node2:8042 0
Below is the error message I see in the error log:
resource_management.core.exceptions.Fail: NodeManager with ID node1.domain.net:45454 was not found in the list of running NodeManagers
On host 2:
[yarn@node2 sbin]$ yarn node -list -states=RUNNING
16/04/21 13:49:35 INFO impl.TimelineClientImpl: Timeline service address: http://node2.domain.net:8188/ws/v1/timeline/
16/04/21 13:49:35 INFO client.RMProxy: Connecting to ResourceManager at node2.domain.net/13.111.111.11:8050
Total Nodes:2
Node-Id Node-State Node-Http-Address Number-of-Running-Containers
node1:45454 RUNNING node1:8042 0
node2:45454 RUNNING node2:8042 0
No errors were reported while restarting the NodeManager on this server.
<<
The NodeManager status looks exactly the same on both nodes, so I am not sure why the restart status check fails on one node and not on the other.
How can I fix this issue?
Created 04-22-2016 06:20 AM
I checked the attached txt file and noticed that the check is looking for the NodeManager FQDN with RPC port 45454; see the log line below:
node1.domain.net:45454 was not found in the list of running NodeManagers
When you run the yarn node -list -states=RUNNING command, the output shows short hostnames without the FQDN.
Can you please check yarn.nodemanager.address?
Checking the NM logs should give us a hint.
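To illustrate the mismatch described above, here is a minimal Python sketch. It is not the Ambari script itself: the function name found_in_output is hypothetical, and the sample text is copied from the yarn node -list output posted in the question. It only shows that a plain substring check for the FQDN-based ID cannot succeed when the list contains short hostnames.

```python
# Node lines as posted in the question (short hostnames, no domain).
yarn_output = """node1:45454 RUNNING node1:8042 0
node2:45454 RUNNING node2:8042 0"""


def found_in_output(node_id, output):
    """Hypothetical stand-in for a plain substring check on a NodeManager ID."""
    return node_id in output


# The ID from the error message uses the FQDN, so it is not found.
print(found_in_output("node1.domain.net:45454", yarn_output))  # False
# The short form is what the ResourceManager actually reports.
print(found_in_output("node1:45454", yarn_output))  # True
```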
Created 04-21-2016 08:08 PM
Can you post the NodeManager logs for node1 and node2 as well?
Created 04-22-2016 06:46 AM
@Kuldeep Kulkarni Thanks for looking into the issue. yarn.nodemanager.address is set to the default 0.0.0.0 on both nodes, and `hostname` returns the short hostname on both nodes.
I worked around the issue by hardcoding the hostname variable with the short hostname at line 66 of nodemanager_upgrade.py, after which the downgrade moved ahead and completed fine. I then tried upgrading to 2.4.0, and that completed fine too. I am not sure whether this workaround has any side effects, but smoke testing of the cluster after the upgrade was successful.
I am still wondering why the NodeManager on node2 succeeded the first time, since on node2 the output of "yarn node -list -states=RUNNING" also showed the hostnames without the FQDN while the upgrade script was looking for the host by FQDN.
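For anyone comparing nodes, a small diagnostic along these lines can be run on each host to see which hostname forms Python reports. This is only a sketch under assumptions: the helper name hostname_forms is hypothetical, and the upgrade script may obtain its hostname from a different source than these socket calls, so treat the output as a hint rather than proof.

```python
import socket


def hostname_forms():
    """Return the hostname forms this machine reports (values differ per host)."""
    name = socket.gethostname()
    return {
        "gethostname": name,            # what the OS reports as the host name
        "getfqdn": socket.getfqdn(),    # fully qualified name, if resolvable
        "short": name.split(".")[0],    # part before the first dot
    }


for key, value in sorted(hostname_forms().items()):
    print("%s: %s" % (key, value))
```

If the forms differ between node1 and node2, that difference is a reasonable place to start looking.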
Created 04-22-2016 06:50 AM
Can you please check NM logs on both the NMs and let me know if you find something in there.
Created 09-23-2016 04:55 AM
I faced the same issue during a 2.3.2 to 2.5 upgrade, where the NodeManager check failed on one node and went fine on the other nodes, and I used the same workaround. Thanks.
Created 04-22-2016 07:23 AM
@Kuldeep Kulkarni @Ravi Mutyala
I am not seeing any error messages relating to this issue in the NodeManager log files. nodemanagerlogs.zip
Created 06-08-2018 07:58 PM
This is a bug in Ambari. You can fix it by patching the upgrade script directly. (Posting here with my solution after suffering from this myself.) Edit /var/lib/ambari-agent/cache/common-services/YARN/your_YARN_version/package/scripts/nodemanager_upgrade.py on your NodeManager hosts:
At the top of the file with the other imports (line 20?), add:
import re
After line 65, add:
hostname_short = re.findall(r'(^\w+)\.', hostname)[0]
Change line 71 to the following:
if hostname in yarn_output or nodemanager_address in yarn_output or hostname_ip in yarn_output or hostname_short in yarn_output:
The upgrade will now properly check for short hostnames when you hit "Retry".
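The patched condition can be sketched as a standalone function for testing outside Ambari. The function names short_name and nodemanager_listed and the sample values below are hypothetical, not part of the Ambari script. One assumption-driven change from the patch above: the short name is taken with split instead of the regular expression, because re.findall(r'(^\w+)\.', hostname)[0] raises IndexError when the hostname has no dot or has a hyphen before the first dot.

```python
def short_name(hostname):
    """Return the part before the first dot; a name without dots is unchanged."""
    return hostname.split(".", 1)[0]


def nodemanager_listed(yarn_output, hostname, nodemanager_address, hostname_ip):
    """Hypothetical standalone version of the patched substring check."""
    candidates = [hostname, nodemanager_address, hostname_ip, short_name(hostname)]
    # Skip empty values so an unset field never matches by accident.
    return any(c and c in yarn_output for c in candidates)


# Sample values for illustration only (the IP address is made up).
sample_output = "node1:45454 RUNNING node1:8042 0\nnode2:45454 RUNNING node2:8042 0"
print(nodemanager_listed(sample_output, "node1.domain.net",
                         "0.0.0.0:45454", "13.111.111.10"))  # True
```

Note that this is still a loose substring match, as in the original check: a short name such as node1 would also match a line for node10.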
Created 10-25-2018 01:54 PM
Thanks Jeff, this helped me upgrade from HDP 2.6.4.0 to 2.6.5.0.