
NodeManagers go down after a few minutes in HDPCA AWS Instance for no reason


New Contributor

Hi guys,

I'm playing around with the HDPCA 2.3 AWS instance and ran into some issues when adding node1.

I just installed the clients, and for no explicable reason Ambari reports all 3 NodeManagers as down.

When I restart them, they are reported "running" for a few minutes and become red again.

yarn node -list says all three are running.

Same for the ResourceManager Web UI.

The alert is about the nodemanager web service on port 8042.
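Since the ResourceManager reports the nodes as running while the alert concerns the web service, it can help to probe that web endpoint directly. Below is a minimal sketch, assuming plain HTTP, the port 8042 mentioned above, and that curl is installed; the function name nm_web_status is made up for illustration, and the path /ws/v1/node/info is the NodeManager REST "node info" endpoint.

```shell
#!/usr/bin/env bash
# Sketch: probe the NodeManager web service that the alert refers to.
# Assumptions (not from the thread): plain HTTP, curl available.

# Print "up" if the NodeManager REST endpoint answers with HTTP 200, else "down".
# $1 = host, $2 = port (defaults to 8042)
nm_web_status() {
  local host="$1" port="${2:-8042}" code
  # -w '%{http_code}' prints only the status code; a failed connection yields 000.
  code="$(curl -s -o /dev/null -w '%{http_code}' --max-time 5 \
          "http://${host}:${port}/ws/v1/node/info")" || code="000"
  if [ "$code" = "200" ]; then
    echo "up"
  else
    echo "down"
  fi
}

# Example: log the state every 10 seconds to see when it flips.
#   while true; do echo "$(date '+%F %T') $(nm_web_status node1)"; sleep 10; done
```

Logging the state with a timestamp makes it easier to line up the moment the web service disappears with entries in the NodeManager log.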

After trying this with a new instance and having the same problem, I started my very own HDP installation on 6 vanilla CentOS instances. At some point, I had the same issues.

I have no idea what the reason might be or where I can look for a deeper analysis.

Any help would be much appreciated.

Thanks and bye,

Chris

11 REPLIES 11

Re: NodeManagers go down after a few minutes in HDPCA AWS Instance for no reason

Hey @Chris K!
Could you check the logs, and share with us any error/warn/fatal msg?

/var/log/hadoop-yarn/yarn/yarn-yarn-nodemanager-<host_nodemanager>.log

Just in case, check whether any process is listening on port 8042.

netstat -tulpn | grep 8042
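A plain grep 8042 also matches other ports such as 18042, or a PID that happens to be 8042. Here is a small sketch of a stricter check, assuming the usual netstat -tulpn column layout (local address in column 4, state in column 6); the function name port_is_listening is hypothetical.

```shell
#!/usr/bin/env bash
# Sketch: succeed only if the netstat output on stdin contains a LISTEN
# socket whose local port is exactly the one given.
# Assumption: `netstat -tulpn` column layout (local address = $4, state = $6).
port_is_listening() {
  local port="$1"
  awk -v port="$port" '
    $6 == "LISTEN" {
      # The port is whatever follows the last ":" in the local address,
      # which works for both "0.0.0.0:8042" and ":::8042".
      n = split($4, parts, ":")
      if (parts[n] == port) found = 1
    }
    END { exit(found ? 0 : 1) }
  '
}

# Example:
#   netstat -tulpn | port_is_listening 8042 && echo "something is listening on 8042"
```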

Hope this helps!


Re: NodeManagers go down after a few minutes in HDPCA AWS Instance for no reason

New Contributor

Hi @Vinicius Higa Murakami,

When starting the NodeManager via Ambari, I get a process listening on 8042 for one second:

[root@resourcemanager ~]# while [ true ]; do sleep 1; netstat -tupln | grep 8042; done
tcp        0      0 :::8042                     :::*                        LISTEN      20422/java

Also, please find attached two grep-results of the log-file while restarting via ambari.

tail -f /var/log/hadoop-yarn/yarn/yarn-yarn-nodemanager-resourcemanager.log | grep -i 'error\|warn\|fatal\|severe' > /tmp/nodemanager1.log
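A side note on the command above: when grep writes to a file instead of a terminal, it buffers its output, so the target file can stay empty for a while even though matching lines have already been logged. A sketch of a line-buffered variant, assuming GNU grep (which supports --line-buffered); the function name filter_severity is made up for illustration.

```shell
#!/usr/bin/env bash
# Sketch: filter log lines by severity keyword, flushing after every match
# so that a redirected output file fills up while `tail -f` is still running.
# Assumption: GNU grep, which provides --line-buffered.
filter_severity() {
  grep -i -E --line-buffered 'error|warn|fatal|severe'
}

# Example, with the paths used in this thread:
#   tail -f /var/log/hadoop-yarn/yarn/yarn-yarn-nodemanager-resourcemanager.log \
#     | filter_severity > /tmp/nodemanager1.log
```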

Thank you so much for your support!

Chris


Re: NodeManagers go down after a few minutes in HDPCA AWS Instance for no reason

New Contributor

Maybe that helps:

80428-ambari.png

80429-resourcemanager-ui.png

No process listening on port 8042 though...


Re: NodeManagers go down after a few minutes in HDPCA AWS Instance for no reason

Hi @Chris K!
My guess is that the sticky bit is missing on your /app-logs directory.
Could you try to run the following commands?

[hdfs@node2 ~]$ hdfs dfs -ls -d /app-logs
[hdfs@node2 ~]$ hdfs dfs -ls -h /app-logs
#Adding the sticky bit
hdfs dfs -chmod +t /app-logs
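To see whether the sticky bit is already set before changing anything, the permission string printed by hdfs dfs -ls -d can be inspected: the bit shows up as t (or T) in the last permission position, e.g. drwxrwxrwt. A rough sketch of that check; the function name has_sticky_bit is hypothetical and the sample listings in the comments are illustrative, not taken from this cluster.

```shell
#!/usr/bin/env bash
# Sketch: read `hdfs dfs -ls -d <path>` output on stdin and succeed if the
# sticky bit is set, i.e. the 10th character of the mode string is t or T.
# Illustrative listings (not from this cluster):
#   drwxrwxrwt   - yarn hadoop   0 2018-07-01 10:00 /app-logs   -> sticky bit set
#   drwxrwxrwx   - yarn hadoop   0 2018-07-01 10:00 /app-logs   -> sticky bit missing
has_sticky_bit() {
  awk '
    # Only look at the first line that starts with a mode string.
    $1 ~ /^[d-]/ && !seen {
      seen = 1
      flag = substr($1, 10, 1)
      if (flag == "t" || flag == "T") ok = 1
    }
    END { exit(ok ? 0 : 1) }
  '
}

# Example: add the sticky bit only when it is missing.
#   hdfs dfs -ls -d /app-logs | has_sticky_bit || hdfs dfs -chmod +t /app-logs
```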

Hope this helps!


Re: NodeManagers go down after a few minutes in HDPCA AWS Instance for no reason

@Vinicius Higa Murakami I also had the same issue and added the sticky bit. It worked for me for some days, but now the NodeManager is going down again.


Re: NodeManagers go down after a few minutes in HDPCA AWS Instance for no reason

Hi @Punit kumar!
Could you share with us your logs?
BTW, I'd kindly ask you to open a new question so we can work in separate threads; that will also make it easier for other HCC users to find the best answer :)

Re: NodeManagers go down after a few minutes in HDPCA AWS Instance for no reason

@Vinicius Higa Murakami After fixing the sticky bit error, I was again unable to start the NodeManager, and there was no error message in the log. I then tried to start it again, and this time a container failed; I have the log of that. I also have other development Hadoop clusters in AWS that were working previously, but now the NodeManager is going down in every cluster.


Re: NodeManagers go down after a few minutes in HDPCA AWS Instance for no reason

New Contributor

Hi @Vinicius Higa Murakami,

I was absent for a few days due to a business trip.

I just started my AWS instance to try your sticky bit tip, but HDP started without any errors...?

I didn't do anything with it in the past days so I have no idea what's going on.

Could it be that these errors occur because of network issues during "rush hours" in AWS?

I'll keep an eye on it over the weekend...

Thanks!

Chris


Re: NodeManagers go down after a few minutes in HDPCA AWS Instance for no reason

Gotcha @Chris K! Well good to know that's working now :)
As for the issue, are you using spot instances? Yeah, it's kinda strange for this to happen so suddenly; let us know if this mystery shows up again. Then we can take a look at your timeout configs and look for race condition issues as well.

The following JIRA explains something quite similar to your case.

https://issues.apache.org/jira/browse/YARN-5112
Hope this helps!
