Created on 07-06-2018 09:22 PM - edited 09-16-2022 06:25 AM
Hi guys,
I'm playing around with the HDPCA 2.3 AWS instance and I'm having some issues when adding node1.
I just installed the clients and, for no explainable reason, Ambari alerts that the 3 NodeManagers are down.
When I restart them, they are reported "running" for a few minutes and then turn red again.
yarn node -list says all three are running.
Same for the ResourceManager Web UI.
The alert is about the NodeManager web service on port 8042.
After trying this with a new instance and having the same problem, I started my very own HDP installation on 6 vanilla CentOS instances. At some point, I had the same issues.
I have no idea what the reason might be or where to look for deeper analysis.
Any help would be much appreciated.
Thanks and bye,
Chris
Created 07-06-2018 10:15 PM
Hey @Chris K!
Could you check the logs and share with us any ERROR/WARN/FATAL messages?
/var/log/hadoop-yarn/yarn/yarn-yarn-nodemanager-<host_nodemanager>.log
Just in case, check if there is any PID listening on 8042.
netstat -tulpn | grep 8042
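If netstat isn't available on your box, ss or lsof will report the same thing (run as root so the PID shows up):
ss -tlnp | grep 8042
lsof -i :8042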
Hope this helps!
Created 07-07-2018 12:11 PM
When starting the NodeManager via Ambari, I get a process listening on 8042 for one second:
[root@resourcemanager ~]# while [ true ]; do sleep 1; netstat -tupln | grep 8042; done
tcp        0      0 :::8042                     :::*                        LISTEN      20422/java
Also, please find attached two grep results of the log file, taken while restarting via Ambari.
tail -f /var/log/hadoop-yarn/yarn/yarn-yarn-nodemanager-resourcemanager.log | grep -i 'error\|warn\|fatal\|severe' > /tmp/nodemanager1.log
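Note that tail -f keeps running until you interrupt it; a one-shot grep over the whole file captures the same messages:
grep -i 'error\|warn\|fatal\|severe' /var/log/hadoop-yarn/yarn/yarn-yarn-nodemanager-resourcemanager.log > /tmp/nodemanager1.log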
Thank you so much for your support!
Chris
Created on 07-07-2018 12:49 PM - edited 08-18-2019 02:15 AM
Maybe this helps (see the attached screenshot):
No process listening on port 8042 though...
Created 07-09-2018 04:36 AM
Hi @Chris K!
Guess you're missing the sticky bit on your /app-logs.
Could you try to run the following commands?
[hdfs@node2 ~]$ hdfs dfs -ls -d /app-logs
[hdfs@node2 ~]$ hdfs dfs -ls -h /app-logs
#Adding the sticky bit
hdfs dfs -chmod +t /app-logs
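Once the sticky bit is set, the permissions in the -ls -d output should end with a t; roughly like this (owner and date below are just an example from a default HDP layout):
drwxrwxrwt   - yarn hadoop          0 2018-07-09 04:30 /app-logs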
Hope this helps!
Created 07-10-2018 01:14 PM
@Vinicius Higa Murakami I also hit the same issue and added the sticky bit. It worked for a few days, but now the NodeManager is going down again.
Created 07-10-2018 05:55 PM
Hi @Punit kumar!
Could you share with us your logs?
BTW, I'd kindly ask you to open a new question so we can work in separate threads; that will also make it easier for other HCC users to find the best answer 🙂
Created 07-10-2018 05:59 PM
@Vinicius Higa Murakami After fixing the sticky bit, I was again unable to start the NodeManager, and this time there was no error message in the log. When I tried to start it once more, the container failed (log attached). I also have other development Hadoop clusters in AWS that were working before, but now the NodeManager is going down in every cluster.
Created 07-13-2018 08:13 AM
I was absent a few days due to a business trip.
I just started my AWS instance to check your sticky bit tip, but HDP started without any errors...?
I didn't do anything with it in the past days so I have no idea what's going on.
Could it be that these errors occur because of network issues during "rush hours" in AWS?
I'll keep an eye on it over the weekend...
Thanks!
Chris
Created 07-13-2018 02:21 PM
Gotcha @Chris K! Well, good to know that it's working now 🙂
And about the issue: are you using spot instances? It's kinda strange for this to happen so suddenly; let us know if this mystery shows up again. Then we can take a look at your timeout configs and check for race condition issues as well.
The following JIRA explains something quite similar to your case.
https://issues.apache.org/jira/browse/YARN-5112
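In the meantime, if you want to inspect the alert itself, the Ambari REST API can dump the alert definitions; something along these lines should work (adjust host, credentials and cluster name, and note the exact alert names vary by stack version):
curl -u admin:admin 'http://<ambari-host>:8080/api/v1/clusters/<cluster-name>/alert_definitions?fields=*' | grep -i -A5 nodemanager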
Hope this helps!
Created 07-16-2018 06:36 PM
Unfortunately, it wasn't that easy... the next time I started my environment, I had these strange errors again.
But - after quite a few desperate hours of trial and error - I figured it out.
Whenever I started a brand new image, everything was fine. I didn't have any errors.
When I started the shut-down image again the next day, it was broken.
It occurred to me that cleanly stopping the HDP services in Ambari, followed by a service ambari-agent stop and a service ambari-server stop, would be a nicer approach - and that helped.
When the processes are terminated correctly, the restart happens without any errors.
When just shutting down the AWS instance, it breaks.
That's it, plain and simple.
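For the record, the shutdown sequence that works for me (just a sketch of my setup; the agent stop runs on every node):
# 1. Stop all HDP services from the Ambari UI and wait for them to finish
# 2. On every cluster node:
service ambari-agent stop
# 3. On the Ambari host:
service ambari-server stop
# 4. Only then shut down the AWS instances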
Well... the good news is: I learned a lot 😉
Thanks for your help!
Chris
Created 07-16-2018 11:31 PM
Good one! Gotcha 🙂
I didn't know about this either.
Keep it up with your studies on HDPCA 😄