Created on 08-02-2018 04:50 AM - edited 09-16-2022 06:32 AM
Hi everyone,
I have a 5-node CDH cluster. In my cluster I am observing that the NodeManagers are restarting continuously.
I am not sure what is going on, so I am attaching the stdout, stderr, and role log.
Can you please help me?
Stderr
+ exec /opt/cloudera/parcels/CDH-5.12.0-1.cdh5.12.0.p0.29/lib/hadoop-yarn/bin/yarn nodemanager
Aug 02, 2018 11:30:32 AM com.google.inject.servlet.InternalServletModule$BackwardsCompatibleServletContextProvider get
WARNING: You are attempting to use a deprecated API (specifically, attempting to @Inject ServletContext inside an eagerly created singleton. While we allow this for backwards compatibility, be warned that this MAY have unexpected behavior if you have more than one injector (with ServletModule) running in the same JVM. Please consult the Guice documentation at http://code.google.com/p/google-guice/wiki/Servlets for more information.
Aug 02, 2018 11:30:32 AM com.sun.jersey.guice.spi.container.GuiceComponentProviderFactory register
INFO: Registering org.apache.hadoop.yarn.server.nodemanager.webapp.NMWebServices as a root resource class
Aug 02, 2018 11:30:32 AM com.sun.jersey.guice.spi.container.GuiceComponentProviderFactory register
INFO: Registering org.apache.hadoop.yarn.webapp.GenericExceptionHandler as a provider class
Aug 02, 2018 11:30:32 AM com.sun.jersey.guice.spi.container.GuiceComponentProviderFactory register
INFO: Registering org.apache.hadoop.yarn.server.nodemanager.webapp.JAXBContextResolver as a provider class
Aug 02, 2018 11:30:32 AM com.sun.jersey.server.impl.application.WebApplicationImpl _initiate
INFO: Initiating Jersey application, version 'Jersey: 1.9 09/02/2011 11:17 AM'
Aug 02, 2018 11:30:32 AM com.sun.jersey.guice.spi.container.GuiceComponentProviderFactory getComponentProvider
INFO: Binding org.apache.hadoop.yarn.server.nodemanager.webapp.JAXBContextResolver to GuiceManagedComponentProvider with the scope "Singleton"
Aug 02, 2018 11:30:32 AM com.sun.jersey.guice.spi.container.GuiceComponentProviderFactory getComponentProvider
INFO: Binding org.apache.hadoop.yarn.webapp.GenericExceptionHandler to GuiceManagedComponentProvider with the scope "Singleton"
Aug 02, 2018 11:30:33 AM com.sun.jersey.guice.spi.container.GuiceComponentProviderFactory getComponentProvider
INFO: Binding org.apache.hadoop.yarn.server.nodemanager.webapp.NMWebServices to GuiceManagedComponentProvider with the scope "Singleton"
Role log
11:48:42.105 AM INFO ContainerManagerImpl Start request for container_1533205969497_0410_01_000001 by user dr.who
11:48:42.105 AM INFO ContainerManagerImpl Creating a new application reference for app application_1533205969497_0410
11:48:42.105 AM INFO Application Application application_1533205969497_0410 transitioned from NEW to INITING
11:48:42.106 AM INFO NMAuditLogger USER=dr.who IP=172.31.24.227 OPERATION=Start Container Request TARGET=ContainerManageImpl RESULT=SUCCESS APPID=application_1533205969497_0410 CONTAINERID=container_1533205969497_0410_01_000001
11:48:42.108 AM INFO AppLogAggregatorImpl rollingMonitorInterval is set as -1. The log rolling monitoring interval is disabled. The logs will be aggregated after this application is finished.
11:48:42.125 AM INFO Application Adding container_1533205969497_0410_01_000001 to application application_1533205969497_0410
11:48:42.125 AM INFO Application Application application_1533205969497_0410 transitioned from INITING to RUNNING
11:48:42.125 AM INFO Container Container container_1533205969497_0410_01_000001 transitioned from NEW to LOCALIZED
11:48:42.125 AM INFO AuxServices Got event CONTAINER_INIT for appId application_1533205969497_0410
11:48:42.125 AM INFO YarnShuffleService Initializing container container_1533205969497_0410_01_000001
11:48:42.144 AM INFO Container Container container_1533205969497_0410_01_000001 transitioned from LOCALIZED to RUNNING
11:48:42.147 AM INFO DefaultContainerExecutor launchContainer: [bash, /data0/yarn/nm/usercache/dr.who/appcache/application_1533205969497_0410/container_1533205969497_0410_01_000001/default_container_executor.sh]
11:48:42.162 AM WARN DefaultContainerExecutor Exit code from container container_1533205969497_0410_01_000001 is : 143
11:48:42.164 AM INFO Container Container container_1533205969497_0410_01_000001 transitioned from RUNNING to EXITED_WITH_FAILURE
11:48:42.164 AM INFO ContainerLaunch Cleaning up container container_1533205969497_0410_01_000001
11:48:42.181 AM INFO DefaultContainerExecutor Deleting absolute path : /data0/yarn/nm/usercache/dr.who/appcache/application_1533205969497_0410/container_1533205969497_0410_01_000001
11:48:42.182 AM WARN NMAuditLogger USER=dr.who OPERATION=Container Finished - Failed TARGET=ContainerImpl RESULT=FAILURE DESCRIPTION=Container failed with state: EXITED_WITH_FAILURE APPID=application_1533205969497_0410 CONTAINERID=container_1533205969497_0410_01_000001
11:48:42.182 AM INFO Container Container container_1533205969497_0410_01_000001 transitioned from EXITED_WITH_FAILURE to DONE
11:48:42.182 AM INFO Application Removing container_1533205969497_0410_01_000001 from application application_1533205969497_0410
11:48:42.182 AM INFO AppLogAggregatorImpl Considering container container_1533205969497_0410_01_000001 for log-aggregation
11:48:42.182 AM INFO AuxServices Got event CONTAINER_STOP for appId application_1533205969497_0410
11:48:42.182 AM INFO YarnShuffleService Stopping container container_1533205969497_0410_01_000001
11:48:43.185 AM INFO NodeStatusUpdaterImpl Removed completed containers from NM context: [container_1533205969497_0410_01_000001]
Created 08-09-2018 03:05 AM
Yes, I have resolved this with resource pools.
Created 08-19-2018 07:10 AM
Can you let us know what exactly you did to resolve this issue?
Created 08-19-2018 08:14 AM
I am not submitting any jobs; just after the YARN installation on 4 nodes, all the nodes are continuously getting unexpected exits.
I am using Google Cloud.
Logs below:
+ exec /opt/cloudera/parcels/CDH-5.15.0-1.cdh5.15.0.p0.21/lib/hadoop-yarn/bin/yarn nodemanager
Aug 19, 2018 1:55:16 PM com.google.inject.servlet.InternalServletModule$BackwardsCompatibleServletContextProvider get
WARNING: You are attempting to use a deprecated API (specifically, attempting to @Inject ServletContext inside an eagerly created singleton. While we allow this for backwards compatibility, be warned that this MAY have unexpected behavior if you have more than one injector (with ServletModule) running in the same JVM. Please consult the Guice documentation at http://code.google.com/p/google-guice/wiki/Servlets for more information.
Aug 19, 2018 1:55:16 PM com.sun.jersey.guice.spi.container.GuiceComponentProviderFactory register
INFO: Registering org.apache.hadoop.yarn.server.nodemanager.webapp.NMWebServices as a root resource class
Aug 19, 2018 1:55:16 PM com.sun.jersey.guice.spi.container.GuiceComponentProviderFactory register
INFO: Registering org.apache.hadoop.yarn.webapp.GenericExceptionHandler as a provider class
Aug 19, 2018 1:55:16 PM com.sun.jersey.guice.spi.container.GuiceComponentProviderFactory register
INFO: Registering org.apache.hadoop.yarn.server.nodemanager.webapp.JAXBContextResolver as a provider class
Aug 19, 2018 1:55:16 PM com.sun.jersey.server.impl.application.WebApplicationImpl _initiate
INFO: Initiating Jersey application, version 'Jersey: 1.9 09/02/2011 11:17 AM'
Aug 19, 2018 1:55:16 PM com.sun.jersey.guice.spi.container.GuiceComponentProviderFactory getComponentProvider0
Created on 08-19-2018 08:24 AM - edited 08-19-2018 08:26 AM
Well, this is not helping. Can you post the ResourceManager logs? Unexpected exits usually occur because of something going on with the ResourceManager. Besides that, check the Applications section of the ResourceManager web UI: are there any running apps? If yes, which user do they belong to?
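If it is easier, the same check can be done from the command line; this is just a quick sketch using the stock yarn CLI (the application id below is taken from the NodeManager log earlier in this thread, purely as an example):

# list applications currently ACCEPTED or RUNNING, together with the submitting user
yarn application -list -appStates ACCEPTED,RUNNING

# any rogue application (for example one owned by dr.who) can then be killed by id
yarn application -kill application_1533205969497_0410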
Created 08-19-2018 09:07 AM
Thanks for pointing me to the ResourceManager logs and web UI.
The issue is fixed now.
The reason is that, since I had not enabled a firewall on Google Cloud for my cluster, hundreds of YARN jobs were being triggered by the dr.who user. These jobs go into the "ACCEPTED" state and then to "FAILED", causing the NodeManagers to continuously fail.
After enabling the firewall and restarting the YARN services, this is fixed now.
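For anyone hitting the same thing: the post above does not spell out the exact firewall rules used, but on Google Cloud one common approach (sketched here with a made-up network tag cdh-cluster and a placeholder trusted address) is to make sure nothing exposes the YARN ports to 0.0.0.0/0 and to allow them only from addresses you control:

# look for overly permissive rules (anything allowing 0.0.0.0/0 to the cluster)
gcloud compute firewall-rules list

# allow the ResourceManager web UI (8088) and RPC (8032) only from a trusted address
gcloud compute firewall-rules create allow-yarn-trusted \
  --direction=INGRESS --action=ALLOW --rules=tcp:8088,tcp:8032 \
  --source-ranges=203.0.113.10/32 --target-tags=cdh-cluster

# delete any rule that was opening those ports to the whole internet
gcloud compute firewall-rules delete <rule-name>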
Created 08-19-2018 09:22 AM
Yeah, this dr.who is actually the default user name applied to any unauthorized (unauthenticated) request; it happens when your cluster is exposed to the internet. You might also want to check for any crons the attacker may have scheduled and kill them, starting with sudo -u yarn crontab -l on the ResourceManager host.
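A minimal check along those lines (checking the hdfs account as well is an extra suggestion on my part, not something mentioned above):

# list scheduled jobs left behind under the yarn and hdfs service accounts
sudo -u yarn crontab -l
sudo -u hdfs crontab -l

# if a suspicious entry shows up, wipe that user's crontab
sudo -u yarn crontab -r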
Thanks