Node manager unexpected exits

Contributor

Hi everyone,

I have a 5-node CDH cluster. In my cluster I am observing that the NodeManagers are restarting continuously.

I am not sure what is going on. I am attaching the stdout, stderr, and role log.

Can you please help me?

Stderr

+ exec /opt/cloudera/parcels/CDH-5.12.0-1.cdh5.12.0.p0.29/lib/hadoop-yarn/bin/yarn nodemanager
Aug 02, 2018 11:30:32 AM com.google.inject.servlet.InternalServletModule$BackwardsCompatibleServletContextProvider get
WARNING: You are attempting to use a deprecated API (specifically, attempting to @Inject ServletContext inside an eagerly created singleton. While we allow this for backwards compatibility, be warned that this MAY have unexpected behavior if you have more than one injector (with ServletModule) running in the same JVM. Please consult the Guice documentation at http://code.google.com/p/google-guice/wiki/Servlets for more information.
Aug 02, 2018 11:30:32 AM com.sun.jersey.guice.spi.container.GuiceComponentProviderFactory register
INFO: Registering org.apache.hadoop.yarn.server.nodemanager.webapp.NMWebServices as a root resource class
Aug 02, 2018 11:30:32 AM com.sun.jersey.guice.spi.container.GuiceComponentProviderFactory register
INFO: Registering org.apache.hadoop.yarn.webapp.GenericExceptionHandler as a provider class
Aug 02, 2018 11:30:32 AM com.sun.jersey.guice.spi.container.GuiceComponentProviderFactory register
INFO: Registering org.apache.hadoop.yarn.server.nodemanager.webapp.JAXBContextResolver as a provider class
Aug 02, 2018 11:30:32 AM com.sun.jersey.server.impl.application.WebApplicationImpl _initiate
INFO: Initiating Jersey application, version 'Jersey: 1.9 09/02/2011 11:17 AM'
Aug 02, 2018 11:30:32 AM com.sun.jersey.guice.spi.container.GuiceComponentProviderFactory getComponentProvider
INFO: Binding org.apache.hadoop.yarn.server.nodemanager.webapp.JAXBContextResolver to GuiceManagedComponentProvider with the scope "Singleton"
Aug 02, 2018 11:30:32 AM com.sun.jersey.guice.spi.container.GuiceComponentProviderFactory getComponentProvider
INFO: Binding org.apache.hadoop.yarn.webapp.GenericExceptionHandler to GuiceManagedComponentProvider with the scope "Singleton"
Aug 02, 2018 11:30:33 AM com.sun.jersey.guice.spi.container.GuiceComponentProviderFactory getComponentProvider
INFO: Binding org.apache.hadoop.yarn.server.nodemanager.webapp.NMWebServices to GuiceManagedComponentProvider with the scope "Singleton"

role log

11:48:42.105 AM	INFO	ContainerManagerImpl	
Start request for container_1533205969497_0410_01_000001 by user dr.who
11:48:42.105 AM	INFO	ContainerManagerImpl	
Creating a new application reference for app application_1533205969497_0410
11:48:42.105 AM	INFO	Application	
Application application_1533205969497_0410 transitioned from NEW to INITING
11:48:42.106 AM	INFO	NMAuditLogger	
USER=dr.who	IP=172.31.24.227	OPERATION=Start Container Request	TARGET=ContainerManageImpl	RESULT=SUCCESS	APPID=application_1533205969497_0410	CONTAINERID=container_1533205969497_0410_01_000001
11:48:42.108 AM	INFO	AppLogAggregatorImpl	
rollingMonitorInterval is set as -1. The log rolling monitoring interval is disabled. The logs will be aggregated after this application is finished.
11:48:42.125 AM	INFO	Application	
Adding container_1533205969497_0410_01_000001 to application application_1533205969497_0410
11:48:42.125 AM	INFO	Application	
Application application_1533205969497_0410 transitioned from INITING to RUNNING
11:48:42.125 AM	INFO	Container	
Container container_1533205969497_0410_01_000001 transitioned from NEW to LOCALIZED
11:48:42.125 AM	INFO	AuxServices	
Got event CONTAINER_INIT for appId application_1533205969497_0410
11:48:42.125 AM	INFO	YarnShuffleService	
Initializing container container_1533205969497_0410_01_000001
11:48:42.144 AM	INFO	Container	
Container container_1533205969497_0410_01_000001 transitioned from LOCALIZED to RUNNING
11:48:42.147 AM	INFO	DefaultContainerExecutor	
launchContainer: [bash, /data0/yarn/nm/usercache/dr.who/appcache/application_1533205969497_0410/container_1533205969497_0410_01_000001/default_container_executor.sh]
11:48:42.162 AM	WARN	DefaultContainerExecutor	
Exit code from container container_1533205969497_0410_01_000001 is : 143
11:48:42.164 AM	INFO	Container	
Container container_1533205969497_0410_01_000001 transitioned from RUNNING to EXITED_WITH_FAILURE
11:48:42.164 AM	INFO	ContainerLaunch	
Cleaning up container container_1533205969497_0410_01_000001
11:48:42.181 AM	INFO	DefaultContainerExecutor	
Deleting absolute path : /data0/yarn/nm/usercache/dr.who/appcache/application_1533205969497_0410/container_1533205969497_0410_01_000001
11:48:42.182 AM	WARN	NMAuditLogger	
USER=dr.who	OPERATION=Container Finished - Failed	TARGET=ContainerImpl	RESULT=FAILURE	DESCRIPTION=Container failed with state: EXITED_WITH_FAILURE	APPID=application_1533205969497_0410	CONTAINERID=container_1533205969497_0410_01_000001
11:48:42.182 AM	INFO	Container	
Container container_1533205969497_0410_01_000001 transitioned from EXITED_WITH_FAILURE to DONE
11:48:42.182 AM	INFO	Application	
Removing container_1533205969497_0410_01_000001 from application application_1533205969497_0410
11:48:42.182 AM	INFO	AppLogAggregatorImpl	
Considering container container_1533205969497_0410_01_000001 for log-aggregation
11:48:42.182 AM	INFO	AuxServices	
Got event CONTAINER_STOP for appId application_1533205969497_0410
11:48:42.182 AM	INFO	YarnShuffleService	
Stopping container container_1533205969497_0410_01_000001
11:48:43.185 AM	INFO	NodeStatusUpdaterImpl	
Removed completed containers from NM context: [container_1533205969497_0410_01_000001]
1 ACCEPTED SOLUTION

Contributor

Yes, I resolved this with resource pools.

11 REPLIES

Expert Contributor
Did you find a solution? I am facing this issue as well.

Contributor

Yes, I resolved this with resource pools.

Expert Contributor
Can you please help me here? What solution did you adopt? I have only recently started using it.

New Contributor

Can you let us know what exactly you did to resolve this issue?

Expert Contributor
He created a resource pool in YARN to work around this issue, but in my case the cause was totally different. I would be happy to help if you share logs or anything similar. Who is creating the jobs? Is it dr.who or your own user?
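
One way to answer the "who is creating the jobs" question from the NodeManager side is to tally the NMAuditLogger records, like the ones in the role log at the top of this thread. The following is only a minimal sketch: it assumes the audit fields are tab-separated KEY=value pairs as they appear in that paste, and the function names are made up for illustration.

```python
from collections import Counter


def parse_audit_line(line):
    """Split one NMAuditLogger record into a dict of KEY -> value.

    Assumes tab-separated KEY=value fields, as in the pasted role log.
    """
    fields = {}
    for part in line.strip().split("\t"):
        key, sep, value = part.partition("=")
        if sep:  # skip fragments that are not KEY=value
            fields[key] = value
    return fields


def start_requests_by_user(lines):
    """Count 'Start Container Request' audit records per user."""
    counts = Counter()
    for line in lines:
        fields = parse_audit_line(line)
        if fields.get("OPERATION") == "Start Container Request":
            counts[fields.get("USER", "<unknown>")] += 1
    return counts


# Two records copied from the role log in the original question.
sample = [
    "USER=dr.who\tIP=172.31.24.227\tOPERATION=Start Container Request\t"
    "TARGET=ContainerManageImpl\tRESULT=SUCCESS\t"
    "APPID=application_1533205969497_0410\t"
    "CONTAINERID=container_1533205969497_0410_01_000001",
    "USER=dr.who\tOPERATION=Container Finished - Failed\tTARGET=ContainerImpl\t"
    "RESULT=FAILURE\tDESCRIPTION=Container failed with state: EXITED_WITH_FAILURE\t"
    "APPID=application_1533205969497_0410\t"
    "CONTAINERID=container_1533205969497_0410_01_000001",
]

print(start_requests_by_user(sample))  # Counter({'dr.who': 1})
```

If nearly all start requests come from a user nobody on the team recognizes, the jobs are probably being submitted from outside the cluster.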

New Contributor

I am not submitting any jobs. Right after installing YARN on 4 nodes, the NodeManagers on all of them are continuously getting unexpected exits.

I am using Google Cloud.

Logs below:

 

+ exec /opt/cloudera/parcels/CDH-5.15.0-1.cdh5.15.0.p0.21/lib/hadoop-yarn/bin/yarn nodemanager
Aug 19, 2018 1:55:16 PM com.google.inject.servlet.InternalServletModule$BackwardsCompatibleServletContextProvider get
WARNING: You are attempting to use a deprecated API (specifically, attempting to @Inject ServletContext inside an eagerly created singleton. While we allow this for backwards compatibility, be warned that this MAY have unexpected behavior if you have more than one injector (with ServletModule) running in the same JVM. Please consult the Guice documentation at http://code.google.com/p/google-guice/wiki/Servlets for more information.
Aug 19, 2018 1:55:16 PM com.sun.jersey.guice.spi.container.GuiceComponentProviderFactory register
INFO: Registering org.apache.hadoop.yarn.server.nodemanager.webapp.NMWebServices as a root resource class
Aug 19, 2018 1:55:16 PM com.sun.jersey.guice.spi.container.GuiceComponentProviderFactory register
INFO: Registering org.apache.hadoop.yarn.webapp.GenericExceptionHandler as a provider class
Aug 19, 2018 1:55:16 PM com.sun.jersey.guice.spi.container.GuiceComponentProviderFactory register
INFO: Registering org.apache.hadoop.yarn.server.nodemanager.webapp.JAXBContextResolver as a provider class
Aug 19, 2018 1:55:16 PM com.sun.jersey.server.impl.application.WebApplicationImpl _initiate
INFO: Initiating Jersey application, version 'Jersey: 1.9 09/02/2011 11:17 AM'
Aug 19, 2018 1:55:16 PM com.sun.jersey.guice.spi.container.GuiceComponentProviderFactory getComponentProvider0   

  

Expert Contributor

Well, this is not helping. Can you post the ResourceManager logs? Unexpected exits are usually caused by something going on with the ResourceManager. Also, check the Applications section of the ResourceManager web UI: are there any running apps? If yes, who is the user?
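
The same check can be scripted against the ResourceManager REST API instead of the web UI. This is a rough sketch, not a definitive tool: it assumes the RM web endpoint is reachable without authentication at a URL you supply (the host name `rm-host` below is hypothetical, and 8088 is only the default web UI port), and the sample payload is invented to show the response shape.

```python
import json
from collections import Counter
from urllib.request import urlopen


def fetch_apps(rm_url, states="ACCEPTED,RUNNING"):
    """GET <rm_url>/ws/v1/cluster/apps and return the parsed JSON."""
    url = f"{rm_url}/ws/v1/cluster/apps?states={states}"
    with urlopen(url, timeout=10) as resp:
        return json.load(resp)


def apps_by_user(payload):
    """Count (user, state) pairs in a cluster apps response.

    The RM returns {"apps": null} when nothing matches, hence the 'or {}'.
    """
    apps = (payload.get("apps") or {}).get("app", [])
    return Counter((app["user"], app["state"]) for app in apps)


# Invented payload, for illustration only.
sample_payload = {
    "apps": {
        "app": [
            {"id": "application_1_0001", "user": "dr.who", "state": "ACCEPTED"},
            {"id": "application_1_0002", "user": "dr.who", "state": "ACCEPTED"},
            {"id": "application_1_0003", "user": "etl", "state": "RUNNING"},
        ]
    }
}

for (user, state), count in sorted(apps_by_user(sample_payload).items()):
    print(user, state, count)
```

Against a live cluster, the call would look like `apps_by_user(fetch_apps("http://rm-host:8088"))`, with the real ResourceManager address substituted.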

New Contributor

Thanks for pointing me to the ResourceManager logs and GUI.

The issue is fixed now.

The reason: since I had not enabled a firewall on Google Cloud for my cluster, hundreds of YARN jobs were being triggered by the dr.who user. These jobs go into the "ACCEPTED" state and then to "FAILED", causing the NodeManagers to fail continuously.

After enabling the firewall and restarting the YARN services, this is fixed.
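
To confirm that the firewall actually closes the hole, a quick reachability probe can be run from a machine outside the cluster network. The sketch below only tests whether a TCP connection succeeds; the host name in the usage comment is hypothetical, and 8088 is assumed to be the ResourceManager web port (its default), so adjust both for your setup.

```python
import socket


def is_port_open(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and unreachable hosts.
        return False


# Example usage (hypothetical public address, default RM web port):
#   is_port_open("rm-host.example.com", 8088)
# A True result from outside the cluster means the port is still exposed.
```

A False result only shows that this one port is closed from where the probe ran; other YARN and Hadoop ports would need the same check.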

 

 

 

Expert Contributor

Yeah, dr.who is the name that shows up for any unauthenticated user; it happens when your cluster is exposed to the internet. You might also want to check for unwanted cron jobs: list them with sudo -u yarn crontab -l on the ResourceManager host and remove any entries you do not recognize.

 

Thanks