Created on 10-09-2018 02:58 AM - edited 09-16-2022 06:47 AM
Hi all,
We have our cluster deployed on AWS EC2 instances, where some of the worker nodes are on spot instances. Usually there is no problem when spot instances disappear; we have time to decommission them from CM. Recently we have started to experience ResourceManager crashes when we lose spot instances (see the log below). After the ResourceManager crashes it does not restart automatically, and after a while all of our remaining NodeManager processes are shut down as well, leaving no YARN capacity at all even though we have plenty of healthy machines. We are using CDH 5.14.2.
1. Is the problem in the stack trace below (Timer already cancelled) a known issue?
2. Can we change the configuration so that the ResourceManager recovers from this automatically? I only see an automatic restart option for the JobHistory Server in CM, but perhaps this is the same process?
Br,
Petter
2018-10-08 16:14:45,617 ERROR org.apache.hadoop.yarn.server.resourcemanager.ResourceManager: Received RMFatalEvent of type CRITICAL_THREAD_CRASH, caused by a critical thread, FSPreemptionThread, that exited unexpectedly: java.lang.IllegalStateException: Timer already cancelled.
        at java.util.Timer.sched(Timer.java:397)
        at java.util.Timer.schedule(Timer.java:193)
        at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FSPreemptionThread.preemptContainers(FSPreemptionThread.java:212)
        at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FSPreemptionThread.run(FSPreemptionThread.java:77)
2018-10-08 16:14:45,623 FATAL org.apache.hadoop.yarn.server.resourcemanager.ResourceManager: Shutting down the resource manager.
2018-10-08 16:14:45,624 INFO org.apache.hadoop.util.ExitUtil: Exiting with status 1
2018-10-08 16:14:45,629 INFO org.mortbay.log: Stopped HttpServer2$SelectChannelConnectorWithSafeStartup@ip-10-255-4-86.eu-west-1.compute.internal:8088
2018-10-08 16:14:45,731 INFO org.apache.hadoop.ipc.Server: Stopping server on 8032
2018-10-08 16:14:45,732 INFO org.apache.hadoop.ipc.Server: Stopping IPC Server listener on 8032
2018-10-08 16:14:45,732 INFO org.apache.hadoop.ipc.Server: Stopping server on 8033
2018-10-08 16:14:45,732 INFO org.apache.hadoop.ipc.Server: Stopping IPC Server Responder
2018-10-08 16:14:45,732 INFO org.apache.hadoop.ipc.Server: Stopping IPC Server listener on 8033
2018-10-08 16:14:45,733 INFO org.apache.hadoop.ipc.Server: Stopping IPC Server Responder
2018-10-08 16:14:48,250 INFO org.apache.hadoop.yarn.client.RMProxy: Connecting to ResourceManager at ip-10-255-4-86.eu-west-1.compute.internal/10.255.4.86:8033
2018-10-08 16:14:49,643 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: ip-10-255-4-86.eu-west-1.compute.internal/10.255.4.86:8033. Already tried 0 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 MILLISECONDS)
2018-10-08 16:14:50,644 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: ip-10-255-4-86.eu-west-1.compute.internal/10.255.4.86:8033. Already tried 1 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 MILLISECONDS)
2018-10-08 16:14:51,647 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: ip-10-255-4-
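For reference, the exception message in the trace above comes from `java.util.Timer` itself, not from YARN-specific code. The following is a minimal sketch of how the JDK produces it; it does not reproduce the ResourceManager crash, and the log does not say why the timer used by `FSPreemptionThread` was no longer usable. The class and method names (`TimerCancelledDemo`, `scheduleAfterCancel`) are made up for illustration.

```java
import java.util.Timer;
import java.util.TimerTask;

public class TimerCancelledDemo {

    // Schedules a task on a Timer that has already been cancelled and
    // returns the message of the resulting exception (or "scheduled"
    // if, unexpectedly, no exception is thrown).
    static String scheduleAfterCancel() {
        Timer timer = new Timer("demo-timer", true); // daemon timer thread
        timer.cancel();                              // timer is now unusable

        try {
            timer.schedule(new TimerTask() {
                @Override
                public void run() {
                    // never runs: scheduling fails first
                }
            }, 1000L);
            return "scheduled";
        } catch (IllegalStateException e) {
            // Same exception type and message as in the RM log
            return e.getMessage();
        }
    }

    public static void main(String[] args) {
        System.out.println(scheduleAfterCancel());
    }
}
```

According to the `java.util.Timer` Javadoc, the same `IllegalStateException` is also raised when the timer's task execution thread has terminated unexpectedly, so an explicit `cancel()` call is only one of the ways a timer can end up in this state.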
Created on 06-15-2021 04:26 AM - edited 06-15-2021 04:27 AM
Hi, has this problem been solved? I am facing it as well.
Created 06-15-2021 08:04 AM
@Bryan_zh as this is an older post, you would have a better chance of receiving a resolution by starting a new thread. This will also be an opportunity to provide details specific to your environment that could aid others in assisting you with a more accurate answer to your question. You can link this thread as a reference in your new post.
Regards,
Vidya Sargur
Created 08-09-2021 12:51 AM
Okay, Thanks very much.