Support Questions

Pettax · 10-09-2018

Hi all,

We have our cluster deployed on AWS EC2 instances, where some of the worker nodes are on spot instances. Usually there is no problem when spot instances disappear: we have time to decommission them from CM. Recently we have started to see the ResourceManager crash when we lose spot instances. See the log below. After the ResourceManager crashes, it does not restart automatically, and after a while all of our remaining NodeManager processes are shut down as well, leaving no YARN capacity at all even though we have plenty of healthy machines. We are using CDH 5.14.2.

1. Is the problem in the stack trace below (Timer already cancelled) a known issue?

2. Can we change the configuration to have the ResourceManager recover from this automatically? I only see an automatic restart option for the JobHistory Server in CM, but perhaps this is the same process?

Br,

Petter

2018-10-08 16:14:45,617 ERROR org.apache.hadoop.yarn.server.resourcemanager.ResourceManager: Received RMFatalEvent of type CRITICAL_THREAD_CRASH, caused by a critical thread, FSPreemptionThread, that exited unexpectedly: java.lang.IllegalStateException: Timer already cancelled.
        at java.util.Timer.sched(Timer.java:397)
        at java.util.Timer.schedule(Timer.java:193)
        at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FSPreemptionThread.preemptContainers(FSPreemptionThread.java:212)
        at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FSPreemptionThread.run(FSPreemptionThread.java:77)

2018-10-08 16:14:45,623 FATAL org.apache.hadoop.yarn.server.resourcemanager.ResourceManager: Shutting down the resource manager.
2018-10-08 16:14:45,624 INFO org.apache.hadoop.util.ExitUtil: Exiting with status 1
2018-10-08 16:14:45,629 INFO org.mortbay.log: Stopped HttpServer2$SelectChannelConnectorWithSafeStartup@ip-10-255-4-86.eu-west-1.compute.internal:8088
2018-10-08 16:14:45,731 INFO org.apache.hadoop.ipc.Server: Stopping server on 8032
2018-10-08 16:14:45,732 INFO org.apache.hadoop.ipc.Server: Stopping IPC Server listener on 8032
2018-10-08 16:14:45,732 INFO org.apache.hadoop.ipc.Server: Stopping server on 8033
2018-10-08 16:14:45,732 INFO org.apache.hadoop.ipc.Server: Stopping IPC Server Responder
2018-10-08 16:14:45,732 INFO org.apache.hadoop.ipc.Server: Stopping IPC Server listener on 8033
2018-10-08 16:14:45,733 INFO org.apache.hadoop.ipc.Server: Stopping IPC Server Responder
2018-10-08 16:14:48,250 INFO org.apache.hadoop.yarn.client.RMProxy: Connecting to ResourceManager at ip-10-255-4-86.eu-west-1.compute.internal/10.255.4.86:8033
2018-10-08 16:14:49,643 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: ip-10-255-4-86.eu-west-1.compute.internal/10.255.4.86:8033. Already tried 0 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 MILLISECONDS)
2018-10-08 16:14:50,644 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: ip-10-255-4-86.eu-west-1.compute.internal/10.255.4.86:8033. Already tried 1 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 MILLISECONDS)
2018-10-08 16:14:51,647 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: ip-10-255-4-
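
For readers unfamiliar with the exception in the stack trace: the "Timer already cancelled." message is thrown by java.util.Timer itself whenever a task is scheduled on a timer that has been cancelled or whose background thread has died (for example, because an earlier TimerTask threw an uncaught exception). The following is only a minimal standalone sketch of that JDK behavior. The class name TimerCancelledDemo and its method are hypothetical, and it does not reproduce the FSPreemptionThread code or claim to identify the root cause in YARN.

```java
import java.util.Timer;
import java.util.TimerTask;

// Hypothetical demo class; not part of Hadoop or YARN.
public class TimerCancelledDemo {

    // Schedules a task on a timer that has already been cancelled and
    // returns the message of the resulting IllegalStateException
    // (or null if, unexpectedly, no exception is thrown).
    static String scheduleAfterCancel() {
        Timer timer = new Timer("demo-timer", true); // daemon timer thread
        timer.cancel(); // from here on, the timer rejects new tasks
        try {
            timer.schedule(new TimerTask() {
                @Override
                public void run() {
                    // Never reached: scheduling fails before the task can run.
                }
            }, 1000L);
            return null;
        } catch (IllegalStateException e) {
            return e.getMessage();
        }
    }

    public static void main(String[] args) {
        System.out.println(scheduleAfterCancel());
    }
}
```

Running main prints the same message that appears in the log above. In the ResourceManager the exception is not caught, so the thread running FSPreemptionThread exits and the RM treats that as a fatal event, as the CRITICAL_THREAD_CRASH line shows.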

Bryan_zh · 06-15-2021

Hi, has this problem been solved? I am facing the same problem.

VidyaSargur · 06-15-2021

@Bryan_zh as this is an older post, you would have a better chance of receiving a resolution by starting a new thread. This will also be an opportunity to provide details specific to your environment that could aid others in assisting you with a more accurate answer to your question. You can link this thread as a reference in your new post.

Regards,

Vidya Sargur,
Community Manager

Bryan_zh · 08-09-2021

Okay, thanks very much.