YARN RM fatal shutdown: cannot find the root cause. What is misconfigured in my yarn-site.xml?

New Contributor

Hi everyone,

Please help me find the root cause of why the YARN ResourceManager does not fail over automatically on my cluster.

I get the ERROR messages below around a fatal event in yarn-resource.log on the active YARN RM node:
2023-02-25 08:04:03,805 INFO org.apache.hadoop.yarn.server.resourcemanager.rmcontainer.RMContainerImpl: container_1670355762504_0824_01_000007 Container Transitioned from ACQUIRED to RELEASED
2023-02-25 08:29:08,810 WARN org.apache.zookeeper.ClientCnxn: Client session timed out, have not heard from server in 6669ms for sessionid 0x1000048d7880001
2023-02-25 08:29:08,810 INFO org.apache.zookeeper.ClientCnxn: Client session timed out, have not heard from server in 6669ms for sessionid 0x1000048d7880001, closing socket connection and attempting reconnect
2023-02-25 08:29:08,911 INFO org.apache.hadoop.ha.ActiveStandbyElector: Session disconnected. Entering neutral mode...
2023-02-25 08:29:08,911 WARN org.apache.hadoop.yarn.server.resourcemanager.ActiveStandbyElectorBasedElectorService: Lost contact with Zookeeper. Transitioning to standby in 10000 ms if connection is not reestablished.
2023-02-25 08:29:09,647 INFO org.apache.zookeeper.ClientCnxn: Opening socket connection to server yarn-rm1.hostname/10.x.x.x:2181. Will not attempt to authenticate using SASL (unknown error)
2023-02-25 08:29:09,647 INFO org.apache.zookeeper.ClientCnxn: Socket connection established to yarn-rm1.hostname/10.x.x.x:2181, initiating session
2023-02-25 08:29:09,686 INFO org.apache.zookeeper.ClientCnxn: Session establishment complete on server yarn-rm1.hostname/10.x.x.x:2181, sessionid = 0x1000048d7880001, negotiated timeout = 10000
2023-02-25 08:29:09,686 INFO org.apache.hadoop.ha.ActiveStandbyElector: Session connected.
2023-02-25 08:29:09,698 INFO org.apache.hadoop.conf.Configuration: found resource yarn-site.xml at file:/app/hadoop-3.2.2/etc/hadoop/yarn-site.xml
2023-02-25 08:29:09,700 INFO org.apache.hadoop.yarn.server.resourcemanager.ResourceManager: Transitioning to standby state
2023-02-25 08:29:09,707 WARN org.apache.hadoop.yarn.server.resourcemanager.amlauncher.ApplicationMasterLauncher: org.apache.hadoop.yarn.server.resourcemanager.amlauncher.ApplicationMasterLauncher$LauncherThread interrupted. Returning.
2023-02-25 08:29:09,716 INFO org.apache.hadoop.ipc.Server: Stopping server on 8032
2023-02-25 08:29:09,727 INFO org.apache.hadoop.ipc.Server: Stopping IPC Server listener on 8032
2023-02-25 08:29:09,730 INFO org.apache.hadoop.ipc.Server: Stopping server on 8030
2023-02-25 08:29:09,734 INFO org.apache.hadoop.ipc.Server: Stopping IPC Server Responder
2023-02-25 08:29:09,737 INFO org.apache.hadoop.ipc.Server: Stopping IPC Server listener on 8030
2023-02-25 08:29:09,737 INFO org.apache.hadoop.ipc.Server: Stopping IPC Server Responder
2023-02-25 08:29:09,740 INFO org.apache.hadoop.ipc.Server: Stopping server on 8031
2023-02-25 08:29:09,748 INFO org.apache.hadoop.ipc.Server: Stopping IPC Server Responder
2023-02-25 08:29:09,748 ERROR org.apache.hadoop.yarn.event.EventDispatcher: Returning, interrupted : java.lang.InterruptedException
2023-02-25 08:29:09,748 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.activities.ActivitiesManager: org.apache.hadoop.yarn.server.resourcemanager.scheduler.activities.ActivitiesManager thread interrupted
2023-02-25 08:29:09,749 INFO org.apache.hadoop.yarn.event.AsyncDispatcher: AsyncDispatcher is draining to stop, ignoring any new events.
2023-02-25 08:29:09,749 INFO org.apache.hadoop.yarn.util.AbstractLivelinessMonitor: NMLivelinessMonitor thread interrupted
2023-02-25 08:29:09,750 INFO org.apache.hadoop.ipc.Server: Stopping IPC Server listener on 8031
2023-02-25 08:29:09,751 INFO org.apache.hadoop.yarn.event.AsyncDispatcher: AsyncDispatcher is draining to stop, ignoring any new events.
2023-02-25 08:29:09,751 INFO org.apache.hadoop.yarn.util.AbstractLivelinessMonitor: org.apache.hadoop.yarn.server.resourcemanager.rmapp.monitor.RMAppLifetimeMonitor thread interrupted
2023-02-25 08:29:09,752 INFO org.apache.hadoop.yarn.util.AbstractLivelinessMonitor: org.apache.hadoop.yarn.server.resourcemanager.rmcontainer.ContainerAllocationExpirer thread interrupted
2023-02-25 08:29:09,753 ERROR org.apache.hadoop.security.token.delegation.AbstractDelegationTokenSecretManager: ExpiredTokenRemover received java.lang.InterruptedException: sleep interrupted
2023-02-25 08:29:09,751 INFO org.apache.hadoop.yarn.util.AbstractLivelinessMonitor: AMLivelinessMonitor thread interrupted
2023-02-25 08:29:09,755 INFO org.apache.hadoop.metrics2.impl.MetricsSystemImpl: Stopping ResourceManager metrics system...
2023-02-25 08:29:09,755 INFO org.apache.hadoop.metrics2.impl.MetricsSystemImpl: ResourceManager metrics system stopped.
2023-02-25 08:29:09,756 INFO org.apache.hadoop.yarn.util.AbstractLivelinessMonitor: AMLivelinessMonitor thread interrupted
2023-02-25 08:29:09,758 INFO org.apache.hadoop.metrics2.impl.MetricsSystemImpl: ResourceManager metrics system shutdown complete.
2023-02-25 08:29:09,758 INFO org.apache.hadoop.yarn.event.AsyncDispatcher: AsyncDispatcher is draining to stop, ignoring any new events.
2023-02-25 08:29:09,759 INFO org.apache.hadoop.yarn.event.AsyncDispatcher: Registering class org.apache.hadoop.yarn.server.resourcemanager.RMFatalEventType for class org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$RMFatalEventDispatcher
2023-02-25 08:29:09,761 INFO org.apache.hadoop.yarn.server.resourcemanager.security.NMTokenSecretManagerInRM: NMTokenKeyRollingInterval: 86400000ms and NMTokenKeyActivationDelay: 900000ms
2023-02-25 08:29:09,761 INFO org.apache.hadoop.yarn.server.resourcemanager.security.RMContainerTokenSecretManager: ContainerTokenKeyRollingInterval: 86400000ms and ContainerTokenKeyActivationDelay: 900000ms
2023-02-25 08:29:09,761 INFO org.apache.hadoop.yarn.server.resourcemanager.security.AMRMTokenSecretManager: AMRMTokenKeyRollingInterval: 86400000ms and AMRMTokenKeyActivationDelay: 900000 ms
2023-02-25 08:29:09,762 INFO org.apache.hadoop.yarn.event.AsyncDispatcher: Registering class org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStoreEventType for class org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore$ForwardingEventHandler
2023-02-25 08:29:09,762 INFO org.apache.hadoop.yarn.event.AsyncDispatcher: Registering class org.apache.hadoop.yarn.server.resourcemanager.NodesListManagerEventType for class org.apache.hadoop.yarn.server.resourcemanager.NodesListManager
2023-02-25 08:29:09,762 INFO org.apache.hadoop.yarn.server.resourcemanager.ResourceManager: Using Scheduler: org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler
2023-02-25 08:29:09,763 INFO org.apache.hadoop.yarn.event.AsyncDispatcher: Registering class org.apache.hadoop.yarn.server.resourcemanager.scheduler.event.SchedulerEventType for class org.apache.hadoop.yarn.event.EventDispatcher
2023-02-25 08:29:09,763 INFO org.apache.hadoop.yarn.event.AsyncDispatcher: Registering class org.apache.hadoop.yarn.server.resourcemanager.rmapp.RMAppEventType for class org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$ApplicationEventDispatcher
2023-02-25 08:29:09,763 INFO org.apache.hadoop.yarn.event.AsyncDispatcher: Registering class org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptEventType for class org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$ApplicationAttemptEventDispatcher
2023-02-25 08:29:09,763 INFO org.apache.hadoop.yarn.event.AsyncDispatcher: Registering class org.apache.hadoop.yarn.server.resourcemanager.rmnode.RMNodeEventType for class org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$NodeEventDispatcher
2023-02-25 08:29:09,767 INFO org.apache.hadoop.metrics2.impl.MetricsConfig: Loaded properties from hadoop-metrics2.properties
And this is my yarn-site.xml on the ResourceManager node:
<configuration>
<property>
<name>yarn.nodemanager.local-dirs</name>
<value>/data/nm-local-dir</value>
</property>
<property>
<name>yarn.node-labels.enabled</name>
<value>false</value>
</property>
<property>
<name>yarn.node-attribute.fs-store.root-dir</name>
<value>file:///app/tmp/hadoop-yarn-yarn/node-attribute</value>
</property>
<property>
<name>yarn.resourcemanager.hostname</name>
<value>yarn-rm1.hostname</value>
</property>
<property>
<name>yarn.nodemanager.resource.memory-mb</name>
<value>32768</value>
</property>
<property>
<name>yarn.nodemanager.resource.cpu-vcores</name>
<value>6</value>
</property>
<property>
<name>yarn.scheduler.maximum-allocation-mb</name>
<value>32768</value>
</property>
<property>
<name>yarn.scheduler.maximum-allocation-vcores</name>
<value>6</value>
</property>
<property>
<name>yarn.nodemanager.aux-services</name>
<value>mapreduce_shuffle</value>
</property>
<property>
<name>yarn.nodemanager.env-whitelist</name>
<value>JAVA_HOME,HADOOP_COMMON_HOME,HADOOP_HDFS_HOME,HADOOP_CONF_DIR,CLASSPATH_PREPEND_DISTCACHE,HADOOP_YARN_HOME,HADOOP_MAPRED_HOME</value>
</property>
<property>
<name>yarn.resourcemanager.principal</name>
<value>yarn/yarn-rm1.hostname@MYREALM</value>
</property>
<property>
<name>yarn.resourcemanager.keytab</name>
<value>/app/keytabs/hdfs.keytab</value>
</property>
<property>
<name>yarn.nodemanager.principal</name>
<value>yarn/yarn-nm1.hostname@MYREALM</value>
</property>
<property>
<name>yarn.nodemanager.keytab</name>
<value>/app/keytabs/hdfs.keytab</value>
</property>
<property>
<name>yarn.http.policy</name>
<value>HTTPS_ONLY</value>
</property>
<property>
<name>yarn.resourcemanager.webapp.https.address</name>
<value>0.0.0.0:8089</value>
</property>
<property>
<name>yarn.nodemanager.webapp.https.address</name>
<value>0.0.0.0:8090</value>
</property>
<property>
<name>yarn.resourcemanager.ha.enabled</name>
<value>true</value>
</property>
<property>
<name>yarn.resourcemanager.cluster-id</name>
<value>yarn-rm</value>
</property>
<property>
<name>yarn.resourcemanager.ha.rm-ids</name>
<value>rm1,rm2</value>
</property>
<property>
<name>yarn.resourcemanager.hostname.rm1</name>
<value>yarn-rm1.hostname</value>
</property>
<property>
<name>yarn.resourcemanager.hostname.rm2</name>
<value>yarn-rm2.hostname</value>
</property>
<property>
<name>yarn.resourcemanager.webapp.address.rm1</name>
<value>yarn-rm1.hostname:8088</value>
</property>
<property>
<name>yarn.resourcemanager.webapp.address.rm2</name>
<value>yarn-rm2.hostname:8088</value>
</property>
<property>
<name>hadoop.zk.address</name>
<value>zookeeper1:2181,zookeeper2:2181,zookeeper3:2181</value>
</property>
More detail:
I also checked the rest of the environment: ZooKeeper, HDFS, and the network connections are all healthy.

Could anyone check this, suggest what to set in yarn-site.xml, and tell me what I should fix in this case?

Thank you.

4 REPLIES

Community Manager

@Nitit_P Welcome to the Cloudera Community!

To help you get the best possible solution, I have tagged our YARN experts @Bharati and @PranavM, who may be able to assist you further.

Please keep us updated on your post, and we hope you find a satisfactory solution to your query.


Regards,

Diana Torres,
Community Moderator



Contributor

What behaviour are you seeing? Does the RM crash on startup or does it happen when it's been running for some time? The logs suggest that the RM cannot connect to ZooKeeper and is transitioning from active into standby state.

New Contributor

Hi @JimHalfpenny, this fatal shutdown happens while the RM is running.

I checked ZooKeeper and found these messages:
2022-12-07 02:39:07,482 [myid:3] - INFO [CommitProcessor:3:LearnerSessionTracker@116] - Committing global session 0x1000048d7880001
2023-02-25 08:29:09,649 [myid:3] - INFO [NIOWorkerThread-2:Learner@158] - Revalidating client: 0x1000048d7880001
2023-02-25 20:27:42,921 [myid:3] - INFO [RequestThrottler:QuorumZooKeeperServer@163] - Submitting global closeSession request for session 0x1000048d7880001

I'm not sure whether this is a ZooKeeper session timeout like the one in this case:

https://community.cloudera.com/t5/Support-Questions/Zookeeper-average-client-session-timeout/td-p/28...

 

Rising Star

How frequently does this occur?
It is worth checking for network drops, or you might consider increasing the timeouts on both sides, i.e. the ResourceManager and ZooKeeper.
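
As a rough sketch of the ResourceManager side of that suggestion: the RM log in the question shows "negotiated timeout = 10000", which matches the default of hadoop.zk.timeout-ms, and the ZooKeeper client gives up on a server after about two thirds of the session timeout without a response, which is consistent with the 6669ms in the warning. The fragment below raises the session timeout in yarn-site.xml. The value 30000 is only an illustrative assumption, not a figure from this thread, and would need tuning for the cluster. To my understanding, older releases name the same setting yarn.resourcemanager.zk-timeout-ms.

```xml
<!-- Sketch only: goes inside the existing <configuration> element of yarn-site.xml -->
<property>
  <!-- ZooKeeper session timeout used by the RM; the default is 10000 ms -->
  <name>hadoop.zk.timeout-ms</name>
  <!-- 30000 is an assumed example value, not taken from the thread -->
  <value>30000</value>
</property>
```

On the ZooKeeper side, the server clamps the requested session timeout to the range between minSessionTimeout and maxSessionTimeout in zoo.cfg, which default to 2 and 20 times tickTime, so a larger client value only takes effect if it stays within that maximum.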