
YARN node manager health alerts issue

Contributor

Hello,

For the past few weeks, the server has been throwing a large number of alerts related to YARN NodeManager health.

The health test result for YARN_NODE_MANAGERS_HEALTHY has become bad: Healthy NodeManager: 0. Concerning NodeManager: 0. Total NodeManager: 33. Percent healthy: 0.00%. Percent healthy or concerning: 0.00%. Critical threshold: 90.00%.

 

 

After going through the hadoop-yarn logs directory, I am seeing the log entries below...

 

WARN org.apache.hadoop.yarn.server.resourcemanager.RMAuditLogger: USER=yarn IP=*.*.*.* OPERATION=refreshNodes TARGET=AdminService RESULT=FAILURE DESCRIPTION=ResourceManager is not active. Can not refresh nodes. PERMISSIONS=
2019-05-02 WARN org.apache.hadoop.security.UserGroupInformation: PriviledgedActionException as:yarn/SERVER@HADOOP.COM (auth:KERBEROS) cause:org.apache.hadoop.ipc.StandbyException: ResourceManager rm137 is not Active!

 

 

See the class below:

https://github.com/hopshadoop/hops/blob/master/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/ha...

 

// From the AdminService class linked above: admin operations such as refreshNodes
// are rejected when this ResourceManager is not the Active one.
if (!isRMActive()) {
  RMAuditLogger.logFailure(user, operation, "",
      "AdminService", "ResourceManager is not active. Can not " + msg);
  throwStandbyException();
}

 

private void throwStandbyException() throws StandbyException {
  throw new StandbyException("ResourceManager " + rmId + " is not Active!");
}

 

Can somebody help me understand how the above error relates to this code?

 

7 REPLIES

Contributor

@Kamalindia 

 

The following warning:

 

WARN org.apache.hadoop.yarn.server.resourcemanager.RMAuditLogger: USER=yarn IP=*.*.*.* OPERATION=refreshNodes TARGET=AdminService RESULT=FAILURE DESCRIPTION=ResourceManager is not active. Can not refresh nodes. PERMISSIONS=

2019-05-02 WARN org.apache.hadoop.security.UserGroupInformation: PriviledgedActionException as:yarn/SERVER@HADOOP.COM (auth:KERBEROS) cause:org.apache.hadoop.ipc.StandbyException: ResourceManager rm137 is not Active!

is a general message that can show up in the Standby RM's logs whenever a "refreshNodes" action is performed, either internally within YARN or explicitly through the dropdown when the Refresh Nodes option is accessible - for example, when a NodeManager is decommissioned or recommissioned.

 

The above warning is benign: it simply indicates that the Standby RM is rejecting a refreshNodes command, which is expected since it is in standby. Note that if a failover happens and the Standby RM transitions to Active, it will perform a refresh as part of that transition anyway.

 

Therefore, you may have to review the Active RM logs instead for the timeframe when you saw the bad health alert for NMs.
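To confirm which RM is currently Active, you can run "yarn rmadmin -getServiceState <rm-id>" for each RM, or query the RM web services. Below is a rough sketch of the latter (the hostnames and port 8088 are placeholders, and SPNEGO handling for a Kerberized web UI is omitted); in an HA setup the /ws/v1/cluster/info response includes the haState field:

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.HttpURLConnection;
import java.net.URL;

// Rough sketch: print each RM's cluster-info response, which contains the
// "haState" (ACTIVE or STANDBY) in an HA deployment. Hostnames/port are
// placeholders; a Kerberized web UI would additionally need SPNEGO, which
// is not handled here.
public class CheckRmHaState {
  public static void main(String[] args) throws Exception {
    String[] resourceManagers = {"rm1.example.com", "rm2.example.com"};
    for (String host : resourceManagers) {
      URL url = new URL("http://" + host + ":8088/ws/v1/cluster/info");
      try {
        HttpURLConnection conn = (HttpURLConnection) url.openConnection();
        BufferedReader in = new BufferedReader(
            new InputStreamReader(conn.getInputStream()));
        StringBuilder body = new StringBuilder();
        String line;
        while ((line = in.readLine()) != null) {
          body.append(line);
        }
        in.close();
        System.out.println(host + " -> " + body);
      } catch (Exception e) {
        System.out.println(host + " -> not reachable: " + e.getMessage());
      }
    }
  }
}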

Contributor

Hello Sid,

After going through the Active ResourceManager's logs, I didn't find anything. Only INFO-level entries appear there; no warnings or errors.

What more do I need to check now?

 

Contributor

Hey @Kamalindia ,

 

Thanks for checking.

 

From the alert it seems that none of the NodeManagers were found to be healthy, which I find odd.

 

Can you kindly confirm if this alert is always present? Also, does this alert show up for any other CDH roles (other than NodeManager)?

 

The reason I ask is that I am wondering whether Cloudera Manager's Service Monitor role, which performs these health checks, is itself having issues.

 

You may want to check the Service Monitor logs for any errors, and review the charts around its JVM heap usage etc., to see whether Service Monitor itself is having issues that result in these alerts.

 

Another thing I can think of is to check the NodeManager logs to see whether those NodeManagers are actually functioning, and whether YARN is able to allocate containers to any of them.
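If it helps, a quick way to see what the RM itself thinks of each NodeManager is to pull the node reports through the YarnClient API. The sketch below is only an outline: it assumes yarn-site.xml is on the classpath and that a valid Kerberos ticket is available (kinit) on your secured cluster, and it prints each node's state together with its health-checker report:

import java.util.List;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.yarn.api.records.NodeReport;
import org.apache.hadoop.yarn.api.records.NodeState;
import org.apache.hadoop.yarn.client.api.YarnClient;
import org.apache.hadoop.yarn.conf.YarnConfiguration;

// Rough sketch: ask the Active ResourceManager for its view of each
// NodeManager, including the health report produced by the node health
// checker. Assumes yarn-site.xml is on the classpath and, on a Kerberized
// cluster, that a valid ticket is available.
public class ListNodeManagerHealth {
  public static void main(String[] args) throws Exception {
    Configuration conf = new YarnConfiguration();
    YarnClient yarnClient = YarnClient.createYarnClient();
    yarnClient.init(conf);
    yarnClient.start();
    try {
      List<NodeReport> reports = yarnClient.getNodeReports(
          NodeState.RUNNING, NodeState.UNHEALTHY, NodeState.LOST);
      for (NodeReport report : reports) {
        System.out.printf("%-40s %-12s %s%n",
            report.getNodeId(), report.getNodeState(), report.getHealthReport());
      }
    } finally {
      yarnClient.stop();
    }
  }
}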

Contributor

I have been getting Service Monitor alerts for a long time, but these YARN-specific alerts only started appearing last week.

Some of the information I am getting:

 

org.apache.hadoop.yarn.server.resourcemanager.RMAppManager$ApplicationSummary: appId=application_1555641507362_184814,name=INSERT OVERWRITE TABL...Test(Stage-1),user=hive,queue=root.van,state=FINISHED,trackingUrl=myhost:8088/proxy/application_1555641507362_184814/,appMasterHost=XYZ,startTime=1557309811366,finishTime=1557310013621,finalStatus=SUCCEEDED,memorySeconds=13123252,vcoreSeconds=2985,preemptedAMContainers=0,preemptedNonAMContainers=0,preemptedResources=<memory:0\, vCores:0>
2019-05-08 05:06:59,921 INFO org.apache.hadoop.yarn.server.resourcemanager.RMAppManager: Max number of completed apps kept in state store met: maxCompletedAppsInStateStore = 10000, removing app application_1555641507362_174834 from state store.
2019-05-08 05:06:59,921 INFO org.apache.hadoop.yarn.server.resourcemanager.RMAppManager: Application should be expired, max number of completed apps kept in memory met: maxCompletedAppsInMemory = 10000, removing app application_1555641507362_174834 from memory:
2019-05-08 05:06:59,921 INFO org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore: Removing info for app: application_1555641507362_174834
2019-05-08 05:07:00,047 INFO org.apache.hadoop.yarn.server.resourcemanager.rmcontainer.RMContainerImpl: container_e64_1555641507362_184829_01_000007 Container Transitioned from RUNNING to COMPLETED
2019-05-08 05:07:00,047 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FSAppAttempt: Completed container: container_e64_1555641507362_184829_01_000007 in state: COMPLETED event:FINISHED

 

 

 

Super Guru
@Kamalindia

Have you checked the RM web UI to see how many healthy and unhealthy nodes are reported and if the alert from CM is in sync with what RM is seeing?

Have you tried to run any YARN jobs to see if any of them can progress?

It is unlikely that all of them are down at the same time, and I am wondering if CM is not reporting correctly. Can you go into each individual NodeManager's home page in CM and see what is being reported there as well?
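If the web UI is hard to reach, the same counts are also exposed by the RM REST API under /ws/v1/cluster/metrics. A rough sketch follows (the hostname is a placeholder, SPNEGO handling for a Kerberized UI is omitted, and jackson-databind is assumed to be on the classpath); the printed numbers can then be compared with what CM shows:

import java.net.HttpURLConnection;
import java.net.URL;

import com.fasterxml.jackson.databind.JsonNode;
import com.fasterxml.jackson.databind.ObjectMapper;

// Rough sketch: read the Active RM's cluster metrics over REST and print the
// node counts so they can be compared with the numbers CM is alerting on.
// Hostname/port are placeholders; SPNEGO for a Kerberized web UI is omitted.
public class CompareNodeCounts {
  public static void main(String[] args) throws Exception {
    URL url = new URL("http://active-rm.example.com:8088/ws/v1/cluster/metrics");
    HttpURLConnection conn = (HttpURLConnection) url.openConnection();
    JsonNode metrics = new ObjectMapper()
        .readTree(conn.getInputStream())
        .path("clusterMetrics");
    System.out.println("total nodes     = " + metrics.path("totalNodes").asInt());
    System.out.println("active nodes    = " + metrics.path("activeNodes").asInt());
    System.out.println("unhealthy nodes = " + metrics.path("unhealthyNodes").asInt());
    System.out.println("lost nodes      = " + metrics.path("lostNodes").asInt());
    System.out.println("decommissioned  = " + metrics.path("decommissionedNodes").asInt());
  }
}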

Contributor

@Kamalindia 

With regards to the following info-level logs (not alerts):

 

2019-05-08 05:06:59,921 INFO org.apache.hadoop.yarn.server.resourcemanager.RMAppManager: Max number of completed apps kept in state store met: maxCompletedAppsInStateStore = 10000, removing app application_1555641507362_174834 from state store.
2019-05-08 05:06:59,921 INFO org.apache.hadoop.yarn.server.resourcemanager.RMAppManager: Application should be expired, max number of completed apps kept in memory met: maxCompletedAppsInMemory = 10000, removing app application_1555641507362_174834 from memory:
2019-05-08 05:06:59,921 INFO org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore: Removing info for app: application_1555641507362_174834

these are entirely normal.

 

YARN's ResourceManager process remembers the 10,000 most recently completed applications plus any currently active applications. All applications are displayed on the YARN ResourceManager Web UI, whether running, failed, or finished; however, the number of "completed" applications retained is capped at 10,000 by the setting "yarn.resourcemanager.max-completed-applications".

 

The list of applications to be displayed in RM Web UI is refreshed internally within RM such that only the 10,000 most recent completed applications are visible along with any currently active applications. 

For example, let us say 20,000 applications run in the cluster in one day, with the first 10,000 completing within the first 12 hours and the next 10,000 in the remaining 12 hours. Also, assume that no applications are currently shown in the RM Web UI (for example, because the RM(s) have just been restarted and there is nothing in the process memory).

 

After 12 hours, the first set of 10,000 applications would all have completed and would still be viewable in the RM Web UI. However, the next application that finishes will push the oldest one out. Another 12 hours later, none of the first 10,000 completed applications would still be visible, as they would have been completely replaced by the second set of 10,000. Thus, we effectively have a sliding window of the most recent 12 hours of completed applications in the RM Web UI.

 

Coming back to the logs above: they indicate that application_1555641507362_174834 will no longer be retained in the RM process's memory because there are already 10,000 other, more recently completed applications that the RM is keeping track of. This is entirely normal, as we cannot expect the RM to hold an infinite number of apps in memory.
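If you want to double-check the limit your cluster is actually using, the small sketch below just prints the effective values of the two retention settings (assuming yarn-site.xml is on the classpath; otherwise the shipped default of 10000 is shown):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.yarn.conf.YarnConfiguration;

// Rough sketch: print the effective application-retention settings so that the
// "maxCompletedAppsInMemory/maxCompletedAppsInStateStore = 10000" values seen
// in the RM log can be matched against the configuration. Assumes yarn-site.xml
// is on the classpath; otherwise the defaults are printed.
public class ShowAppRetentionSettings {
  public static void main(String[] args) {
    Configuration conf = new YarnConfiguration();
    int inMemory = conf.getInt(
        "yarn.resourcemanager.max-completed-applications", 10000);
    // In yarn-default.xml the state-store limit defaults to the in-memory limit.
    int inStateStore = conf.getInt(
        "yarn.resourcemanager.state-store.max-completed-applications", inMemory);
    System.out.println("yarn.resourcemanager.max-completed-applications = " + inMemory);
    System.out.println("yarn.resourcemanager.state-store.max-completed-applications = "
        + inStateStore);
  }
}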

New Contributor

Hi, could this cause a job to fail? Thanks.