Member since: 10-12-2015
Posts: 191
Kudos Received: 2
Solutions: 0
05-27-2019 09:39 PM
@Kamalindia With regards to the following info-level logs (not alerts):

2019-05-08 05:06:59,921 INFO org.apache.hadoop.yarn.server.resourcemanager.RMAppManager: Max number of completed apps kept in state store met: maxCompletedAppsInStateStore = 10000, removing app application_1555641507362_174834 from state store.
2019-05-08 05:06:59,921 INFO org.apache.hadoop.yarn.server.resourcemanager.RMAppManager: Application should be expired, max number of completed apps kept in memory met: maxCompletedAppsInMemory = 10000, removing app application_1555641507362_174834 from memory:
2019-05-08 05:06:59,921 INFO org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore: Removing info for app: application_1555641507362_174834

These are entirely normal. YARN's ResourceManager remembers at most the 10,000 most recently completed applications, plus any currently active applications. All applications, whether running, failed, or finished, are displayed on the ResourceManager Web UI, but the number of "completed" applications shown is capped at 10,000 by the setting "yarn.resourcemanager.max-completed-applications". The list of applications displayed in the RM Web UI is refreshed internally within the RM so that only the 10,000 most recently completed applications remain visible, along with any currently active applications.

For example, suppose 20,000 applications run in the cluster in one day: the first 10,000 complete within the first 12 hours and the next 10,000 in the remaining 12 hours. Assume also that no applications are currently shown in the RM Web UI (for example, the RM/s were just restarted and there is nothing in process memory). After 12 hours, the first set of 10,000 applications have all completed and are all still viewable in the RM Web UI. However, the next application that finishes pushes the oldest one out, and another 12 hours later none of the first 10,000 completed applications is visible any more, as they have been completely replaced by the second set of 10,000. In effect, the RM Web UI shows a sliding window of roughly 12 hours of completed applications.

Coming back to the logs above, they indicate that application_1555641507362_174834 will no longer be retained in the RM process's memory because the RM is already tracking 10,000 other recently completed applications. This is entirely normal, since we cannot expect the RM to hold an unlimited number of applications in memory.
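If you want to see how many completed applications the RM is still tracking, or whether a particular application has already aged out of that window, the stock yarn CLI can show you. The sketch below is only an illustration; the count is governed by yarn.resourcemanager.max-completed-applications (and its state-store counterpart yarn.resourcemanager.state-store.max-completed-applications), both effectively 10,000 here.

# Count the completed applications the RM is still tracking; this number
# will not exceed yarn.resourcemanager.max-completed-applications.
yarn application -list -appStates FINISHED,FAILED,KILLED 2>/dev/null | grep -c 'application_'

# An application that has aged out of the RM's window is no longer returned by
# the RM itself; its history is then only available from the JobHistory Server
# or Timeline Server, where configured.
yarn application -status application_1555641507362_174834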
05-08-2019 06:57 PM
1 Kudo
@jbowles Yes, it is advisable to clean up the NM local directories when changing the LCE setting. Please see https://www.cloudera.com/documentation/enterprise/5-10-x/topics/cdh_sg_other_hadoop_security.html#topic_18_3

Important: Configuration changes to the Linux container executor could result in local NodeManager directories (such as usercache) being left with incorrect permissions. To avoid this, when making changes using either Cloudera Manager or the command line, first manually remove the existing NodeManager local directories from all configured local directories (yarn.nodemanager.local-dirs), and let the NodeManager recreate the directory structure.
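As a rough illustration of that procedure (a sketch only, not a definitive recipe): stop the NodeManager role on the host, clear the existing contents of every directory configured in yarn.nodemanager.local-dirs, then start the role again so it recreates the structure. The /yarn/nm path below is just a placeholder for whatever your cluster actually uses.

# Run on each NodeManager host, with the NodeManager role stopped
# (e.g. via Cloudera Manager). /yarn/nm is a placeholder; substitute
# every directory listed in yarn.nodemanager.local-dirs.
for d in /yarn/nm; do
  sudo rm -rf "${d:?}"/usercache "${d:?}"/filecache "${d:?}"/nmPrivate
done

# Start the NodeManager role again; it recreates the directory structure
# with permissions appropriate to the new container-executor setting.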
05-07-2019 10:40 PM
Hey @Kamalindia , Thanks for checking. From the warning it seems that none of the NodeManagers were found to be healthy, which I find odd. Can you kindly confirm whether this alert is always present? Also, does this alert show up for any other CDH roles (other than NodeManager)? I ask because I am wondering whether Cloudera Manager's Service Monitor role, which performs these health checks, is itself having issues. You may want to review the Service Monitor logs for errors, and its charts (JVM heap usage and the like), to see whether Service Monitor trouble is what is producing these alerts. Another thing worth checking is the NodeManager logs, to see whether those NodeManagers are functioning and whether YARN is able to allocate containers to any of them.
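One quick way to cross-check the NodeManagers from the YARN side, independent of Service Monitor, is the yarn node CLI (a sketch; <node-id> is a placeholder of the form host:port):

# List every NodeManager together with its state as seen by the ResourceManager.
yarn node -list -all

# Show only nodes the RM itself currently considers unhealthy or lost, if any.
yarn node -list -states UNHEALTHY,LOST

# For one node, the status output includes its health report and last health update.
yarn node -status <node-id>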
05-07-2019 07:59 PM
@Kamalindia The following warning:

WARN org.apache.hadoop.yarn.server.resourcemanager.RMAuditLogger: USER=yarn IP=*.*.*.* OPERATION=refreshNodes TARGET=AdminService RESULT=FAILURE DESCRIPTION=ResourceManager is not active. Can not refresh nodes. PERMISSIONS=
2019-05-02 WARN org.apache.hadoop.security.UserGroupInformation: PriviledgedActionException as:yarn/SERVER@HADOOP.COM (auth:KERBEROS) cause:org.apache.hadoop.ipc.StandbyException: ResourceManager rm137 is not Active!

is a general warning that can show up in the Standby RM logs whenever a "refreshNodes" action is performed, either internally within YARN or explicitly through the Refresh Nodes option where it is accessible, for example when a NodeManager is decommissioned or recommissioned. The warning is benign: it simply shows the Standby RM rejecting a refreshNodes command, which is expected since it is in standby. Note that if a failover happens and the Standby RM transitions to Active, it refreshes its node list anyway as part of that transition. You may therefore want to review the Active RM logs instead for the timeframe when you saw the bad health alert for the NMs.
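If you want to confirm which RM is currently Active before digging into its logs, the yarn rmadmin CLI can tell you (a sketch; rm137 is the RM ID from your log, and the second ID is a placeholder for whatever else is listed under yarn.resourcemanager.ha.rm-ids):

# Check the HA state of each ResourceManager by its rm-id.
yarn rmadmin -getServiceState rm137
yarn rmadmin -getServiceState <other-rm-id>

# The refreshNodes admin operation is only serviced by the Active RM; when it
# reaches the Standby, the Standby rejects it with the warning quoted above.
yarn rmadmin -refreshNodes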
03-13-2019 07:40 PM
Kindly refer to https://community.cloudera.com/t5/CDH-Manual-Installation/Upgrade-unmanaged-CDH-Cluster-to-6-1-from-5-16/m-p/87771#M1909 @tuk