Created 06-27-2021 11:13 AM
We have two ResourceManagers running as part of an HDP cluster.
The first ResourceManager fails after a couple of minutes.
In the ResourceManager log we see the following lines repeated many times:
2021-06-27 14:07:16,022 INFO scheduler.SchedulerNode (SchedulerNode.java:allocateContainer(152)) - Assigned container container_e83_1624802728037_0001_01_004355 of capacity <memory:86016, vCores:5> on host datanode23.fgtf.com:45454, which has 5 containers, <memory:199680, vCores:15> used and <memory:27708, vCores:75> available after allocation
2021-06-27 14:07:01,279 INFO fair.FSAppAttempt (FSAppAttempt.java:reservationExceedsThreshold(495)) - Reservation Exceeds Allowed number of nodes: app_id=application_1624802728037_0001 existingReservations=10 totalAvailableNodes=181 reservableNodesRatio=0.05 numAllowedReservations=10
2021-06-27 14:07:01,279 INFO fair.FSAppAttempt (FSAppAttempt.java:reservationExceedsThreshold(495)) - Reservation Exceeds Allowed number of nodes: app_id=application_1624802728037_0004 existingReservations=10 totalAvailableNodes=181 reservableNodesRatio=0.05 numAllowedReservations=10
2021-06-27 14:07:01,279 INFO fair.FSAppAttempt (FSAppAttempt.java:reservationExceedsThreshold(495)) - Reservation Exceeds Allowed number of nodes: app_id=application_1624802728037_0003 existingReservations=10 totalAvailableNodes=181 reservableNodesRatio=0.05 numAllowedReservations=10
2021-06-27 14:07:01,279 INFO fair.FSAppAttempt (FSAppAttempt.java:reservationExceedsThreshold(495)) - Reservation Exceeds Allowed number of nodes:
and
2021-06-27 14:07:01,282 INFO webproxy.WebAppProxyServlet (WebAppProxyServlet.java:doGet(382)) - dr.who is accessing unchecked http://22.20.101.13:53092/api/v1/applications/application_1624802728037_0009/executors which is the app master GUI of application_1624802728037_0009 owned by hdfs
2021-06-27 14:07:01,282 INFO webproxy.WebAppProxyServlet (WebAppProxyServlet.java:doGet(382)) - dr.who is accessing unchecked http://22.20.14.5:36198/api/v1/applications/application_1624801538018_0011/executors which is the app master GUI of application_1624801538018_0011 owned by hdfs
Any idea what this INFO message means?
INFO fair.FSAppAttempt (FSAppAttempt.java:reservationExceedsThreshold(495)) - Reservation Exceeds Allowed number of nodes: app_id=application_1624802728037_0001 existingReservations=10 totalAvailableNodes=181 reservableNodesRatio=0.05 numAllowedReservations=10
Created 06-29-2021 05:55 AM
Have you recently changed your YARN configs (CapacityScheduler or FairScheduler)? This seems related to oversubscription.
Can you share your queue configuration, i.e. capacity-scheduler.xml or fair-scheduler.xml?
Can you also check Ambari UI --> YARN --> Config --> Version to ensure there wasn't a change?
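For reference, the numbers in the FSAppAttempt message can be reproduced with a little arithmetic. The sketch below assumes the Fair Scheduler computes the limit as the ceiling of totalAvailableNodes × reservableNodesRatio and reports the message once an application already holds that many reservations. That assumption matches the logged values (181 × 0.05 = 9.05, rounded up to 10), but I have not checked it against your exact Hadoop version. The function names are mine, not Hadoop's.

```python
import math


def allowed_reservations(total_available_nodes: int, reservable_nodes_ratio: float) -> int:
    """Assumed rule: the limit is the ratio of the available nodes, rounded up."""
    return math.ceil(total_available_nodes * reservable_nodes_ratio)


def reservation_exceeds_threshold(existing_reservations: int,
                                  total_available_nodes: int,
                                  reservable_nodes_ratio: float) -> bool:
    """True when the application already holds the allowed number of reservations."""
    limit = allowed_reservations(total_available_nodes, reservable_nodes_ratio)
    return existing_reservations >= limit


# Values taken from the posted log line.
print(allowed_reservations(181, 0.05))               # numAllowedReservations
print(reservation_exceeds_threshold(10, 181, 0.05))  # existingReservations=10
```

If that reading is right, the message only says that an application has reached its cap on node reservations, so on its own it would not explain a ResourceManager crash. I believe the ratio is controlled by yarn.scheduler.fair.reservable-nodes, but please verify that against the documentation for your version.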
Geoffrey
Created 06-30-2021 04:55 AM
Here are the details:
capacity-scheduler=null
yarn.scheduler.capacity.default.minimum-user-limit-percent=100
yarn.scheduler.capacity.maximum-am-resource-percent=0.2
yarn.scheduler.capacity.maximum-applications=10000
yarn.scheduler.capacity.node-locality-delay=40
yarn.scheduler.capacity.root.accessible-node-labels=*
yarn.scheduler.capacity.root.acl_administer_queue=*
yarn.scheduler.capacity.root.capacity=100
yarn.scheduler.capacity.root.default.acl_administer_jobs=*
yarn.scheduler.capacity.root.default.acl_submit_applications=*
yarn.scheduler.capacity.root.default.capacity=100
yarn.scheduler.capacity.root.default.maximum-capacity=100
yarn.scheduler.capacity.root.default.state=RUNNING
yarn.scheduler.capacity.root.default.user-limit-factor=1
yarn.scheduler.capacity.root.queues=default
Created 06-30-2021 04:56 AM
and
yarn.scheduler.capacity.ordering-policy.priority-utilization.underutilized-preemption.enabled --> false
Created 07-01-2021 01:29 AM
Based on the logs and on what you see,
which scheduler should we select:
CapacityScheduler or FairScheduler?
Created on 07-01-2021 04:44 AM - edited 07-01-2021 04:45 AM
Can you share the files below from /var/log/hadoop-yarn/yarn?
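While collecting those, it may also help to confirm which scheduler is actually active. The fair.FSAppAttempt lines in your log come from the Fair Scheduler package, while the settings you posted are Capacity Scheduler properties. The sketch below reads yarn.resourcemanager.scheduler.class from a yarn-site.xml. The sample XML and the file path in the comment are assumptions for illustration, not values taken from your cluster.

```python
import xml.etree.ElementTree as ET
from typing import Optional


def get_property(xml_text: str, name: str) -> Optional[str]:
    """Return the value of a Hadoop-style <property> entry, or None if absent."""
    root = ET.fromstring(xml_text)
    for prop in root.iter("property"):
        if prop.findtext("name", default="").strip() == name:
            value = prop.findtext("value")
            return value.strip() if value is not None else None
    return None


# Hypothetical sample; on a real node, read the file instead, e.g.
#   xml_text = open("/etc/hadoop/conf/yarn-site.xml").read()   # path is an assumption
SAMPLE = """
<configuration>
  <property>
    <name>yarn.resourcemanager.scheduler.class</name>
    <value>org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler</value>
  </property>
</configuration>
"""

print(get_property(SAMPLE, "yarn.resourcemanager.scheduler.class"))
```

If the value ends in FairScheduler, the capacity-scheduler settings are not the ones in effect, and fair-scheduler.xml is the file worth sharing.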
Happy hadooping
Created 07-02-2021 03:03 AM
@mike_bronson7
Waiting for your response with the logs.
Created 07-17-2021 11:43 PM
The logs include some sensitive data, so I can't attach the full log content, but the lines I posted are the ones that appear most often in the logs. I also forgot to mention another problem: we can't access port 8088 either.
Created 07-18-2021 02:10 PM
Are you using the default capacity scheduler settings? No queues/leafs created? Is what you shared the current setting?
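On the port 8088 problem you mentioned, a quick TCP check from another node can tell you whether the ResourceManager web UI port is listening at all. This is only a minimal sketch. The hostnames in the comment are placeholders, and the check only shows whether a TCP connection succeeds, not whether the ResourceManager behind it is healthy.

```python
import socket


def port_is_open(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and name-resolution failures.
        return False


# Example usage (placeholder hostnames, replace with your ResourceManager hosts):
#   for host in ("rm1.example.com", "rm2.example.com"):
#       print(host, port_is_open(host, 8088))
```

If the port is closed on the first ResourceManager but open on the second, that would be consistent with the first one having gone down.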