
Unable to start Node Manager

Contributor

Hi @nsabharwal 

 

Greetings.

 

Need your inputs and expertise on this topic. 

 

Details: 

       1. I have configured a FAIR_TEST queue and set the Ordering to FAIR

       2. Have added "fair-scheduler.xml" in HADOOP_CONF_DIR default path (/usr/hdp/3.1.0.0-78/hadoop/conf) and have set minResources and maxResources to 4 GB and 8 GB respectively. 

       3. Changed the Scheduler Class in Ambari to fair scheduler class and added a parameter "yarn.scheduler.fair.allocation.file" to point to the above XML file. 

 

While re-starting the YARN affected components in Ambari,  I am getting the below error:

 

Can you please let me know what's going wrong and how to fix this issue. 

 

2019-12-19 09:48:17,762 INFO service.AbstractService (AbstractService.java:noteFailure(267)) - Service NodeManager failed in state INITED
java.lang.RuntimeException: java.lang.RuntimeException: class org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler not org.apache.hadoop.yarn.server.nodemanager.ContainerExecutor
at org.apache.hadoop.conf.Configuration.getClass(Configuration.java:2628)
at org.apache.hadoop.yarn.server.nodemanager.NodeManager.createContainerExecutor(NodeManager.java:347)
at org.apache.hadoop.yarn.server.nodemanager.NodeManager.serviceInit(NodeManager.java:389)
at org.apache.hadoop.service.AbstractService.init(AbstractService.java:164)
at org.apache.hadoop.yarn.server.nodemanager.NodeManager.initAndStartNodeManager(NodeManager.java:933)
at org.apache.hadoop.yarn.server.nodemanager.NodeManager.main(NodeManager.java:1013)
Caused by: java.lang.RuntimeException: class org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler not org.apache.hadoop.yarn.server.nodemanager.ContainerExecutor
at org.apache.hadoop.conf.Configuration.getClass(Configuration.java:2622)

 

fair-scheduler.xml

<configuration xmlns:xi="http://www.w3.org/2001/XInclude">

<allocations>
<queue name="FAIR-TEST">
<minResources>4096 mb,0vcores</minResources>
<maxResources>8192 mb,0vcores</maxResources>
<maxRunningApps>50</maxRunningApps>
<maxAMShare>0.1</maxAMShare>
<weight>30</weight>
<schedulingPolicy>fair</schedulingPolicy>
</queue>

<queuePlacementPolicy>
<rule name="specified" />
<rule name="default" queue="FAIR-TEST" />
</queuePlacementPolicy>
</allocations>

13 REPLIES

Contributor

Hi All,

 

Any updates on the below issue?  I am facing a lot of hurdles in getting this fixed. 

 

I would appreciate any quick inputs from anyone on fixing the problem. 

 

@jsensharma @nsabharwal - I am a newbie to Cloudera Community and have seen that both of you are Gurus.  Can you please help me in fixing this issue? 

 

Thanks and Regards, 

Sudhindra

Contributor

Hi All,

 

Can you please help? 

 

I have been stuck on the same issue for almost a week.  I would appreciate any quick help on this critical issue, which is blocking my tasks. 

 

Thanks and Regards,

Sudhindra

Contributor

Hi @KuldeepK @chennuri_gauris @Shelton 

 

I need your help in looking into this issue.   All the details are given below. 

 

Please let me know if any additional information is required. 

 

Thanks and Regards,

Sudhindra 

Contributor

Hi @KuldeepK @chennuri_gauris @Shelton, 

 

This is just a gentle reminder. I really need your help in looking into the below issue. 

 

I want to understand what additional steps are needed to set up a fair scheduler on YARN on HDP 3.1.0.0-78.

 

Thanks and Regards,

Sudhindra

Mentor

@ssk26 

I successfully configured the fair scheduler on the below HDP version

001a.PNG

Original scheduler 

003.PNG

        YARN UI

Default capacity scheduler after deployment of HDP

006.PNG

 

Pre-emption enabled  before the change to fair-scheduler

007.PNG

I grabbed the fair-scheduler.xml template and changed a few values for testing purposes, making sure the XML was still valid with an XML validator. I then copied fair-scheduler.xml to the $HADOOP_CONF directory and changed the owner and permissions.

 

# cd /usr/hdp/3.1.0.0-78/hadoop/conf

# chown hdfs:hadoop fair-scheduler.xml

# chmod 644 fair-scheduler.xml
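The validation step can also be scripted. Below is a minimal sketch, assuming Python 3 is available on the node; the helper name check_allocation_file and the sample strings are made up for illustration. It checks two things: that the file is well-formed XML, and that the top-level element is <allocations>, which is what the Fair Scheduler expects. Note that the file posted at the top of this thread opens with a <configuration> element that is never closed, so it would fail the first check.

```python
import xml.etree.ElementTree as ET


def check_allocation_file(xml_text):
    """Return a list of problems found in a Fair Scheduler allocation file."""
    try:
        root = ET.fromstring(xml_text)
    except ET.ParseError as err:
        # Unclosed or mismatched tags end up here.
        return [f"not well-formed XML: {err}"]

    problems = []
    if root.tag != "allocations":
        problems.append(
            f"top-level element is <{root.tag}>, expected <allocations>"
        )
    for queue in root.iter("queue"):
        if not queue.get("name"):
            problems.append("a <queue> element has no name attribute")
    return problems


# Illustrative inputs only; replace with the contents of your own file.
GOOD = """<allocations>
  <queue name="FAIR-TEST">
    <weight>1.0</weight>
  </queue>
</allocations>"""
WRAPPED = "<configuration>" + GOOD + "</configuration>"
UNCLOSED = "<configuration><allocations></allocations>"

for label, text in (("good", GOOD), ("wrapped", WRAPPED), ("unclosed", UNCLOSED)):
    print(label, check_allocation_file(text))
```

To check a real file, read it with open(path).read() and pass the text to the helper.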


Changed the Scheduler class in the yarn-site.xml see the attached screenshot.

004.PNG

From : yarn.resourcemanager.scheduler.class=org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler

To: yarn.resourcemanager.scheduler.class=org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler
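It is worth double-checking which property the class name ends up under. The stack trace in the original post comes from NodeManager.createContainerExecutor, which suggests (this is my reading of the trace, not something confirmed in the thread) that FairScheduler was entered under the NodeManager property yarn.nodemanager.container-executor.class rather than under yarn.resourcemanager.scheduler.class. A rough sketch of a check against a yarn-site.xml, with a made-up sample and hypothetical helper names:

```python
import xml.etree.ElementTree as ET

FAIR = "org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler"
SCHEDULER_KEY = "yarn.resourcemanager.scheduler.class"


def read_properties(xml_text):
    """Parse Hadoop-style <property><name/><value/></property> entries."""
    root = ET.fromstring(xml_text)
    props = {}
    for prop in root.iter("property"):
        name = (prop.findtext("name") or "").strip()
        value = (prop.findtext("value") or "").strip()
        if name:
            props[name] = value
    return props


def misplaced_scheduler_keys(props):
    """Keys other than the scheduler class that hold the FairScheduler class."""
    return sorted(k for k, v in props.items() if v == FAIR and k != SCHEDULER_KEY)


# Illustrative yarn-site.xml reproducing the suspected mistake.
SAMPLE = f"""<configuration>
  <property>
    <name>yarn.resourcemanager.scheduler.class</name>
    <value>{FAIR}</value>
  </property>
  <property>
    <name>yarn.nodemanager.container-executor.class</name>
    <value>{FAIR}</value>
  </property>
</configuration>"""

print(misplaced_scheduler_keys(read_properties(SAMPLE)))
```

An empty list means the class name only appears where the ResourceManager reads it.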

 

Added this new value in Custom yarn-site. The relative path is resolved by default against /usr/hdp/3.1.0.0-78/hadoop/conf

yarn.scheduler.fair.allocation.file=fair-scheduler.xml

005.PNG

Custom yarn-site.xml

Changed the below mandatory parameter to enable the ReservationSystem in the ResourceManager, which is not enabled by default

yarn.resourcemanager.reservation-system.enable=true

 

Disable pre-emption

009.PNG

Set the below properties as shown

The yarn-site.xml file contains parameters that determine scheduler-wide options, including the properties below. If they don't exist, add them in the custom yarn-site.

 

Note

The below property was already present, so I didn't add it

yarn.scheduler.capacity.ordering-policy.priority-utilization.underutilized-preemption.enabled=false

 

008b.PNG

 

Properties to verify

For my testing I didn't add the properties below. You will notice above that, despite disabling pre-emption in the Ambari UI, the fair scheduler shows it as enabled [True], and my queues aren't showing. I need to check my fair-scheduler.xml; attached is the template I used.

012.PNG

yarn.scheduler.fair.assignmultiple=false
yarn.scheduler.fair.sizebasedweight=false
yarn.scheduler.fair.user-as-default-queue=true
yarn.scheduler.fair.preemption=false

Note: Do not use preemption when FairScheduler DominantResourceFairness is in use and node labels are present.
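If you edit yarn-site.xml by hand instead of through Ambari, each key=value pair above becomes one <property> entry. A small sketch of that mapping follows; the helper to_yarn_site is made up for illustration, and on an Ambari-managed cluster manual edits are likely to be overwritten on restart, so treat this only as a way to see what the resulting XML should look like.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape

# Values taken from the steps in this post.
SETTINGS = {
    "yarn.resourcemanager.scheduler.class":
        "org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler",
    "yarn.scheduler.fair.allocation.file": "fair-scheduler.xml",
    "yarn.scheduler.fair.assignmultiple": "false",
    "yarn.scheduler.fair.sizebasedweight": "false",
    "yarn.scheduler.fair.user-as-default-queue": "true",
    "yarn.scheduler.fair.preemption": "false",
}


def to_yarn_site(settings):
    """Render key/value pairs as Hadoop-style <property> entries."""
    lines = ["<configuration>"]
    for name, value in settings.items():
        lines.append("  <property>")
        lines.append(f"    <name>{escape(name)}</name>")
        lines.append(f"    <value>{escape(value)}</value>")
        lines.append("  </property>")
    lines.append("</configuration>")
    return "\n".join(lines)


rendered = to_yarn_site(SETTINGS)
print(rendered)

# Round-trip check: parsing the output gives back the same pairs.
parsed = {
    p.findtext("name"): p.findtext("value")
    for p in ET.fromstring(rendered).iter("property")
}
```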

All in all, this shows the fair-scheduler configuration is doable and my RM is up and running !!
011.PNG

 

I also noticed that the above fair-scheduler template was overwritten when I checked the YARN Queue Manager, so I can now configure a new valid fair-scheduler

013.PNG

Happy Hadooping

 

 

Contributor

Hi @Shelton

 

Thanks for the reply.  I have been able to change the scheduling mode to Fair-Scheduler, which is great. 

 

However, my application is not running, due to resource allocation issue.  I am getting the below standard error.  

 

[Mon Dec 30 15:02:19 +0530 2019] Application is added to the scheduler and is not yet activated. (Resource request: <memory:1024, vCores:1> exceeds maximum AM resource allowed).

 

I am attaching all the relevant screenshots as well as information of my Yarn cluster for your reference. 

 

Please guide me in fixing this issue.  Why does this issue usually occur?

 

My YARN cluster has 2 nodes, scheduling mode as "Fair-Scheduler", minimum allocation of 1 GB/1 vcores and maximum allocation of 15GB/3 vcores and overall memory is 30GB. 

 

YARN_Screenshot_1.PNG

Given below is "fair-scheduler.xml"  contents:

fair_scheduler_screenshot.PNG

Below is the custom yarn-site parameters that have been set and the preemption is disabled as well. 

 

yarn_site_screenshot.PNG

 

Contributor

Hi @Shelton

 

Please help me on this.  I am again stuck on the issue.  Did I wrongly configure anything?  

 

Also, even after setting preemption to false, in the YARN Resource Manager UI, I am able to see that the preemption is still enabled.  Is this causing the problem?

 

Thanks and Regards, 

Sudhindra

Contributor

Hi @Shelton ,

 

I have made some progress on this issue.  I have modified the fair-scheduler.xml and have set both "maxAMShare" and "queueMaxAMShareDefault" to 0.8 and weight to default value (1.0). 

 

The result:  One spark job is running fine. However, I am getting the same error as before on the exceeding of maximum AM resources limit, when I try to run the next job. 

 

The modified fair-scheduler.xml is given below.  Please provide your inputs on how to fix this particular issue. 

 

Also, one interesting observation is that, even though the YARN Scheduling mode is showing as "Fair", the Spark Scheduling mode is still showing as "FIFO".  Can I set it to "Fair" as well through the program?  Since I am setting spark.master as "YARN", I believe the Fair scheduling mode will take precedence over the Spark scheduling mode. Please correct me if I am wrong. 

 

fair_scheduler_screenshot_3.PNG

Mentor

@ssk26 

queueMaxAMShareDefault and maxAMShare do not both need to be set: queueMaxAMShareDefault only sets the default AM share limit for queues, and it is overridden by the maxAMShare element in each queue.

Can you decrease queueMaxAMShareDefault or maxAMShare to 0.1 and set weight to 2.0?
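A quick way to reason about these values is to do the arithmetic. maxAMShare limits the fraction of the queue's fair share that application masters may use. The sketch below is a simplified model, not the scheduler's actual code: it assumes the queue's fair share for memory equals the 8192 MB maxResources from the posted file (the real fair share can be lower), and it uses the 1024 MB AM request from the error message.

```python
def max_am_mb(queue_share_mb, max_am_share):
    """Approximate memory available to application masters in a queue."""
    return int(queue_share_mb * max_am_share)


def ams_that_fit(queue_share_mb, max_am_share, am_request_mb):
    """How many AM containers of the given size fit under the AM limit."""
    return max_am_mb(queue_share_mb, max_am_share) // am_request_mb


QUEUE_SHARE_MB = 8192   # assumption: fair share equals maxResources
AM_REQUEST_MB = 1024    # from "Resource request: <memory:1024, vCores:1>"

for share in (0.1, 0.5, 0.8):
    limit = max_am_mb(QUEUE_SHARE_MB, share)
    count = ams_that_fit(QUEUE_SHARE_MB, share, AM_REQUEST_MB)
    print(f"maxAMShare={share}: AM limit {limit} MB, room for {count} AM(s)")
```

Under these assumptions a share of 0.1 leaves about 819 MB for AMs, which is below a single 1024 MB request, so it is worth checking the numbers for your own queue before lowering the value. The vcores side of the limit is ignored here.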

For Spark, create fairscheduler.xml from fairscheduler.xml.template

Your path might be different depending on your 3.1.x.x version.


# cp /usr/hdp/3.1.x.x-xx/etc/spark2/conf/fairscheduler.xml.template fairscheduler.xml

 

Please check the file permission


Then point Spark at the file, either by setting the spark.scheduler.allocation.file property in your SparkConf or by putting a file named fairscheduler.xml on the classpath.

Note that any pools not configured in the XML file will simply get default values for all settings (scheduling mode FIFO, weight 1, and minShare 0).

 

The fairscheduler.xml.template defines 2 default pools, production and test, using FAIR and FIFO respectively:

<allocations>
<pool name="production">
<schedulingMode>FAIR</schedulingMode>
<weight>1</weight>
<minShare>2</minShare>
</pool>
<pool name="test">
<schedulingMode>FIFO</schedulingMode>
<weight>2</weight>
<minShare>3</minShare>
</pool>
</allocations>
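To see what each pool ends up with, the file can be read back with a few lines of Python. This is only a sketch: read_pools is a made-up helper, the "adhoc" pool is a hypothetical addition to show the defaults, and filling in missing elements with the defaults is my assumption about how Spark treats a partially configured pool.

```python
import xml.etree.ElementTree as ET

# Defaults quoted above: scheduling mode FIFO, weight 1, minShare 0.
DEFAULTS = {"schedulingMode": "FIFO", "weight": "1", "minShare": "0"}


def read_pools(xml_text):
    """Map pool name -> settings, filling in defaults for missing elements."""
    pools = {}
    for pool in ET.fromstring(xml_text).iter("pool"):
        settings = {}
        for key, default in DEFAULTS.items():
            settings[key] = (pool.findtext(key) or default).strip()
        pools[pool.get("name")] = settings
    return pools


# The two template pools plus a hypothetical empty pool named "adhoc".
TEMPLATE = """<allocations>
  <pool name="production">
    <schedulingMode>FAIR</schedulingMode>
    <weight>1</weight>
    <minShare>2</minShare>
  </pool>
  <pool name="test">
    <schedulingMode>FIFO</schedulingMode>
    <weight>2</weight>
    <minShare>3</minShare>
  </pool>
  <pool name="adhoc"/>
</allocations>"""

pools = read_pools(TEMPLATE)
for name, settings in pools.items():
    print(name, settings)
```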

 

Without any intervention, newly submitted jobs go into a default pool, but jobs’ pools can be set by adding the spark.scheduler.pool “local property” to the SparkContext in the thread that’s submitting them. This is done as follows:

 

// Assuming sc is your SparkContext variable; this selects the FAIR pool "production"

sc.setLocalProperty("spark.scheduler.pool", "production")
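Since the job in this thread is a pyspark program, here is roughly how the same settings could look in Python. This is a sketch under assumptions: the function names are made up, the allocation file path is the $SPARK_HOME/conf location mentioned in this thread, and spark.scheduler.mode=FAIR is set because Spark schedules jobs inside one application in FIFO order unless told otherwise. The PySpark import sits inside the function so the settings can be inspected on a machine without Spark installed.

```python
def fair_scheduler_conf(allocation_file):
    """Spark properties that switch in-application scheduling to FAIR."""
    return {
        "spark.scheduler.mode": "FAIR",
        "spark.scheduler.allocation.file": allocation_file,
    }


def start_session_in_pool(conf, pool):
    """Create (or reuse) a SparkSession and submit this thread's jobs to a pool.

    Requires PySpark; it is imported here so the rest of the sketch
    runs without it.
    """
    from pyspark.sql import SparkSession

    builder = SparkSession.builder
    for key, value in conf.items():
        builder = builder.config(key, value)
    spark = builder.getOrCreate()
    # Jobs submitted from this thread now go to the named pool.
    spark.sparkContext.setLocalProperty("spark.scheduler.pool", pool)
    return spark


# Path assumed from the spark2/conf directory mentioned in this thread.
conf = fair_scheduler_conf("/usr/hdp/3.1.0.0-78/spark2/conf/fairscheduler.xml")
print(conf)

# On a node with PySpark installed:
# spark = start_session_in_pool(conf, "production")
```

These settings only control how jobs share resources inside one Spark application. How YARN shares the cluster between applications is configured separately in the YARN fair-scheduler.xml.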

 

Please let me know 

Contributor

Hi @Shelton ,

 

Thanks for your inputs.  If I understand correctly, there will be 2 fair-scheduler.xml files?  One for YARN kept in $HADOOP_CONF_DIR and one more in $SPARK_HOME? 

 

For fair-scheduler.xml belonging to Spark, how to configure the parameter in Ambari?  

 

Also, the queueMaxAMShareDefault or maxAMShare value - earlier it was 0.5 only - but since it was not launching the jobs due to the AM resource exceeded error, I did set it to 0.8 - I will try setting it to 0.1 and will check it. 

 

Please let me know your inputs.

 

Thanks and Regards,

Sudhindra

Contributor

Hi @Shelton ,

 

I went through your email again and tried out all the options that you have mentioned.  But, I am still facing the same issue, while running the second job. 

 

Please let me know if anything else needs to be set, or it is a pure memory related issue and kindly suggest on fixing this issue. 

 

Here are the relevant screenshots:

 

Screenshot 1:  YARN Fair Scheduler XML file (I tried setting maxAMShare to 0.1 - but the first spark job didn't start at all - so I had to bump it to 0.5) 

 

fair_scheduler_screenshot_4.PNG

Screenshot 2:  Spark Fair Scheduler XML file (this is placed in $SPARK_HOME/conf directory, i.e /usr/hdp/3.1.0.0-78/spark2/conf) 

spark_fair_scheduler_2.PNG

Screenshot 3:  Spark Configuration Parameters set through the pyspark program 

spark_fair_scheduler.PNG

Screenshot 4: YARN Cluster Information (Total number of VCORES is 6 and Total amount of memory present in 2 node cluster is 15.3 GB) 

Note: Since this is a Flask application, I believe it will launch 2 jobs, one to open port 5000 and another to accept the inputs.  The whole idea behind this exercise is to test how many Spark sessions can run in parallel in a single Spark Context. 

Yarn_Cluster_Status_1.PNG

Screenshot 5: This shows the percentage usage of queue as well as cluster by the first job. As we can see, there is sufficient space in both Cluster as well as Queue. But, for some reason, the second job never gets the required amount of resources. I know this could be because the fair-scheduler's maximum allocation is set to 3GB. Can you please let me know how to bump up this value.  I am also curious here - even though the maxResources in fair-scheduler.xml file is set to 8 GB, the fair scheduler's maximum allocation is set to 3 GB only. Is it because of the value of maxAMShare? 

 

Also, I am supplying both driver and executor memory to 512 MB only. How is my job occupying 3 GB of space? 
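One possible explanation for the 3 GB, worked out as arithmetic. The assumptions here are mine and not confirmed from the screenshots: Spark adds a memory overhead of max(384 MB, 10% of the requested memory) to every container, YARN rounds each request up to a multiple of the 1 GB minimum allocation mentioned earlier, and the job runs with one application master plus Spark's default of two executors.

```python
import math


def container_mb(heap_mb, min_overhead_mb=384, overhead_factor=0.10,
                 increment_mb=1024):
    """Approximate YARN container size for a Spark JVM with the given heap."""
    overhead = max(min_overhead_mb, int(heap_mb * overhead_factor))
    requested = heap_mb + overhead
    # YARN hands out memory in multiples of the allocation increment.
    return math.ceil(requested / increment_mb) * increment_mb


am_mb = container_mb(512)             # application master container
executor_mb = container_mb(512)       # one executor container
total_mb = am_mb + 2 * executor_mb    # assumption: two executors

print(f"AM: {am_mb} MB, executor: {executor_mb} MB, total: {total_mb} MB")
```

With these assumptions each 512 MB request becomes a 1024 MB container, which matches the <memory:1024, vCores:1> AM request in the earlier error message, and three such containers add up to 3072 MB.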

 

Yarn_Cluster_Status_2.PNG

Screenshot 6: This screenshot shows that the job 2 never gets the required amount of resources.

Yarn_Cluster_Status_3.PNG

 

Contributor

Hi @Shelton ,

 

Can you please help me in fixing this issue. 

 

With the same memory configuration as mentioned, I am able to run more than 1 spark job with Capacity Scheduler, while it's not possible to run the second Spark job with Fair Scheduler. 

 

I have already sent you the required screenshots. Please let me know your inputs at the earliest. 

 

Thanks and Regards,

Sudhindra

Contributor

Hi @Shelton ,

 

I was able to achieve my objective of running multiple Spark Sessions under a single Spark context using YARN capacity scheduler and Spark Fair Scheduling. 

 

However, the issue still remains with YARN fair scheduler.  The second Spark job is still not running (with the same memory configuration) due to lack of resources. 

 

So, what additional parameters need to be set for YARN fair scheduler to achieve this? 

 

Please help me in fixing this issue. 

 

Thanks and Regards,

Sudhindra