I'm getting confused when trying to run a YARN process and hitting errors. Looking at the YARN section of the Ambari UI, I see the resource summary (note it says 60 GB available). Yet when I try to run a YARN process, I get errors indicating that fewer resources are available than Ambari is reporting, see...
➜  h2o-3.26.0.2-hdp3.1 hadoop jar h2odriver.jar -nodes 4 -mapperXmx 5g -output /home/ml1/hdfsOutputDir
Determining driver host interface for mapper->driver callback...
    [Possible callback IP address: 192.168.122.1]
    [Possible callback IP address: 172.18.4.49]
    [Possible callback IP address: 127.0.0.1]
Using mapper->driver callback IP address and port: 172.18.4.49:46721
(You can override these with -driverif and -driverport/-driverportrange and/or specify external IP using -extdriverif.)
Memory Settings:
    mapreduce.map.java.opts:     -Xms5g -Xmx5g -verbose:gc -XX:+PrintGCDetails -XX:+PrintGCTimeStamps -Dlog4j.defaultInitOverride=true
    Extra memory percent:        10
    mapreduce.map.memory.mb:     5632
Hive driver not present, not generating token.
19/08/07 12:37:19 INFO client.RMProxy: Connecting to ResourceManager at hw01.ucera.local/172.18.4.46:8050
19/08/07 12:37:19 INFO client.AHSProxy: Connecting to Application History server at hw02.ucera.local/172.18.4.47:10200
19/08/07 12:37:19 INFO mapreduce.JobResourceUploader: Disabling Erasure Coding for path: /user/ml1/.staging/job_1565057088651_0007
19/08/07 12:37:21 INFO mapreduce.JobSubmitter: number of splits:4
19/08/07 12:37:21 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_1565057088651_0007
19/08/07 12:37:21 INFO mapreduce.JobSubmitter: Executing with tokens: []
19/08/07 12:37:21 INFO conf.Configuration: found resource resource-types.xml at file:/etc/hadoop/3.1.0.0-78/0/resource-types.xml
19/08/07 12:37:21 INFO impl.YarnClientImpl: Submitted application application_1565057088651_0007
19/08/07 12:37:21 INFO mapreduce.Job: The url to track the job: http://HW01.ucera.local:8088/proxy/application_1565057088651_0007/
Job name 'H2O_80092' submitted
JobTracker job ID is 'job_1565057088651_0007'
For YARN users, logs command is 'yarn logs -applicationId application_1565057088651_0007'
Waiting for H2O cluster to come up...
19/08/07 12:37:38 INFO client.RMProxy: Connecting to ResourceManager at hw01.ucera.local/172.18.4.46:8050
19/08/07 12:37:38 INFO client.AHSProxy: Connecting to Application History server at hw02.ucera.local/172.18.4.47:10200

----- YARN cluster metrics -----
Number of YARN worker nodes: 4

----- Nodes -----
Node: http://HW03.ucera.local:8042 Rack: /default-rack, RUNNING, 1 containers used, 5.0 / 15.0 GB used, 1 / 3 vcores used
Node: http://HW04.ucera.local:8042 Rack: /default-rack, RUNNING, 0 containers used, 0.0 / 15.0 GB used, 0 / 3 vcores used
Node: http://hw05.ucera.local:8042 Rack: /default-rack, RUNNING, 0 containers used, 0.0 / 15.0 GB used, 0 / 3 vcores used
Node: http://HW02.ucera.local:8042 Rack: /default-rack, RUNNING, 0 containers used, 0.0 / 15.0 GB used, 0 / 3 vcores used

----- Queues -----
Queue name: default
    Queue state: RUNNING
    Current capacity: 0.08
    Capacity: 1.00
    Maximum capacity: 1.00
    Application count: 1
    ----- Applications in this queue -----
    Application ID: application_1565057088651_0007 (H2O_80092)
        Started: ml1 (Wed Aug 07 12:37:21 HST 2019)
        Application state: FINISHED
        Tracking URL: http://HW01.ucera.local:8088/proxy/application_1565057088651_0007/
        Queue name: default
        Used/Reserved containers: 1 / 0
        Needed/Used/Reserved memory: 5.0 GB / 5.0 GB / 0.0 GB
        Needed/Used/Reserved vcores: 1 / 1 / 0

Queue 'default' approximate utilization: 5.0 / 60.0 GB used, 1 / 12 vcores used

----------------------------------------------------------------------

ERROR: Unable to start any H2O nodes; please contact your YARN administrator.

       A common cause for this is the requested container size (5.5 GB)
       exceeds the following YARN settings:

           yarn.nodemanager.resource.memory-mb
           yarn.scheduler.maximum-allocation-mb

----------------------------------------------------------------------

For YARN users, logs command is 'yarn logs -applicationId application_1565057088651_0007'
Note the error at the end of the output:
ERROR: Unable to start any H2O nodes; please contact your YARN administrator.
A common cause for this is the requested container size (5.5 GB) exceeds the following YARN settings:
yarn.nodemanager.resource.memory-mb
yarn.scheduler.maximum-allocation-mb
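For reference, the 5.5 GB the error complains about is just the -mapperXmx value plus the extra memory percent shown near the top of the driver output: 5 GB × (1 + 10/100) = 5.5 GB = 5632 MB, which matches the mapreduce.map.memory.mb value the driver reports.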
Yet, I have YARN configured with
yarn.scheduler.maximum-allocation-vcores=3
yarn.nodemanager.resource.cpu-vcores=3
yarn.nodemanager.resource.memory-mb=15GB
yarn.scheduler.maximum-allocation-mb=15GB
so both the per-container and per-node resource limits (15 GB each) are higher than the requested container size of 5.5 GB.
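In case it matters, here is a rough sketch of how I plan to double-check that these are the values the ResourceManager and NodeManagers actually loaded (it assumes the standard HDP config symlink /etc/hadoop/conf and ssh access from the client machine to the worker hosts listed in the cluster metrics above):

# Effective memory limits in yarn-site.xml on each worker node
for h in HW02.ucera.local HW03.ucera.local HW04.ucera.local hw05.ucera.local; do
  echo "== $h =="
  ssh "$h" "grep -A 1 -E 'yarn.nodemanager.resource.memory-mb|yarn.scheduler.maximum-allocation-mb' /etc/hadoop/conf/yarn-site.xml"
done

# The scheduler's view of every node (total vs. used memory and vcores)
yarn node -list -all

If any node disagrees with what Ambari shows, that would at least be a lead.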
So there are some things about this that I don't understand:
Queue 'default' approximate utilization: 5.0 / 60.0 GB used, 1 / 12 vcores used
I would like to use the full 60 GB that YARN can ostensibly provide (or at least have the option to, rather than have errors thrown). I would think there should be enough resources for each of the 4 nodes to provide 15 GB (well above the requested 4 × 5.5 GB = 22 GB) to the process. Am I missing something here? Note that I only have the default root queue set up for YARN.
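Since there is only the one queue, I also plan to ask the scheduler directly what it thinks the default queue's limits are; a minimal sketch (stock YARN CLI, with the capacity-scheduler.xml path assumed from a standard HDP install):

# Scheduler's view of the default queue (state, capacity, current and maximum capacity)
yarn queue -status default

# Queue capacity settings on the ResourceManager host
grep -A 1 -E 'root.default.capacity|root.default.maximum-capacity' /etc/hadoop/conf/capacity-scheduler.xml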
----- Nodes -----
Node: http://HW03.ucera.local:8042 Rack: /default-rack, RUNNING, 1 containers used, 5.0 / 15.0 GB used, 1 / 3 vcores used
Node: http://HW04.ucera.local:8042 Rack: /default-rack, RUNNING, 0 containers used, 0.0 / 15.0 GB used, 0 / 3 vcores used
....
Why is only a single node being used before erroring out?
From these two things, it seems that neither the 15 GB node limit nor the 60 GB cluster limit is being exceeded, so why are these errors being thrown? What about this situation am I misinterpreting? What can be done to fix it (again, I would like to be able to use all of the apparent 60 GB of YARN resources for the job without error)? Any debugging suggestions or fixes?
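For my own debugging, the next step I see is pulling the application's status and aggregated container logs for the failed run (stock YARN CLI; the application id is the one from the output above):

# Overall state and diagnostics for the failed run
yarn application -status application_1565057088651_0007

# Aggregated container logs (the same command the driver output suggests)
yarn logs -applicationId application_1565057088651_0007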
Update (posted 08-26-2019 06:54 PM):
The problem appears to be related to "How to properly change uid for HDP / ambari-created user?" and the fact that having a user exist on a node and have an HDFS /user/<username> directory with the correct permissions (as I was led to believe from a Hortonworks forum post) is not sufficient for that user to be acknowledged as "existing" on the cluster.
Running the hadoop jar command as a different user (in this case, the Ambari-created hdfs user) that exists on all cluster nodes (even though Ambari created this user with different uids across nodes; I don't know if that is a problem) and that has an HDFS /user/hdfs directory, I found that the h2o jar ran as expected. I will look into this a bit more before posting it as an answer. Basically, I need a bit more clarification on when HDP considers a user to "exist" on a cluster.
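Roughly what I plan to check next to pin down what "existing" on the cluster actually requires (worker hostnames taken from the cluster metrics above; assumes ssh access from the client machine):

# Does the submitting user resolve to an account on every worker node?
for h in HW02.ucera.local HW03.ucera.local HW04.ucera.local hw05.ucera.local; do
  echo "== $h =="
  ssh "$h" "id ml1 && id hdfs"
done

# Do both users have HDFS home directories with the expected ownership?
hdfs dfs -ls -d /user/ml1 /user/hdfs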