Created 12-10-2018 07:27 PM
I’m trying to set up an HDP 3.0 cluster. I have 5 worker nodes, and only one of them has GPU devices.
I’ve installed NodeManager on the 5 nodes.
I’ve configured GPU isolation in Ambari (yarn.nodemanager.resource-plugins=yarn.io/gpu among other things).
When I try to start the NodeManagers on the nodes without a GPU, I get this error:
2018-12-10 16:51:51,039 - Execute['ulimit -c unlimited; export HADOOP_LIBEXEC_DIR=/usr/hdp/3.0.1.0-187/hadoop/libexec && /usr/hdp/3.0.1.0-187/hadoop-yarn/bin/yarn --config /usr/hdp/3.0.1.0-187/hadoop/conf --daemon start nodemanager'] {'not_if': 'ambari-sudo.sh -H -E test -f /var/run/hadoop-yarn/yarn/hadoop-yarn-nodemanager.pid && ambari-sudo.sh -H -E pgrep -F /var/run/hadoop-yarn/yarn/hadoop-yarn-nodemanager.pid', 'user': 'yarn'}
2018-12-10 16:51:53,235 - Execute['find /var/log/hadoop-yarn/yarn -maxdepth 1 -type f -name '*' -exec echo '==> {} <==' \; -exec tail -n 40 {} \;'] {'logoutput': True, 'ignore_failures': True, 'user': 'yarn'}
==> /var/log/hadoop-yarn/yarn/hadoop-yarn-nodemanager-xxxxxxxx.com.log <==
at org.apache.hadoop.service.AbstractService.init(AbstractService.java:164)
at org.apache.hadoop.yarn.server.nodemanager.NodeManager.initAndStartNodeManager(NodeManager.java:933)
at org.apache.hadoop.yarn.server.nodemanager.NodeManager.main(NodeManager.java:1013)
2018-12-10 16:51:52,236 ERROR nodemanager.LinuxContainerExecutor (LinuxContainerExecutor.java:init(323)) - Failed to bootstrap configured resource subsystems!
org.apache.hadoop.yarn.server.nodemanager.containermanager.linux.resources.ResourceHandlerException: org.apache.hadoop.yarn.exceptions.YarnException: yarn.nodemanager.resource-plugins.gpu.allowed-gpu-devices is set to auto, however automatically discovering GPU information failed, please check NodeManager log for more details, as an alternative, admin can specify yarn.nodemanager.resource-plugins.gpu.allowed-gpu-devices manually to enable GPU isolation.
at org.apache.hadoop.yarn.server.nodemanager.containermanager.linux.resources.gpu.GpuResourceHandlerImpl.bootstrap(GpuResourceHandlerImpl.java:78)
at org.apache.hadoop.yarn.server.nodemanager.containermanager.linux.resources.ResourceHandlerChain.bootstrap(ResourceHandlerChain.java:58)
at org.apache.hadoop.yarn.server.nodemanager.LinuxContainerExecutor.init(LinuxContainerExecutor.java:320)
at org.apache.hadoop.yarn.server.nodemanager.NodeManager.serviceInit(NodeManager.java:391)
at org.apache.hadoop.service.AbstractService.init(AbstractService.java:164)
at org.apache.hadoop.yarn.server.nodemanager.NodeManager.initAndStartNodeManager(NodeManager.java:933)
at org.apache.hadoop.yarn.server.nodemanager.NodeManager.main(NodeManager.java:1013)
Caused by: org.apache.hadoop.yarn.exceptions.YarnException: yarn.nodemanager.resource-plugins.gpu.allowed-gpu-devices is set to auto, however automatically discovering GPU information failed, please check NodeManager log for more details, as an alternative, admin can specify yarn.nodemanager.resource-plugins.gpu.allowed-gpu-devices manually to enable GPU isolation.
at org.apache.hadoop.yarn.server.nodemanager.containermanager.resourceplugin.gpu.GpuDiscoverer.getGpusUsableByYarn(GpuDiscoverer.java:166)
at org.apache.hadoop.yarn.server.nodemanager.containermanager.linux.resources.gpu.GpuResourceHandlerImpl.bootstrap(GpuResourceHandlerImpl.java:69)
... 6 more
2018-12-10 16:51:52,237 INFO service.AbstractService (AbstractService.java:noteFailure(267)) - Service NodeManager failed in state INITED
org.apache.hadoop.yarn.exceptions.YarnRuntimeException: Failed to initialize container executor
at org.apache.hadoop.yarn.server.nodemanager.NodeManager.serviceInit(NodeManager.java:393)
at org.apache.hadoop.service.AbstractService.init(AbstractService.java:164)
at org.apache.hadoop.yarn.server.nodemanager.NodeManager.initAndStartNodeManager(NodeManager.java:933)
at org.apache.hadoop.yarn.server.nodemanager.NodeManager.main(NodeManager.java:1013)
Caused by: java.io.IOException: Failed to bootstrap configured resource subsystems!
at org.apache.hadoop.yarn.server.nodemanager.LinuxContainerExecutor.init(LinuxContainerExecutor.java:324)
at org.apache.hadoop.yarn.server.nodemanager.NodeManager.serviceInit(NodeManager.java:391)
... 3 more
2018-12-10 16:51:52,238 ERROR nodemanager.NodeManager (NodeManager.java:initAndStartNodeManager(936)) - Error starting NodeManager
org.apache.hadoop.yarn.exceptions.YarnRuntimeException: Failed to initialize container executor
at org.apache.hadoop.yarn.server.nodemanager.NodeManager.serviceInit(NodeManager.java:393)
at org.apache.hadoop.service.AbstractService.init(AbstractService.java:164)
at org.apache.hadoop.yarn.server.nodemanager.NodeManager.initAndStartNodeManager(NodeManager.java:933)
at org.apache.hadoop.yarn.server.nodemanager.NodeManager.main(NodeManager.java:1013)
Caused by: java.io.IOException: Failed to bootstrap configured resource subsystems!
at org.apache.hadoop.yarn.server.nodemanager.LinuxContainerExecutor.init(LinuxContainerExecutor.java:324)
at org.apache.hadoop.yarn.server.nodemanager.NodeManager.serviceInit(NodeManager.java:391)
... 3 more
If I disable GPU in Ambari and try to run a YARN job on the node with GPUs, I get the following error:
18/12/10 16:21:01 WARN distributedshell.Client: AM Resource capability=
18/12/10 16:21:01 ERROR distributedshell.Client: Error running Client
org.apache.hadoop.yarn.exceptions.ResourceNotFoundException: Unknown resource: yarn.io/gpu
at org.apache.hadoop.yarn.applications.distributedshell.Client.validateResourceTypes(Client.java:1218)
at org.apache.hadoop.yarn.applications.distributedshell.Client.setContainerResources(Client.java:1204)
at org.apache.hadoop.yarn.applications.distributedshell.Client.run(Client.java:735)
at org.apache.hadoop.yarn.applications.distributedshell.Client.main(Client.java:265)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at org.apache.hadoop.util.RunJar.run(RunJar.java:318)
at org.apache.hadoop.util.RunJar.main(RunJar.java:232)
What’s the correct way to configure YARN for such a heterogeneous cluster?
Thanks a lot,
Jin
Created 12-17-2018 07:58 PM
The NodeManager will not start on nodes without a GPU if GPU support is enabled for them. Create a config group in YARN and place the node(s) with GPUs in that group. Then enable GPU isolation only in that config group's configuration, and leave it disabled in the default group.
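To make the split concrete, here is a rough sketch of how the properties could be divided between the two groups. It uses the property names that appear in the question and its error log, with one addition: yarn.resource-types comes from the Apache Hadoop "Using GPU On YARN" documentation and is my assumption about what has to stay defined cluster-wide so that the ResourceManager and clients still recognise yarn.io/gpu (its absence would be consistent with the "Unknown resource: yarn.io/gpu" error above). The exact place Ambari stores each property, and how it represents an empty value, may differ between versions, so treat this as a guide and not a verified configuration.

```
# Cluster-wide (ResourceManager and clients).
# Assumption: taken from the Apache Hadoop GPU documentation, not from this thread.
yarn.resource-types=yarn.io/gpu

# Default config group (the 4 nodes without a GPU):
# the GPU plugin stays off, so the NodeManager does not attempt GPU discovery.
yarn.nodemanager.resource-plugins=

# Config group containing only the GPU node:
yarn.nodemanager.resource-plugins=yarn.io/gpu
yarn.nodemanager.resource-plugins.gpu.allowed-gpu-devices=auto
```

After saving both groups, restart the NodeManagers so each host picks up the values from its own group.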
Created 12-28-2018 08:28 PM
Thanks a lot!
Jin
Created 12-31-2018 03:12 PM
I am also getting the same error.
cloudgpu-server:~/HDP# yarn jar /usr/hdp/3.1.0.0-78/hadoop-yarn/hadoop-yarn-applications-distributedshell.jar -jar /usr/hdp/3.1.0.0-78/hadoop-yarn/hadoop-yarn-applications-distributedshell.jar -shell_command /usr/bin/nvidia-smi -container_resources memory-mb=3072,vcores=1,yarn.io/gpu=1 -num_containers 2
18/12/31 17:04:34 INFO distributedshell.Client: Initializing Client
18/12/31 17:04:34 INFO distributedshell.Client: Running Client
18/12/31 17:04:34 INFO client.RMProxy: Connecting to ResourceManager at hostname/<ip_address>:8050
18/12/31 17:04:35 INFO client.AHSProxy: Connecting to Application History server at <hostname>:/<ip_address>:10200
18/12/31 17:04:35 INFO distributedshell.Client: Got Cluster metric info from ASM, numNodeManagers=4
18/12/31 17:04:35 INFO distributedshell.Client: Got Cluster node info from ASM
18/12/31 17:04:35 INFO distributedshell.Client: Got node report from ASM for, nodeId=cloudgpu-server.com:45454, nodeAddress=cloudgpu-server.com:8042, nodeRackName=/default-rack, nodeNumContainers=0
18/12/31 17:04:35 INFO distributedshell.Client: Got node report from ASM for, nodeId=<hostname>:45454, nodeAddress=<hostname>:8042, nodeRackName=/default-rack, nodeNumContainers=0
18/12/31 17:04:35 INFO distributedshell.Client: Got node report from ASM for, nodeId=<hostname>:45454, nodeAddress=<hostname>:8042, nodeRackName=/default-rack, nodeNumContainers=1
18/12/31 17:04:35 INFO distributedshell.Client: Got node report from ASM for, nodeId=<hostname>::45454, nodeAddress=<hostname>::8042, nodeRackName=/default-rack, nodeNumContainers=0
18/12/31 17:04:35 INFO distributedshell.Client: Queue info, queueName=default, queueCurrentCapacity=0.03125, queueMaxCapacity=1.0, queueApplicationCount=1, queueChildQueueCount=0
18/12/31 17:04:35 INFO distributedshell.Client: User ACL Info for Queue, queueName=root, userAcl=SUBMIT_APPLICATIONS
18/12/31 17:04:35 INFO distributedshell.Client: User ACL Info for Queue, queueName=root, userAcl=ADMINISTER_QUEUE
18/12/31 17:04:35 INFO distributedshell.Client: User ACL Info for Queue, queueName=default, userAcl=SUBMIT_APPLICATIONS
18/12/31 17:04:35 INFO distributedshell.Client: User ACL Info for Queue, queueName=default, userAcl=ADMINISTER_QUEUE
18/12/31 17:04:35 INFO distributedshell.Client: Max mem capability of resources in this cluster 8192
18/12/31 17:04:35 INFO distributedshell.Client: Max virtual cores capability of resources in this cluster 38
18/12/31 17:04:35 WARN distributedshell.Client: AM Memory not specified, use 100 mb as AM memory
18/12/31 17:04:35 WARN distributedshell.Client: AM vcore not specified, use 1 mb as AM vcores
18/12/31 17:04:35 WARN distributedshell.Client: AM Resource capability=<memory:100, vCores:1>
18/12/31 17:04:35 ERROR distributedshell.Client: Error running Client
org.apache.hadoop.yarn.exceptions.ResourceNotFoundException: Unknown resource: yarn.io/gpu
at org.apache.hadoop.yarn.applications.distributedshell.Client.validateResourceTypes(Client.java:1218)
at org.apache.hadoop.yarn.applications.distributedshell.Client.setContainerResources(Client.java:1204)
at org.apache.hadoop.yarn.applications.distributedshell.Client.run(Client.java:735)
at org.apache.hadoop.yarn.applications.distributedshell.Client.main(Client.java:265)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at org.apache.hadoop.util.RunJar.run(RunJar.java:318)
at org.apache.hadoop.util.RunJar.main(RunJar.java:232)
root@cloudgpu-server:~/HDP#
I have followed the steps in this post:
https://hortonworks.com/blog/gpus-support-in-apache-hadoop-3-1-yarn-hdp-3/#comment-26766
Can anyone advise on this?