
How to configure YARN in an HDP 3.0 cluster where some servers have GPUs and others do not?

New Contributor

I’m trying to set up an HDP 3.0 cluster. I have 5 worker nodes, and only one of them has GPU devices.

I’ve installed NodeManager on all 5 nodes.

I’ve configured GPU isolation in Ambari (yarn.nodemanager.resource-plugins=yarn.io/gpu among other things).

When I try to start the NodeManagers on the nodes without a GPU, I get this error:

	2018-12-10 16:51:51,039 - Execute['ulimit -c unlimited; export HADOOP_LIBEXEC_DIR=/usr/hdp/3.0.1.0-187/hadoop/libexec && /usr/hdp/3.0.1.0-187/hadoop-yarn/bin/yarn --config /usr/hdp/3.0.1.0-187/hadoop/conf --daemon start nodemanager'] {'not_if': 'ambari-sudo.sh -H -E test -f /var/run/hadoop-yarn/yarn/hadoop-yarn-nodemanager.pid && ambari-sudo.sh -H -E pgrep -F /var/run/hadoop-yarn/yarn/hadoop-yarn-nodemanager.pid', 'user': 'yarn'}
	2018-12-10 16:51:53,235 - Execute['find /var/log/hadoop-yarn/yarn -maxdepth 1 -type f -name '*' -exec echo '==> {} <==' \; -exec tail -n 40 {} \;'] {'logoutput': True, 'ignore_failures': True, 'user': 'yarn'}
	==> /var/log/hadoop-yarn/yarn/hadoop-yarn-nodemanager-xxxxxxxx.com.log <==
	at
org.apache.hadoop.service.AbstractService.init(AbstractService.java:164)

	at
org.apache.hadoop.yarn.server.nodemanager.NodeManager.initAndStartNodeManager(NodeManager.java:933)

	at
org.apache.hadoop.yarn.server.nodemanager.NodeManager.main(NodeManager.java:1013)

	2018-12-10 16:51:52,236 ERROR
nodemanager.LinuxContainerExecutor (LinuxContainerExecutor.java:init(323)) -
Failed to bootstrap configured resource subsystems!

	org.apache.hadoop.yarn.server.nodemanager.containermanager.linux.resources.ResourceHandlerException:
org.apache.hadoop.yarn.exceptions.YarnException:
yarn.nodemanager.resource-plugins.gpu.allowed-gpu-devices is set to auto,
however automatically discovering GPU information failed, please check
NodeManager log for more details, as an alternative, admin can specify yarn.nodemanager.resource-plugins.gpu.allowed-gpu-devices
manually to enable GPU isolation.

	at
org.apache.hadoop.yarn.server.nodemanager.containermanager.linux.resources.gpu.GpuResourceHandlerImpl.bootstrap(GpuResourceHandlerImpl.java:78)

	at
org.apache.hadoop.yarn.server.nodemanager.containermanager.linux.resources.ResourceHandlerChain.bootstrap(ResourceHandlerChain.java:58)

	at
org.apache.hadoop.yarn.server.nodemanager.LinuxContainerExecutor.init(LinuxContainerExecutor.java:320)

	at
org.apache.hadoop.yarn.server.nodemanager.NodeManager.serviceInit(NodeManager.java:391)

	at
org.apache.hadoop.service.AbstractService.init(AbstractService.java:164)

	at
org.apache.hadoop.yarn.server.nodemanager.NodeManager.initAndStartNodeManager(NodeManager.java:933)

	at
org.apache.hadoop.yarn.server.nodemanager.NodeManager.main(NodeManager.java:1013)

	Caused by:
org.apache.hadoop.yarn.exceptions.YarnException:
yarn.nodemanager.resource-plugins.gpu.allowed-gpu-devices is set to auto,
however automatically discovering GPU information failed, please check
NodeManager log for more details, as an alternative, admin can specify
yarn.nodemanager.resource-plugins.gpu.allowed-gpu-devices manually to enable
GPU isolation.

	at
org.apache.hadoop.yarn.server.nodemanager.containermanager.resourceplugin.gpu.GpuDiscoverer.getGpusUsableByYarn(GpuDiscoverer.java:166)

	at
org.apache.hadoop.yarn.server.nodemanager.containermanager.linux.resources.gpu.GpuResourceHandlerImpl.bootstrap(GpuResourceHandlerImpl.java:69)
	... 6 more

	2018-12-10 16:51:52,237 INFO service.AbstractService (AbstractService.java:noteFailure(267)) - Service NodeManager failed in state INITED

	org.apache.hadoop.yarn.exceptions.YarnRuntimeException:
Failed to initialize container executor

	at
org.apache.hadoop.yarn.server.nodemanager.NodeManager.serviceInit(NodeManager.java:393)

	at
org.apache.hadoop.service.AbstractService.init(AbstractService.java:164)

	at
org.apache.hadoop.yarn.server.nodemanager.NodeManager.initAndStartNodeManager(NodeManager.java:933)

	at
org.apache.hadoop.yarn.server.nodemanager.NodeManager.main(NodeManager.java:1013)

	Caused by: java.io.IOException: Failed to
bootstrap configured resource subsystems!

	at
org.apache.hadoop.yarn.server.nodemanager.LinuxContainerExecutor.init(LinuxContainerExecutor.java:324)

	at
org.apache.hadoop.yarn.server.nodemanager.NodeManager.serviceInit(NodeManager.java:391)

	... 3 more

	2018-12-10 16:51:52,238 ERROR
nodemanager.NodeManager (NodeManager.java:initAndStartNodeManager(936)) - Error
starting NodeManager

	org.apache.hadoop.yarn.exceptions.YarnRuntimeException:
Failed to initialize container executor

	at
org.apache.hadoop.yarn.server.nodemanager.NodeManager.serviceInit(NodeManager.java:393)

	at
org.apache.hadoop.service.AbstractService.init(AbstractService.java:164)

	at
org.apache.hadoop.yarn.server.nodemanager.NodeManager.initAndStartNodeManager(NodeManager.java:933)

	at
org.apache.hadoop.yarn.server.nodemanager.NodeManager.main(NodeManager.java:1013)

	Caused by: java.io.IOException: Failed to
bootstrap configured resource subsystems!

	at
org.apache.hadoop.yarn.server.nodemanager.LinuxContainerExecutor.init(LinuxContainerExecutor.java:324)

	at
org.apache.hadoop.yarn.server.nodemanager.NodeManager.serviceInit(NodeManager.java:391)

	... 3 more

If I disable GPU support in Ambari and try to run a YARN job on the node with GPUs, I get the following error:

	18/12/10 16:21:01 WARN
distributedshell.Client: AM Resource capability=

	18/12/10 16:21:01 ERROR
distributedshell.Client: Error running Client

	org.apache.hadoop.yarn.exceptions.ResourceNotFoundException:
Unknown resource: yarn.io/gpu

	at
org.apache.hadoop.yarn.applications.distributedshell.Client.validateResourceTypes(Client.java:1218)

	at
org.apache.hadoop.yarn.applications.distributedshell.Client.setContainerResources(Client.java:1204)

	at org.apache.hadoop.yarn.applications.distributedshell.Client.run(Client.java:735)

	at
org.apache.hadoop.yarn.applications.distributedshell.Client.main(Client.java:265)

	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)

	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)

	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)

	at java.lang.reflect.Method.invoke(Method.java:498)

	at org.apache.hadoop.util.RunJar.run(RunJar.java:318)

	at org.apache.hadoop.util.RunJar.main(RunJar.java:232)

What’s the correct way to configure YARN for such a heterogeneous cluster?

Thanks a lot,

Jin

3 REPLIES

Re: How to configure YARN in an HDP 3.0 cluster where some servers have GPUs and others do not?

New Contributor

NodeManagers on nodes without a GPU will not start if GPU support is enabled for them. Create a config group in YARN and place the node(s) with GPUs in that group. Then enable GPU isolation only in the configuration for that config group.
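As a rough illustration of the split described above, the sketch below plans which hosts go into a GPU-only config group and which `yarn-site` overrides each group gets. The function name `plan_config_groups`, the group names, and the host names are hypothetical; only the two property names are taken from the thread (the question and the error message). It does not call Ambari, it only builds the plan as a dictionary, and a real setup would involve the other GPU isolation settings the question mentions as "among other things".

```python
# Hypothetical helper: plans a split of YARN GPU settings between the default
# config group and a GPU-only config group. Group and host names are made up.

GPU_RESOURCE = "yarn.io/gpu"


def plan_config_groups(hosts_with_gpu, all_hosts):
    """Return, per config group, its hosts and its yarn-site overrides."""
    gpu_hosts = sorted(set(hosts_with_gpu) & set(all_hosts))
    cpu_hosts = sorted(set(all_hosts) - set(gpu_hosts))
    return {
        "default": {
            "hosts": cpu_hosts,
            # Assumption: leaving the plugin list empty means these
            # NodeManagers never attempt GPU discovery at startup.
            "yarn-site": {"yarn.nodemanager.resource-plugins": ""},
        },
        "gpu-nodes": {
            "hosts": gpu_hosts,
            "yarn-site": {
                "yarn.nodemanager.resource-plugins": GPU_RESOURCE,
                "yarn.nodemanager.resource-plugins.gpu.allowed-gpu-devices": "auto",
            },
        },
    }


if __name__ == "__main__":
    workers = ["worker1", "worker2", "worker3", "worker4", "worker5"]
    plan = plan_config_groups(["worker5"], workers)
    for name, group in plan.items():
        print(name, group["hosts"], group["yarn-site"])
```

Keeping the GPU plugin out of the default group is what lets the four non-GPU NodeManagers start, since they no longer try to discover GPU devices.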


Re: How to configure YARN in an HDP 3.0 cluster where some servers have GPUs and others do not?

New Contributor

Thanks a lot!

Jin

Re: How to configure YARN in an HDP 3.0 cluster where some servers have GPUs and others do not?

New Contributor

I am also getting the same error.

cloudgpu-server:~/HDP# yarn jar /usr/hdp/3.1.0.0-78/hadoop-yarn/hadoop-yarn-applications-distributedshell.jar -jar /usr/hdp/3.1.0.0-78/hadoop-yarn/hadoop-yarn-applications-distributedshell.jar -shell_command /usr/bin/nvidia-smi -container_resources memory-mb=3072,vcores=1,yarn.io/gpu=1 -num_containers 2

18/12/31 17:04:34 INFO distributedshell.Client: Initializing Client

18/12/31 17:04:34 INFO distributedshell.Client: Running Client

18/12/31 17:04:34 INFO client.RMProxy: Connecting to ResourceManager at hostname/<ip_address>:8050

18/12/31 17:04:35 INFO client.AHSProxy: Connecting to Application History server at <hostname>:/<ip_address>:10200

18/12/31 17:04:35 INFO distributedshell.Client: Got Cluster metric info from ASM, numNodeManagers=4

18/12/31 17:04:35 INFO distributedshell.Client: Got Cluster node info from ASM

18/12/31 17:04:35 INFO distributedshell.Client: Got node report from ASM for, nodeId=cloudgpu-server.com:45454, nodeAddress=cloudgpu-server.com:8042, nodeRackName=/default-rack, nodeNumContainers=0

18/12/31 17:04:35 INFO distributedshell.Client: Got node report from ASM for, nodeId=<hostname>:45454, nodeAddress=<hostname>:8042, nodeRackName=/default-rack, nodeNumContainers=0

18/12/31 17:04:35 INFO distributedshell.Client: Got node report from ASM for, nodeId=<hostname>:45454, nodeAddress=<hostname>:8042, nodeRackName=/default-rack, nodeNumContainers=1

18/12/31 17:04:35 INFO distributedshell.Client: Got node report from ASM for, nodeId=<hostname>::45454, nodeAddress=<hostname>::8042, nodeRackName=/default-rack, nodeNumContainers=0

18/12/31 17:04:35 INFO distributedshell.Client: Queue info, queueName=default, queueCurrentCapacity=0.03125, queueMaxCapacity=1.0, queueApplicationCount=1, queueChildQueueCount=0

18/12/31 17:04:35 INFO distributedshell.Client: User ACL Info for Queue, queueName=root, userAcl=SUBMIT_APPLICATIONS

18/12/31 17:04:35 INFO distributedshell.Client: User ACL Info for Queue, queueName=root, userAcl=ADMINISTER_QUEUE

18/12/31 17:04:35 INFO distributedshell.Client: User ACL Info for Queue, queueName=default, userAcl=SUBMIT_APPLICATIONS

18/12/31 17:04:35 INFO distributedshell.Client: User ACL Info for Queue, queueName=default, userAcl=ADMINISTER_QUEUE

18/12/31 17:04:35 INFO distributedshell.Client: Max mem capability of resources in this cluster 8192

18/12/31 17:04:35 INFO distributedshell.Client: Max virtual cores capability of resources in this cluster 38

18/12/31 17:04:35 WARN distributedshell.Client: AM Memory not specified, use 100 mb as AM memory

18/12/31 17:04:35 WARN distributedshell.Client: AM vcore not specified, use 1 mb as AM vcores

18/12/31 17:04:35 WARN distributedshell.Client: AM Resource capability=<memory:100, vCores:1>

18/12/31 17:04:35 ERROR distributedshell.Client: Error running Client

org.apache.hadoop.yarn.exceptions.ResourceNotFoundException: Unknown resource: yarn.io/gpu

at org.apache.hadoop.yarn.applications.distributedshell.Client.validateResourceTypes(Client.java:1218)

at org.apache.hadoop.yarn.applications.distributedshell.Client.setContainerResources(Client.java:1204)

at org.apache.hadoop.yarn.applications.distributedshell.Client.run(Client.java:735)

at org.apache.hadoop.yarn.applications.distributedshell.Client.main(Client.java:265)

at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)

at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)

at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)

at java.lang.reflect.Method.invoke(Method.java:498)

at org.apache.hadoop.util.RunJar.run(RunJar.java:318)

at org.apache.hadoop.util.RunJar.main(RunJar.java:232)

root@cloudgpu-server:~/HDP#

I have followed the steps described here:

https://hortonworks.com/blog/gpus-support-in-apache-hadoop-3-1-yarn-hdp-3/#comment-26766

Can anyone advise on this?
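One thing that may be worth checking for the `Unknown resource: yarn.io/gpu` error is whether the resource type is declared at the cluster level at all. The thread does not confirm this as the cause, so treat it as an assumption. In Apache Hadoop, extra resource types are listed in the `yarn.resource-types` property, commonly kept in `resource-types.xml`. The sketch below parses a Hadoop-style configuration XML and reports whether `yarn.io/gpu` is listed. The function names and the sample XML are hypothetical, and where that file lives on an HDP node is not stated in the thread.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample of a Hadoop-style configuration file.
SAMPLE_RESOURCE_TYPES_XML = """<configuration>
  <property>
    <name>yarn.resource-types</name>
    <value>yarn.io/gpu</value>
  </property>
</configuration>"""


def declared_resource_types(config_xml):
    """Return the resource types listed under yarn.resource-types."""
    root = ET.fromstring(config_xml)
    for prop in root.findall("property"):
        if prop.findtext("name", "").strip() == "yarn.resource-types":
            value = prop.findtext("value", "") or ""
            # The property holds a comma-separated list of type names.
            return [t.strip() for t in value.split(",") if t.strip()]
    return []


def gpu_type_declared(config_xml):
    """True if yarn.io/gpu appears in yarn.resource-types."""
    return "yarn.io/gpu" in declared_resource_types(config_xml)


if __name__ == "__main__":
    print(gpu_type_declared(SAMPLE_RESOURCE_TYPES_XML))
```

If the check comes back false for the configuration the ResourceManager and the client actually load, a request for `yarn.io/gpu=1` would be expected to fail validation before any container is scheduled.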