Configure GPU in mixed datanode cluster

Hello,

I am currently trying to add GPUs to one of our HDP clusters. I am following this tutorial: https://blog.cloudera.com/gpus-support-in-apache-hadoop-3-1-yarn-hdp-3/

However, I am running into an issue. The cluster consists of 4 DataNodes running CentOS 7, and only 2 of them have a GPU. I installed nvidia-smi on the 2 GPU nodes, but when I try to start YARN on a node without a GPU, the NodeManager crashes because nvidia-smi cannot be found.

I tried putting that node in a configuration group without the gpu_module_enabled configuration, but the NodeManager still crashes with the following message:

2020-02-11 17:16:17,579 ERROR gpu.GpuResourceHandlerImpl (GpuResourceHandlerImpl.java:bootstrap(77)) - Exception when trying to get usable GPU device
org.apache.hadoop.yarn.exceptions.YarnException: yarn.nodemanager.resource-plugins.gpu.allowed-gpu-devices is set to auto, however automatically discovering GPU information failed, please check NodeManager log for more details, as an alternative, admin can specify yarn.nodemanager.resource-plugins.gpu.allowed-gpu-devices manually to enable GPU isolation.
        at org.apache.hadoop.yarn.server.nodemanager.containermanager.resourceplugin.gpu.GpuDiscoverer.getGpusUsableByYarn(GpuDiscoverer.java:166)
        at org.apache.hadoop.yarn.server.nodemanager.containermanager.linux.resources.gpu.GpuResourceHandlerImpl.bootstrap(GpuResourceHandlerImpl.java:69)
        at org.apache.hadoop.yarn.server.nodemanager.containermanager.linux.resources.ResourceHandlerChain.bootstrap(ResourceHandlerChain.java:58)
        at org.apache.hadoop.yarn.server.nodemanager.LinuxContainerExecutor.init(LinuxContainerExecutor.java:320)
        at org.apache.hadoop.yarn.server.nodemanager.NodeManager.serviceInit(NodeManager.java:391)
        at org.apache.hadoop.service.AbstractService.init(AbstractService.java:164)
        at org.apache.hadoop.yarn.server.nodemanager.NodeManager.initAndStartNodeManager(NodeManager.java:933)
        at org.apache.hadoop.yarn.server.nodemanager.NodeManager.main(NodeManager.java:1013)
2020-02-11 17:16:17,581 ERROR nodemanager.LinuxContainerExecutor (LinuxContainerExecutor.java:init(323)) - Failed to bootstrap configured resource subsystems!
org.apache.hadoop.yarn.server.nodemanager.containermanager.linux.resources.ResourceHandlerException: org.apache.hadoop.yarn.exceptions.YarnException: yarn.nodemanager.resource-plugins.gpu.allowed-gpu-devices is set to auto, however automatically discovering GPU information failed, please check NodeManager log for more details, as an alternative, admin can specify yarn.nodemanager.resource-plugins.gpu.allowed-gpu-devices manually to enable GPU isolation.
        at org.apache.hadoop.yarn.server.nodemanager.containermanager.linux.resources.gpu.GpuResourceHandlerImpl.bootstrap(GpuResourceHandlerImpl.java:78)
        at org.apache.hadoop.yarn.server.nodemanager.containermanager.linux.resources.ResourceHandlerChain.bootstrap(ResourceHandlerChain.java:58)
        at org.apache.hadoop.yarn.server.nodemanager.LinuxContainerExecutor.init(LinuxContainerExecutor.java:320)
        at org.apache.hadoop.yarn.server.nodemanager.NodeManager.serviceInit(NodeManager.java:391)
        at org.apache.hadoop.service.AbstractService.init(AbstractService.java:164)
        at org.apache.hadoop.yarn.server.nodemanager.NodeManager.initAndStartNodeManager(NodeManager.java:933)
        at org.apache.hadoop.yarn.server.nodemanager.NodeManager.main(NodeManager.java:1013)
Caused by: org.apache.hadoop.yarn.exceptions.YarnException: yarn.nodemanager.resource-plugins.gpu.allowed-gpu-devices is set to auto, however automatically discovering GPU information failed, please check NodeManager log for more details, as an alternative, admin can specify yarn.nodemanager.resource-plugins.gpu.allowed-gpu-devices manually to enable GPU isolation.
        at org.apache.hadoop.yarn.server.nodemanager.containermanager.resourceplugin.gpu.GpuDiscoverer.getGpusUsableByYarn(GpuDiscoverer.java:166)
        at org.apache.hadoop.yarn.server.nodemanager.containermanager.linux.resources.gpu.GpuResourceHandlerImpl.bootstrap(GpuResourceHandlerImpl.java:69)
        ... 6 more
2020-02-11 17:16:17,583 INFO  service.AbstractService (AbstractService.java:noteFailure(267)) - Service NodeManager failed in state INITED
org.apache.hadoop.yarn.exceptions.YarnRuntimeException: Failed to initialize container executor
        at org.apache.hadoop.yarn.server.nodemanager.NodeManager.serviceInit(NodeManager.java:393)
        at org.apache.hadoop.service.AbstractService.init(AbstractService.java:164)
        at org.apache.hadoop.yarn.server.nodemanager.NodeManager.initAndStartNodeManager(NodeManager.java:933)
        at org.apache.hadoop.yarn.server.nodemanager.NodeManager.main(NodeManager.java:1013)
Caused by: java.io.IOException: Failed to bootstrap configured resource subsystems!
        at org.apache.hadoop.yarn.server.nodemanager.LinuxContainerExecutor.init(LinuxContainerExecutor.java:324)
        at org.apache.hadoop.yarn.server.nodemanager.NodeManager.serviceInit(NodeManager.java:391)
        ... 3 more
2020-02-11 17:16:17,584 ERROR nodemanager.NodeManager (NodeManager.java:initAndStartNodeManager(936)) - Error starting NodeManager
org.apache.hadoop.yarn.exceptions.YarnRuntimeException: Failed to initialize container executor
        at org.apache.hadoop.yarn.server.nodemanager.NodeManager.serviceInit(NodeManager.java:393)
        at org.apache.hadoop.service.AbstractService.init(AbstractService.java:164)
        at org.apache.hadoop.yarn.server.nodemanager.NodeManager.initAndStartNodeManager(NodeManager.java:933)
        at org.apache.hadoop.yarn.server.nodemanager.NodeManager.main(NodeManager.java:1013)
Caused by: java.io.IOException: Failed to bootstrap configured resource subsystems!
        at org.apache.hadoop.yarn.server.nodemanager.LinuxContainerExecutor.init(LinuxContainerExecutor.java:324)
        at org.apache.hadoop.yarn.server.nodemanager.NodeManager.serviceInit(NodeManager.java:391)
        ... 3 more
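
Since the trace goes through GpuResourceHandlerImpl.bootstrap, the GPU plugin still appears to be active in the configuration that the failing NodeManager loads. One thing that may be worth checking is whether yarn.nodemanager.resource-plugins still lists yarn.io/gpu in that node's yarn-site.xml. The following is only a minimal sketch of such a check: the helper names and the sample XML are hypothetical, and the file location on a real node (often /etc/hadoop/conf/yarn-site.xml on HDP) is an assumption.

```python
import xml.etree.ElementTree as ET

# Property that lists the NodeManager resource plugins in Hadoop 3.1+.
PLUGINS_KEY = "yarn.nodemanager.resource-plugins"
GPU_PLUGIN = "yarn.io/gpu"


def read_properties(xml_text):
    """Parse Hadoop-style <configuration> XML into a {name: value} dict."""
    root = ET.fromstring(xml_text)
    props = {}
    for prop in root.iter("property"):
        name = prop.findtext("name")
        if name:
            props[name.strip()] = (prop.findtext("value") or "").strip()
    return props


def gpu_plugin_enabled(props):
    """Return True if yarn.io/gpu appears in the comma-separated plugin list."""
    plugins = [p.strip() for p in props.get(PLUGINS_KEY, "").split(",")]
    return GPU_PLUGIN in plugins


# Made-up sample standing in for a node's yarn-site.xml
# (hypothetical, not taken from the cluster in question).
SAMPLE = """
<configuration>
  <property>
    <name>yarn.nodemanager.resource-plugins</name>
    <value>yarn.io/gpu</value>
  </property>
</configuration>
"""

print("GPU plugin enabled:", gpu_plugin_enabled(read_properties(SAMPLE)))
```

To use it on a node, the contents of the real yarn-site.xml would be passed to read_properties instead of SAMPLE. If the check reports the plugin as enabled on a node without a GPU, the configuration group override probably did not reach that node.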

And when I try to install nvidia-smi on a node without a GPU, I get an error saying that the nvidia.ko kernel module is missing. Has anyone succeeded in getting such a setup to work, or am I supposed to have a GPU on every node?

Best regards