Support Questions
Find answers, ask questions, and share your expertise
Announcements
Alert: Welcome to the Unified Cloudera Community. Former HCC members be sure to read and learn how to activate your account here.

Configure GPU in mixed datanode cluster

Configure GPU in mixed datanode cluster

New Contributor

Hello,

 

I am currently trying to add some GPU in one of our HDP cluster. I follow this tutorial https://blog.cloudera.com/gpus-support-in-apache-hadoop-3-1-yarn-hdp-3/

But I am encountering an issue, the cluster is composed of 4 datanode running on CentOS 7, and only 2 have GPU. I installed nvidia-smi on the 2 nodes having GPU. But when I try to start YARN on the node without GPU, I got a crash because nvidia-smi could not be found.

I try to put my node in a configuration group without the gpu_module_enabled configuration but, I still got a crash with the following message

 

 

2020-02-11 17:16:17,579 ERROR gpu.GpuResourceHandlerImpl (GpuResourceHandlerImpl.java:bootstrap(77)) - Exception when trying to get usable GPU device
org.apache.hadoop.yarn.exceptions.YarnException: yarn.nodemanager.resource-plugins.gpu.allowed-gpu-devices is set to auto, however automatically discovering GPU information failed, please check NodeManager log for more details, as an alternative, admin can specify yarn.nodemanager.resource-plugins.gpu.allowed-gpu-devices manually to enable GPU isolation.
        at org.apache.hadoop.yarn.server.nodemanager.containermanager.resourceplugin.gpu.GpuDiscoverer.getGpusUsableByYarn(GpuDiscoverer.java:166)
        at org.apache.hadoop.yarn.server.nodemanager.containermanager.linux.resources.gpu.GpuResourceHandlerImpl.bootstrap(GpuResourceHandlerImpl.java:69)
        at org.apache.hadoop.yarn.server.nodemanager.containermanager.linux.resources.ResourceHandlerChain.bootstrap(ResourceHandlerChain.java:58)
        at org.apache.hadoop.yarn.server.nodemanager.LinuxContainerExecutor.init(LinuxContainerExecutor.java:320)
        at org.apache.hadoop.yarn.server.nodemanager.NodeManager.serviceInit(NodeManager.java:391)
        at org.apache.hadoop.service.AbstractService.init(AbstractService.java:164)
        at org.apache.hadoop.yarn.server.nodemanager.NodeManager.initAndStartNodeManager(NodeManager.java:933)
        at org.apache.hadoop.yarn.server.nodemanager.NodeManager.main(NodeManager.java:1013)
2020-02-11 17:16:17,581 ERROR nodemanager.LinuxContainerExecutor (LinuxContainerExecutor.java:init(323)) - Failed to bootstrap configured resource subsystems!
org.apache.hadoop.yarn.server.nodemanager.containermanager.linux.resources.ResourceHandlerException: org.apache.hadoop.yarn.exceptions.YarnException: yarn.nodemanager.resource-plugins.gpu.allowed-gpu-devices is set to auto, however automatically discovering GPU information failed, please check NodeManager log for more details, as an alternative, admin can specify yarn.nodemanager.resource-plugins.gpu.allowed-gpu-devices manually to enable GPU isolation.
        at org.apache.hadoop.yarn.server.nodemanager.containermanager.linux.resources.gpu.GpuResourceHandlerImpl.bootstrap(GpuResourceHandlerImpl.java:78)
        at org.apache.hadoop.yarn.server.nodemanager.containermanager.linux.resources.ResourceHandlerChain.bootstrap(ResourceHandlerChain.java:58)
        at org.apache.hadoop.yarn.server.nodemanager.LinuxContainerExecutor.init(LinuxContainerExecutor.java:320)
        at org.apache.hadoop.yarn.server.nodemanager.NodeManager.serviceInit(NodeManager.java:391)
        at org.apache.hadoop.service.AbstractService.init(AbstractService.java:164)
        at org.apache.hadoop.yarn.server.nodemanager.NodeManager.initAndStartNodeManager(NodeManager.java:933)
        at org.apache.hadoop.yarn.server.nodemanager.NodeManager.main(NodeManager.java:1013)
Caused by: org.apache.hadoop.yarn.exceptions.YarnException: yarn.nodemanager.resource-plugins.gpu.allowed-gpu-devices is set to auto, however automatically discovering GPU information failed, please check NodeManager log for more details, as an alternative, admin can specify yarn.nodemanager.resource-plugins.gpu.allowed-gpu-devices manually to enable GPU isolation.
        at org.apache.hadoop.yarn.server.nodemanager.containermanager.resourceplugin.gpu.GpuDiscoverer.getGpusUsableByYarn(GpuDiscoverer.java:166)
        at org.apache.hadoop.yarn.server.nodemanager.containermanager.linux.resources.gpu.GpuResourceHandlerImpl.bootstrap(GpuResourceHandlerImpl.java:69)
        ... 6 more
2020-02-11 17:16:17,583 INFO  service.AbstractService (AbstractService.java:noteFailure(267)) - Service NodeManager failed in state INITED
org.apache.hadoop.yarn.exceptions.YarnRuntimeException: Failed to initialize container executor
        at org.apache.hadoop.yarn.server.nodemanager.NodeManager.serviceInit(NodeManager.java:393)
        at org.apache.hadoop.service.AbstractService.init(AbstractService.java:164)
        at org.apache.hadoop.yarn.server.nodemanager.NodeManager.initAndStartNodeManager(NodeManager.java:933)
        at org.apache.hadoop.yarn.server.nodemanager.NodeManager.main(NodeManager.java:1013)
Caused by: java.io.IOException: Failed to bootstrap configured resource subsystems!
        at org.apache.hadoop.yarn.server.nodemanager.LinuxContainerExecutor.init(LinuxContainerExecutor.java:324)
        at org.apache.hadoop.yarn.server.nodemanager.NodeManager.serviceInit(NodeManager.java:391)
        ... 3 more
2020-02-11 17:16:17,584 ERROR nodemanager.NodeManager (NodeManager.java:initAndStartNodeManager(936)) - Error starting NodeManager
org.apache.hadoop.yarn.exceptions.YarnRuntimeException: Failed to initialize container executor
        at org.apache.hadoop.yarn.server.nodemanager.NodeManager.serviceInit(NodeManager.java:393)
        at org.apache.hadoop.service.AbstractService.init(AbstractService.java:164)
        at org.apache.hadoop.yarn.server.nodemanager.NodeManager.initAndStartNodeManager(NodeManager.java:933)
        at org.apache.hadoop.yarn.server.nodemanager.NodeManager.main(NodeManager.java:1013)
Caused by: java.io.IOException: Failed to bootstrap configured resource subsystems!
        at org.apache.hadoop.yarn.server.nodemanager.LinuxContainerExecutor.init(LinuxContainerExecutor.java:324)
        at org.apache.hadoop.yarn.server.nodemanager.NodeManager.serviceInit(NodeManager.java:391)
        ... 3 more

 

 

 

And when I try to install nvidia-smi, I got an error saying that the nvidia.ko kernel module is missing. Does anyone succeed to have such infrastructure to work, or am I suppose to have GPU on all nodes ?

 

Best regards

 

Don't have an account?
Coming from Hortonworks? Activate your account here