GPU scheduling with YARN services in HDP 3.0

I'm trying to get GPU scheduling working for a YARN service app that runs in Docker. I have nvidia-docker v2 installed.

In the service description, I've configured resources as follows:

"resource": { "cpus": 2, "memory": "4096", "additional": { "yarn.io/gpu": { "value": 1 } } }

The app fails with the following exception in the NodeManager log, which indicates that the nvidia-docker v1 REST API is required:

2018-09-17 14:19:43,490 WARN  gpu.NvidiaDockerV1CommandPlugin (NvidiaDockerV1CommandPlugin.java:init(145)) - IOException of NvidiaDockerV1CommandPlugin init:
java.net.ConnectException: Connection refused (Connection refused)

What is the recommended way of getting GPU scheduling working in HDP 3.0?

Do I have to downgrade to deprecated nvidia-docker v1? Or is there any other workaround?

1 ACCEPTED SOLUTION

Cloudera Employee

We only support nvidia-docker v1. We're looking at supporting v2, but there are no firm plans yet. v1 works well in our current tests.

5 REPLIES

Expert Contributor
@Amila Silva

HDP 3.0 supports GPU isolation in Docker through nvidia-docker-plugin (https://github.com/NVIDIA/nvidia-docker/wiki/nvidia-docker-plugin), which is part of nvidia-docker v1. Currently only v1 is supported, not the newer version.

@wtan, @Tarun Parimi I've downgraded to nvidia-docker v1, and the REST API is working. When I run curl localhost:3476/v1.0/docker/cli, I get:

--volume-driver=nvidia-docker --volume=nvidia_driver_396.44:/usr/local/nvidia:ro --device=/dev/nvidiactl --device=/dev/nvidia-uvm --device=/dev/nvidia-uvm-tools --device=/dev/nvidia0
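
For readers wondering what the NodeManager does with this response: the reply is a flat string of docker run options, which can be grouped into a volume driver, volumes, and devices. The snippet below is only an illustrative sketch of that grouping, not the actual NvidiaDockerV1CommandPlugin code; the function name parse_docker_cli_args is hypothetical, and the sample string is the curl output shown above.

```python
import shlex
from collections import defaultdict


def parse_docker_cli_args(cli_output: str) -> dict:
    """Group the --key=value options returned by the plugin's /docker/cli endpoint."""
    grouped = defaultdict(list)
    for token in shlex.split(cli_output):
        # Skip anything that is not of the form --key=value.
        if not token.startswith("--") or "=" not in token:
            continue
        key, value = token[2:].split("=", 1)
        grouped[key].append(value)
    return dict(grouped)


# Sample response copied from the curl call in this thread.
sample = (
    "--volume-driver=nvidia-docker "
    "--volume=nvidia_driver_396.44:/usr/local/nvidia:ro "
    "--device=/dev/nvidiactl --device=/dev/nvidia-uvm "
    "--device=/dev/nvidia-uvm-tools --device=/dev/nvidia0"
)

args = parse_docker_cli_args(sample)
print(args["volume-driver"])  # driver that owns the volume
print(args["volume"])         # driver volume mounted read-only into the container
print(args["device"])         # GPU device nodes exposed to the container
```

The name in the --volume entry (nvidia_driver_396.44) is the same name that later appears in docker volume ls.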

Now when I try to run the YARN app, it fails with the following exception:

java.io.IOException: Unable to prepare container:
        at org.apache.hadoop.yarn.server.nodemanager.LinuxContainerExecutor.prepareContainer(LinuxContainerExecutor.java:472)
        at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.prepareContainer(ContainerLaunch.java:368)
        at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:289)
        at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:103)
        at java.util.concurrent.FutureTask.run(FutureTask.java:266)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
        at java.lang.Thread.run(Thread.java:745)
Caused by: org.apache.hadoop.yarn.server.nodemanager.containermanager.runtime.ContainerExecutionException: org.apache.hadoop.yarn.server.nodemanager.containermanager.linux.privileged.PrivilegedOperationException: ExitCodeException exitCode=1: /usr/bin/nvidia-docker | 2018/09/25 15:45:56 Error: failed to run docker command

        at org.apache.hadoop.yarn.server.nodemanager.containermanager.linux.runtime.DockerLinuxContainerRuntime.runDockerVolumeCommand(DockerLinuxContainerRuntime.java:404)
        at org.apache.hadoop.yarn.server.nodemanager.containermanager.linux.runtime.DockerLinuxContainerRuntime.prepareContainer(DockerLinuxContainerRuntime.java:426)

I think this happens because the docker volume it's trying to create already exists. When I run docker volume ls, I get:

DRIVER              VOLUME NAME
nvidia-docker       nvidia_driver_396.44

Why is YARN creating this volume? Isn't it supposed to be handled by nvidia-docker?

Should I manually delete this existing volume? If so, will the volume automatically be deleted when the app is completed?

In case this helps someone: the issue was that I had configured nvidia-docker as the docker binary in the YARN configuration. The setting needs to point to the original docker binary instead.
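
A quick way to catch this misconfiguration is to check which binary the Docker settings point at. The sketch below assumes the binary is set through the docker.binary key in the [docker] section of container-executor.cfg, as described in the Apache Hadoop Docker documentation; the thread does not say which file was edited, so treat that as an assumption. The helper names and the sample config text are hypothetical, and /usr/bin/docker stands in for wherever the original binary lives on your hosts.

```python
def find_docker_binary(cfg_text: str):
    """Return the docker.binary value from the [docker] section, or None if absent."""
    section = None
    for raw in cfg_text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        if line.startswith("[") and line.endswith("]"):
            section = line[1:-1].strip()
            continue
        if section == "docker" and "=" in line:
            key, value = line.split("=", 1)
            if key.strip() == "docker.binary":
                return value.strip()
    return None


def looks_like_wrapper(binary_path: str) -> bool:
    """True if the configured path's file name is the nvidia-docker wrapper."""
    return binary_path.rsplit("/", 1)[-1] == "nvidia-docker"


# Hypothetical config text reproducing the mistake described above.
bad_cfg = """
[docker]
  module.enabled=true
  docker.binary=/usr/bin/nvidia-docker
"""
# Same config pointing at the original docker binary (path is an assumption).
good_cfg = bad_cfg.replace("/usr/bin/nvidia-docker", "/usr/bin/docker")

for name, cfg in (("bad", bad_cfg), ("good", good_cfg)):
    binary = find_docker_binary(cfg)
    print(name, binary, "wrapper!" if looks_like_wrapper(binary) else "ok")
```

With the wrapper configured, YARN's own docker volume command is sent through /usr/bin/nvidia-docker, which matches the "failed to run docker command" error in the stack trace above.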