GPU scheduling with YARN services in HDP 3.0

I'm trying to get GPU scheduling working for a Docker-based YARN service application. I have nvidia-docker v2 installed.

In the service description, I've configured resources as follows:

"resource": {
  "cpus": 2,
  "memory": "4096",
  "additional": {
    "yarn.io/gpu": {
      "value": 1
    }
  }
}

The app fails with the following exception in the NodeManager log, which suggests that the nvidia-docker v1 REST API is required:

2018-09-17 14:19:43,490 WARN  gpu.NvidiaDockerV1CommandPlugin (NvidiaDockerV1CommandPlugin.java:init(145)) - IOException of NvidiaDockerV1CommandPlugin init:
java.net.ConnectException: Connection refused (Connection refused)

What is the recommended way of getting GPU scheduling working in HDP 3.0?

Do I have to downgrade to the deprecated nvidia-docker v1, or is there another workaround?
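
The "Connection refused" in the log above suggests that nothing was listening on the endpoint the NodeManager tried to reach. A minimal sketch of a reachability check follows, assuming the nvidia-docker-plugin endpoint is localhost:3476 (the address used with curl later in this thread). The function name plugin_reachable is hypothetical and not part of YARN or nvidia-docker.

```python
import socket


def plugin_reachable(host="localhost", port=3476, timeout=2.0):
    """Return True if a TCP connection to host:port can be opened.

    Assumption: the nvidia-docker v1 plugin listens on localhost:3476,
    as in the curl call quoted in this thread. Adjust if yours differs.
    """
    try:
        # create_connection raises OSError (e.g. ConnectionRefusedError)
        # when nothing is listening on the port.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False
```

Calling plugin_reachable() on the NodeManager host and getting False would be consistent with the ConnectException above. It only tests that the port accepts connections, not that the REST API answers correctly.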

1 ACCEPTED SOLUTION

Cloudera Employee

We currently support only nvidia-docker v1. We are looking into v2 support, but there are no firm plans yet. v1 works well in our current tests.

5 REPLIES

Expert Contributor
@Amila Silva

HDP 3.0 supports GPU isolation in Docker through nvidia-docker-plugin (https://github.com/NVIDIA/nvidia-docker/wiki/nvidia-docker-plugin), which is part of nvidia-docker v1. Currently only v1 is supported; the newer v2 is not.

@wtan, @Tarun Parimi I've downgraded to nvidia-docker v1, and the REST API is working. When I run curl localhost:3476/v1.0/docker/cli, I get:

--volume-driver=nvidia-docker --volume=nvidia_driver_396.44:/usr/local/nvidia:ro --device=/dev/nvidiactl --device=/dev/nvidia-uvm --device=/dev/nvidia-uvm-tools --device=/dev/nvidia0
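
To make the reply above easier to read, here is a rough sketch that splits it into the volume driver, the volumes, and the devices. This is only an illustration of the reply's structure, not the parsing code YARN itself uses. The function name parse_plugin_cli is hypothetical.

```python
import shlex


def parse_plugin_cli(response):
    """Group the docker CLI arguments returned by the plugin by kind."""
    parsed = {"volume_driver": None, "volumes": [], "devices": []}
    for token in shlex.split(response):
        # Each token has the form --name=value.
        key, _, value = token.partition("=")
        if key == "--volume-driver":
            parsed["volume_driver"] = value
        elif key == "--volume":
            parsed["volumes"].append(value)
        elif key == "--device":
            parsed["devices"].append(value)
    return parsed


# The reply quoted above, copied verbatim.
sample = (
    "--volume-driver=nvidia-docker "
    "--volume=nvidia_driver_396.44:/usr/local/nvidia:ro "
    "--device=/dev/nvidiactl --device=/dev/nvidia-uvm "
    "--device=/dev/nvidia-uvm-tools --device=/dev/nvidia0"
)
result = parse_plugin_cli(sample)
```

The single --volume entry names the nvidia_driver_396.44 volume, mounted read-only at /usr/local/nvidia.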

Now when I try to run the YARN app, it fails with the following exception:

java.io.IOException: Unable to prepare container:
        at org.apache.hadoop.yarn.server.nodemanager.LinuxContainerExecutor.prepareContainer(LinuxContainerExecutor.java:472)
        at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.prepareContainer(ContainerLaunch.java:368)
        at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:289)
        at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:103)
        at java.util.concurrent.FutureTask.run(FutureTask.java:266)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
        at java.lang.Thread.run(Thread.java:745)
Caused by: org.apache.hadoop.yarn.server.nodemanager.containermanager.runtime.ContainerExecutionException: org.apache.hadoop.yarn.server.nodemanager.containermanager.linux.privileged.PrivilegedOperationException: ExitCodeException exitCode=1: /usr/bin/nvidia-docker | 2018/09/25 15:45:56 Error: failed to run docker command

        at org.apache.hadoop.yarn.server.nodemanager.containermanager.linux.runtime.DockerLinuxContainerRuntime.runDockerVolumeCommand(DockerLinuxContainerRuntime.java:404)
        at org.apache.hadoop.yarn.server.nodemanager.containermanager.linux.runtime.DockerLinuxContainerRuntime.prepareContainer(DockerLinuxContainerRuntime.java:426)

I think this happens because the Docker volume it is trying to create already exists. When I run docker volume ls, I get:

DRIVER              VOLUME NAME
nvidia-docker       nvidia_driver_396.44

Why is YARN creating this volume? Isn't it supposed to be handled by nvidia-docker?

Should I manually delete this existing volume? If so, will the volume be deleted automatically when the app completes?

In case this helps someone: the issue was that I had configured nvidia-docker as the Docker binary. That setting needs to point to the original docker binary.
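
A small sketch of how one might check for this misconfiguration. It assumes the binary is set through a docker.binary key in container-executor.cfg, which the post above does not state explicitly. The function names and the sample config text are hypothetical.

```python
import os


def docker_binary_from_cfg(cfg_text):
    """Return the docker.binary value from cfg-style text, or None.

    Assumption: the setting is a key=value line named docker.binary.
    """
    for line in cfg_text.splitlines():
        line = line.strip()
        # Skip blanks, comments and section headers such as [docker].
        if not line or line.startswith("#") or line.startswith("["):
            continue
        key, sep, value = line.partition("=")
        if sep and key.strip() == "docker.binary":
            return value.strip()
    return None


def looks_like_nvidia_wrapper(path):
    """True if the configured binary's file name is nvidia-docker."""
    return path is not None and os.path.basename(path) == "nvidia-docker"


# Hypothetical config reproducing the mistake described above.
sample_cfg = """
[docker]
  module.enabled=true
  docker.binary=/usr/bin/nvidia-docker
"""
binary = docker_binary_from_cfg(sample_cfg)
```

Here looks_like_nvidia_wrapper(binary) is True, which flags the setup that caused the "failed to run docker command" error from /usr/bin/nvidia-docker earlier in the thread.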