GPU scheduling with YARN services in HDP 3.0

I'm trying to get GPU scheduling working for a YARN service app that runs in Docker. I have nvidia-docker v2 installed.

In the service description, I've configured resources as follows:

"resource": {
  "cpus": 2,
  "memory": "4096",
  "additional": {
    "yarn.io/gpu": {
      "value": 1
    }
  }
}

The app fails with the following exception in the NodeManager log, which indicates that the nvidia-docker v1 REST API is required:

2018-09-17 14:19:43,490 WARN  gpu.NvidiaDockerV1CommandPlugin (NvidiaDockerV1CommandPlugin.java:init(145)) - IOException of NvidiaDockerV1CommandPlugin init:
java.net.ConnectException: Connection refused (Connection refused)

What is the recommended way of getting GPU scheduling working in HDP 3.0?

Do I have to downgrade to the deprecated nvidia-docker v1, or is there another workaround?

1 ACCEPTED SOLUTION

Re: GPU scheduling with YARN services in HDP 3.0

Cloudera Employee

We only support nvidia-docker v1. We're looking at supporting v2, but there are no decided plans yet. v1 works well in our current tests.

Re: GPU scheduling with YARN services in HDP 3.0

Rising Star
@Amila Silva

HDP 3.0 supports GPU isolation in Docker through nvidia-docker-plugin (https://github.com/NVIDIA/nvidia-docker/wiki/nvidia-docker-plugin), which is part of nvidia-docker v1. Currently only v1 is supported, not the newer v2.

Re: GPU scheduling with YARN services in HDP 3.0

@wtan, @Tarun Parimi I've downgraded to nvidia-docker v1, and its REST API is working. When I run curl localhost:3476/v1.0/docker/cli, I get:

--volume-driver=nvidia-docker --volume=nvidia_driver_396.44:/usr/local/nvidia:ro --device=/dev/nvidiactl --device=/dev/nvidia-uvm --device=/dev/nvidia-uvm-tools --device=/dev/nvidia0
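
For readers unfamiliar with this response: it is a plain string of docker run options (a volume driver, a named driver volume, and the GPU device nodes). The following is a minimal Python sketch of splitting such a string into its parts so it is easier to inspect. `parse_docker_cli_options` is a hypothetical helper written for this illustration; it is not part of YARN or nvidia-docker, and the sample string is simply the output quoted above.

```python
import shlex
from collections import defaultdict


def parse_docker_cli_options(cli_text):
    """Group `--key=value` docker options by key, keeping values in order.

    Hypothetical helper for inspection only; not a YARN or nvidia-docker API.
    """
    options = defaultdict(list)
    for token in shlex.split(cli_text):
        if not token.startswith("--") or "=" not in token:
            raise ValueError(f"unexpected token: {token!r}")
        key, value = token[2:].split("=", 1)
        options[key].append(value)
    return dict(options)


# Sample copied from the curl output quoted in the post.
sample = (
    "--volume-driver=nvidia-docker "
    "--volume=nvidia_driver_396.44:/usr/local/nvidia:ro "
    "--device=/dev/nvidiactl --device=/dev/nvidia-uvm "
    "--device=/dev/nvidia-uvm-tools --device=/dev/nvidia0"
)

parsed = parse_docker_cli_options(sample)
print("volume driver:", parsed["volume-driver"])
print("volumes:      ", parsed["volume"])
print("devices:      ", parsed["device"])
```

Read this way, the response names one volume (`nvidia_driver_396.44`, mounted read-only at `/usr/local/nvidia`) served by the `nvidia-docker` volume driver, plus four device nodes.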

Now when I try to run the YARN app, it fails with the following exception:

java.io.IOException: Unable to prepare container:
        at org.apache.hadoop.yarn.server.nodemanager.LinuxContainerExecutor.prepareContainer(LinuxContainerExecutor.java:472)
        at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.prepareContainer(ContainerLaunch.java:368)
        at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:289)
        at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:103)
        at java.util.concurrent.FutureTask.run(FutureTask.java:266)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
        at java.lang.Thread.run(Thread.java:745)
Caused by: org.apache.hadoop.yarn.server.nodemanager.containermanager.runtime.ContainerExecutionException: org.apache.hadoop.yarn.server.nodemanager.containermanager.linux.privileged.PrivilegedOperationException: ExitCodeException exitCode=1: /usr/bin/nvidia-docker | 2018/09/25 15:45:56 Error: failed to run docker command

        at org.apache.hadoop.yarn.server.nodemanager.containermanager.linux.runtime.DockerLinuxContainerRuntime.runDockerVolumeCommand(DockerLinuxContainerRuntime.java:404)
        at org.apache.hadoop.yarn.server.nodemanager.containermanager.linux.runtime.DockerLinuxContainerRuntime.prepareContainer(DockerLinuxContainerRuntime.java:426)

I think this happens because the Docker volume it's trying to create already exists. When I run docker volume ls, I get:

DRIVER              VOLUME NAME
nvidia-docker       nvidia_driver_396.44

Why is YARN creating this volume? Isn't it supposed to be handled by nvidia-docker?

Should I manually delete this existing volume? If so, will the volume automatically be deleted when the app is completed?

Re: GPU scheduling with YARN services in HDP 3.0

In case this helps someone: the issue was that I had configured nvidia-docker as the docker binary. That setting needs to point to the original docker binary.
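
A quick way to catch this misconfiguration is to check which binary the NodeManager is told to run. The post does not say where the binary was configured, so the sketch below assumes the `docker.binary` key in the `[docker]` section of `container-executor.cfg`, as described in the Apache Hadoop Docker documentation; on an HDP cluster the value may be managed through Ambari instead. The function names and the sample config text are hypothetical, and `/usr/bin/docker` is only an assumed location for the original binary.

```python
def find_docker_binary(cfg_text):
    """Return the docker.binary value from the [docker] section, or None.

    container-executor.cfg has top-level keys before any section header,
    so it is parsed by hand here rather than with configparser.
    """
    section = None
    for raw in cfg_text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue
        if line.startswith("[") and line.endswith("]"):
            section = line[1:-1].strip()
            continue
        if section == "docker" and "=" in line:
            key, value = line.split("=", 1)
            if key.strip() == "docker.binary":
                return value.strip()
    return None


def points_at_nvidia_wrapper(binary_path):
    """True if the configured binary is the nvidia-docker wrapper script."""
    return binary_path.rsplit("/", 1)[-1] == "nvidia-docker"


# Illustrative config text (assumption, not taken from the poster's cluster).
bad_cfg = """
yarn.nodemanager.linux-container-executor.group=hadoop
[docker]
  module.enabled=true
  docker.binary=/usr/bin/nvidia-docker
"""
good_cfg = bad_cfg.replace("/usr/bin/nvidia-docker", "/usr/bin/docker")

for name, cfg in (("bad", bad_cfg), ("good", good_cfg)):
    binary = find_docker_binary(cfg)
    print(name, binary, "wrapper" if points_at_nvidia_wrapper(binary) else "ok")
```

If the check reports the wrapper, changing the value back to the plain docker binary and restarting the NodeManagers should match the fix described above.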