GPU scheduling with YARN services in HDP 3.0

New Contributor

I'm trying to get GPU scheduling working with a YARN service app running in Docker. I have nvidia-docker v2 installed.

In the service description, I've configured resources as follows:

"resource": { "cpus": 2, "memory": "4096", "additional": { "yarn.io/gpu": { "value": 1 } } }
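For readers who want to see where this fragment sits, here is a minimal sketch that builds a full service description around it. The service name, component name, image ID, and launch command are hypothetical placeholders, and the surrounding field layout is an assumption based on the YARN service spec; only the "resource" block comes from this post.

```python
import json


def gpu_resource(cpus, memory_mb, gpus):
    """Build the "resource" block from the post; memory is a string in the spec."""
    return {
        "cpus": cpus,
        "memory": str(memory_mb),
        "additional": {"yarn.io/gpu": {"value": gpus}},
    }


# Hypothetical service: every name below is a placeholder, not from the post.
service = {
    "name": "gpu-demo",
    "version": "1.0.0",
    "components": [
        {
            "name": "worker",
            "number_of_containers": 1,
            "artifact": {"id": "example/cuda-app:latest", "type": "DOCKER"},
            "launch_command": "nvidia-smi",
            "resource": gpu_resource(2, 4096, 1),
        }
    ],
}

spec_json = json.dumps(service, indent=2)
print(spec_json)
```
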

The app fails with the following exception in the NodeManager log, which suggests that the nvidia-docker v1 REST API is required:

2018-09-17 14:19:43,490 WARN  gpu.NvidiaDockerV1CommandPlugin (NvidiaDockerV1CommandPlugin.java:init(145)) - IOException of NvidiaDockerV1CommandPlugin init:
java.net.ConnectException: Connection refused (Connection refused)

What is the recommended way of getting GPU scheduling working in HDP 3.0?

Do I have to downgrade to the deprecated nvidia-docker v1, or is there another workaround?

1 ACCEPTED SOLUTION

Re: GPU scheduling with YARN services in HDP 3.0

New Contributor

We only support nvidia-docker v1. We're looking at supporting v2, but there are no firm plans yet. v1 works well in our current tests.

REPLIES

Re: GPU scheduling with YARN services in HDP 3.0

Rising Star
@Amila Silva

HDP 3.0 supports GPU isolation in Docker using nvidia-docker-plugin (https://github.com/NVIDIA/nvidia-docker/wiki/nvidia-docker-plugin), which is part of nvidia-docker v1. Currently only v1 is supported, not the newer version.
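The "Connection refused" in the question is what you would expect if nothing is listening on the plugin's REST port. A minimal sketch for checking that from the NodeManager host, assuming the plugin serves on localhost:3476 (the address used later in this thread); the function name is made up for illustration:

```python
import urllib.error
import urllib.request


def plugin_cli_flags(endpoint, timeout=2.0):
    """Return the flag string served by the endpoint, or None if it is unreachable."""
    try:
        with urllib.request.urlopen(endpoint, timeout=timeout) as resp:
            return resp.read().decode("utf-8").strip()
    except (urllib.error.URLError, OSError):
        # Covers connection refused, timeouts, and HTTP error responses.
        return None


# Example usage (address is an assumption, adjust to your plugin's endpoint):
# flags = plugin_cli_flags("http://localhost:3476/v1.0/docker/cli")
# print(flags if flags is not None else "plugin endpoint not reachable")
```

A None result here would match the NodeManager warning above; a non-empty string means the v1 plugin is up and answering.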

Re: GPU scheduling with YARN services in HDP 3.0

New Contributor

@wtan, @Tarun Parimi I've downgraded to nvidia-docker v1, and the REST API is working. When I run curl localhost:3476/v1.0/docker/cli, I get:

--volume-driver=nvidia-docker --volume=nvidia_driver_396.44:/usr/local/nvidia:ro --device=/dev/nvidiactl --device=/dev/nvidia-uvm --device=/dev/nvidia-uvm-tools --device=/dev/nvidia0
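To make that output easier to read, here is a small sketch that groups the returned flags into volume driver, volumes, and devices. It only parses the string shown above; the helper name is hypothetical and this is not how YARN itself processes the response.

```python
import shlex


def parse_nvidia_docker_cli(output):
    """Group the docker flags returned by the plugin's CLI endpoint."""
    parsed = {"volume_driver": None, "volumes": [], "devices": []}
    for token in shlex.split(output):
        flag, _, value = token.partition("=")
        if flag == "--volume-driver":
            parsed["volume_driver"] = value
        elif flag == "--volume":
            parsed["volumes"].append(value)
        elif flag == "--device":
            parsed["devices"].append(value)
    return parsed


# The exact string from the curl call in this post.
sample = (
    "--volume-driver=nvidia-docker "
    "--volume=nvidia_driver_396.44:/usr/local/nvidia:ro "
    "--device=/dev/nvidiactl --device=/dev/nvidia-uvm "
    "--device=/dev/nvidia-uvm-tools --device=/dev/nvidia0"
)
result = parse_nvidia_docker_cli(sample)
print(result)
```

The single volume entry names the nvidia_driver_396.44 volume, and /dev/nvidia0 is the one GPU device exposed on this host.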

Now when I try to run the YARN app, it fails with the following exception:

java.io.IOException: Unable to prepare container:
        at org.apache.hadoop.yarn.server.nodemanager.LinuxContainerExecutor.prepareContainer(LinuxContainerExecutor.java:472)
        at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.prepareContainer(ContainerLaunch.java:368)
        at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:289)
        at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:103)
        at java.util.concurrent.FutureTask.run(FutureTask.java:266)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
        at java.lang.Thread.run(Thread.java:745)
Caused by: org.apache.hadoop.yarn.server.nodemanager.containermanager.runtime.ContainerExecutionException: org.apache.hadoop.yarn.server.nodemanager.containermanager.linux.privileged.PrivilegedOperationException: ExitCodeException exitCode=1: /usr/bin/nvidia-docker | 2018/09/25 15:45:56 Error: failed to run docker command

        at org.apache.hadoop.yarn.server.nodemanager.containermanager.linux.runtime.DockerLinuxContainerRuntime.runDockerVolumeCommand(DockerLinuxContainerRuntime.java:404)
        at org.apache.hadoop.yarn.server.nodemanager.containermanager.linux.runtime.DockerLinuxContainerRuntime.prepareContainer(DockerLinuxContainerRuntime.java:426)

I think this happens because the Docker volume it's trying to create already exists. When I run docker volume ls, I get:

DRIVER              VOLUME NAME
nvidia-docker       nvidia_driver_396.44

Why is YARN creating this volume? Isn't it supposed to be handled by nvidia-docker?

Should I manually delete this existing volume? If so, will the volume automatically be deleted when the app is completed?

Re: GPU scheduling with YARN services in HDP 3.0

New Contributor

In case this helps someone: the issue was that I had configured nvidia-docker as the Docker binary. The setting needs to point to the original docker binary.
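A rough sketch of a check for this misconfiguration. The post does not say which setting was involved; this assumes it is the docker.binary key in the [docker] section of container-executor.cfg, and the sample file contents and function names are made up for illustration.

```python
import os


def find_docker_binary(cfg_text):
    """Return the docker.binary value from the [docker] section, or None."""
    section = None
    for raw in cfg_text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue
        if line.startswith("[") and line.endswith("]"):
            section = line[1:-1]
            continue
        key, sep, value = line.partition("=")
        if sep and section == "docker" and key.strip() == "docker.binary":
            return value.strip()
    return None


def looks_like_nvidia_wrapper(path):
    """True if the configured binary is the nvidia-docker wrapper, not docker."""
    return os.path.basename(path).startswith("nvidia-docker")


# Hypothetical file contents reproducing the mistake described above.
bad_cfg = """
[docker]
  module.enabled=true
  docker.binary=/usr/bin/nvidia-docker
"""
good_cfg = bad_cfg.replace("/usr/bin/nvidia-docker", "/usr/bin/docker")

for name, text in (("bad", bad_cfg), ("good", good_cfg)):
    binary = find_docker_binary(text)
    print(name, binary, looks_like_nvidia_wrapper(binary))
```

The /usr/bin/nvidia-docker path in the stack trace above is the symptom this check looks for: YARN was invoking the wrapper where it expected plain docker.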