Created 09-18-2018 03:00 AM
I'm trying to get GPU scheduling working for a YARN service app that runs in Docker. I have nvidia-docker v2 installed.
In the service description, I've configured resources as follows:
"resource": { "cpus": 2, "memory": "4096", "additional": { "yarn.io/gpu": { "value": 1 } } }
The app fails with the following exception in the NodeManager log, which indicates that the nvidia-docker v1 REST API is required:
2018-09-17 14:19:43,490 WARN gpu.NvidiaDockerV1CommandPlugin (NvidiaDockerV1CommandPlugin.java:init(145)) - IOException of NvidiaDockerV1CommandPlugin init: java.net.ConnectException: Connection refused (Connection refused)
What is the recommended way of getting GPU scheduling working in HDP 3.0?
Do I have to downgrade to the deprecated nvidia-docker v1, or is there another workaround?
Created 09-18-2018 03:42 PM
We only support nvidia-docker v1. We are looking at supporting v2, but there are no firm plans yet. v1 works well in our current tests.
Created 09-18-2018 06:44 AM
HDP 3.0 supports GPU isolation in Docker through nvidia-docker-plugin (https://github.com/NVIDIA/nvidia-docker/wiki/nvidia-docker-plugin), which is part of nvidia-docker v1. Only that version is currently supported, not the newer one.
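The "Connection refused" in the NodeManager log above suggests that nothing was listening where the v1 plugin's REST API was expected. A quick reachability check from the NodeManager host could look like the Python sketch below. The host and port (localhost:3476) are taken from the curl command used later in this thread and may differ on your setup; is_listening is a made-up helper name, not part of YARN or nvidia-docker.

```python
import socket


def is_listening(host="localhost", port=3476, timeout=2.0):
    """Return True if a TCP connection to host:port succeeds.

    The defaults are assumptions based on the curl command in this thread.
    """
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers connection refused, timeouts and name resolution failures.
        return False


if __name__ == "__main__":
    print("nvidia-docker-plugin reachable:", is_listening())
```

This only shows that something accepts TCP connections on that port. It does not confirm that the listener is actually nvidia-docker-plugin.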
Created 09-25-2018 05:22 PM
@wtan, @Tarun Parimi I've downgraded to nvidia-docker v1, and the REST API is working. When I run curl localhost:3476/v1.0/docker/cli, I get:
--volume-driver=nvidia-docker --volume=nvidia_driver_396.44:/usr/local/nvidia:ro --device=/dev/nvidiactl --device=/dev/nvidia-uvm --device=/dev/nvidia-uvm-tools --device=/dev/nvidia0
Now when I try to run the YARN app, it fails with the following exception:
java.io.IOException: Unable to prepare container:
    at org.apache.hadoop.yarn.server.nodemanager.LinuxContainerExecutor.prepareContainer(LinuxContainerExecutor.java:472)
    at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.prepareContainer(ContainerLaunch.java:368)
    at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:289)
    at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:103)
    at java.util.concurrent.FutureTask.run(FutureTask.java:266)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
    at java.lang.Thread.run(Thread.java:745)
Caused by: org.apache.hadoop.yarn.server.nodemanager.containermanager.runtime.ContainerExecutionException: org.apache.hadoop.yarn.server.nodemanager.containermanager.linux.privileged.PrivilegedOperationException: ExitCodeException exitCode=1: /usr/bin/nvidia-docker | 2018/09/25 15:45:56 Error: failed to run docker command
    at org.apache.hadoop.yarn.server.nodemanager.containermanager.linux.runtime.DockerLinuxContainerRuntime.runDockerVolumeCommand(DockerLinuxContainerRuntime.java:404)
    at org.apache.hadoop.yarn.server.nodemanager.containermanager.linux.runtime.DockerLinuxContainerRuntime.prepareContainer(DockerLinuxContainerRuntime.java:426)
I think this happens because the Docker volume it's trying to create already exists. When I run docker volume ls, I get:
DRIVER              VOLUME NAME
nvidia-docker       nvidia_driver_396.44
Why is YARN creating this volume? Isn't it supposed to be handled by nvidia-docker?
Should I manually delete this existing volume? If so, will the volume automatically be deleted when the app is completed?
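For anyone reading along, it helps to see how the flags returned by the /v1.0/docker/cli endpoint line up with the volume shown by docker volume ls. The Python sketch below splits that output into volume driver, volumes and devices. It only illustrates the flag format quoted in this post and is not the NodeManager's actual parsing code; parse_docker_cli_flags and volume_names are hypothetical names.

```python
def parse_docker_cli_flags(output):
    """Split the plugin's CLI flag string into its parts.

    Only the three flags seen in this thread are handled; anything else
    is ignored.
    """
    parsed = {"volume_driver": None, "volumes": [], "devices": []}
    for token in output.split():
        flag, _, value = token.partition("=")
        if flag == "--volume-driver":
            parsed["volume_driver"] = value
        elif flag == "--volume":
            parsed["volumes"].append(value)
        elif flag == "--device":
            parsed["devices"].append(value)
    return parsed


def volume_names(parsed):
    """Return the volume name (the part before the first ':') of each --volume flag."""
    return [v.split(":", 1)[0] for v in parsed["volumes"]]


if __name__ == "__main__":
    sample = (
        "--volume-driver=nvidia-docker "
        "--volume=nvidia_driver_396.44:/usr/local/nvidia:ro "
        "--device=/dev/nvidiactl --device=/dev/nvidia-uvm "
        "--device=/dev/nvidia-uvm-tools --device=/dev/nvidia0"
    )
    flags = parse_docker_cli_flags(sample)
    print(flags["volume_driver"], volume_names(flags), flags["devices"])
```

With the output quoted above, the volume name extracted from the --volume flag is the same nvidia_driver_396.44 that docker volume ls reports under the nvidia-docker driver.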
Created 10-01-2018 03:19 PM
In case this helps someone: the issue was that I had configured nvidia-docker as the Docker binary. That setting needs to point to the original docker binary instead.
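A rough Python sketch of how one might check for this misconfiguration. It assumes the setting in question is docker.binary in the [docker] section of container-executor.cfg; the post does not name the file or key, so treat both as assumptions. The sample config text and the helper names are made up for illustration.

```python
import os

# Illustrative config text only; a real container-executor.cfg will differ.
SAMPLE_CFG = """\
yarn.nodemanager.linux-container-executor.group=hadoop
[docker]
  module.enabled=true
  docker.binary=/usr/bin/nvidia-docker
"""


def read_docker_binary(cfg_text):
    """Return the docker.binary value from the [docker] section, or None if unset."""
    section = None
    for raw in cfg_text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        if line.startswith("[") and line.endswith("]"):
            section = line[1:-1].strip()
            continue
        if section == "docker" and "=" in line:
            key, value = line.split("=", 1)
            if key.strip() == "docker.binary":
                return value.strip()
    return None


def is_nvidia_wrapper(path):
    """True if the configured binary looks like the nvidia-docker wrapper."""
    return os.path.basename(path).startswith("nvidia-docker")


if __name__ == "__main__":
    binary = read_docker_binary(SAMPLE_CFG)
    if binary and is_nvidia_wrapper(binary):
        print("docker.binary points at", binary, "and should point at the plain docker binary")
```

To run it against a real node, read the file contents and pass them to read_docker_binary instead of SAMPLE_CFG.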