Member since 08-14-2018
6 Posts
1 Kudos Received
0 Solutions
10-01-2018
03:19 PM
1 Kudo
In case this helps someone: the issue was that I had configured nvidia-docker as the docker binary. The setting needs to point to the original docker binary.
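A quick check for this misconfiguration could look like the sketch below. It assumes the setting in question is the docker.binary key in container-executor.cfg (the post does not name the file or the key), and both function names are made up for illustration.

```shell
# Print the last docker.binary value found in a container-executor.cfg-style file.
# Assumption: the key is named docker.binary; the post does not name the setting.
docker_binary_value() {
  sed -n 's/^[[:space:]]*docker\.binary[[:space:]]*=[[:space:]]*//p' "$1" | tail -n 1
}

# Return 1 if the configured binary is the nvidia-docker wrapper, 0 otherwise.
check_docker_binary() {
  bin=$(docker_binary_value "$1")
  case "$bin" in
    "")
      echo "docker.binary is not set in $1"
      return 0 ;;
    *nvidia-docker*)
      echo "docker.binary points at the nvidia-docker wrapper: $bin"
      return 1 ;;
    *)
      echo "docker.binary looks fine: $bin"
      return 0 ;;
  esac
}
```

For example, check_docker_binary /etc/hadoop/conf/container-executor.cfg, where the path is an assumption and may differ on your cluster.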
09-25-2018
05:22 PM
@wtan, @Tarun Parimi I've downgraded to nvidia-docker v1, and the REST API is also working. When I run curl localhost:3476/v1.0/docker/cli, I get:
--volume-driver=nvidia-docker --volume=nvidia_driver_396.44:/usr/local/nvidia:ro --device=/dev/nvidiactl --device=/dev/nvidia-uvm --device=/dev/nvidia-uvm-tools --device=/dev/nvidia0
Now when I try to run the YARN app, it fails with the following exception:
java.io.IOException: Unable to prepare container:
at org.apache.hadoop.yarn.server.nodemanager.LinuxContainerExecutor.prepareContainer(LinuxContainerExecutor.java:472)
at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.prepareContainer(ContainerLaunch.java:368)
at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:289)
at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:103)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
Caused by: org.apache.hadoop.yarn.server.nodemanager.containermanager.runtime.ContainerExecutionException: org.apache.hadoop.yarn.server.nodemanager.containermanager.linux.privileged.PrivilegedOperationException: ExitCodeException exitCode=1: /usr/bin/nvidia-docker | 2018/09/25 15:45:56 Error: failed to run docker command
at org.apache.hadoop.yarn.server.nodemanager.containermanager.linux.runtime.DockerLinuxContainerRuntime.runDockerVolumeCommand(DockerLinuxContainerRuntime.java:404)
at org.apache.hadoop.yarn.server.nodemanager.containermanager.linux.runtime.DockerLinuxContainerRuntime.prepareContainer(DockerLinuxContainerRuntime.java:426)
I think this happens because the Docker volume it's trying to create already exists. When I run docker volume ls, I get:
DRIVER VOLUME NAME
nvidia-docker nvidia_driver_396.44
Why is YARN creating this volume? Isn't it supposed to be handled by nvidia-docker? Should I manually delete this existing volume? If so, will the volume automatically be deleted when the app is completed?
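To compare what the REST endpoint advertises with what Docker already has, the volume name and driver can be pulled out of the argument string shown above. This is only a minimal sketch: the function names are hypothetical, and it assumes the output keeps the --volume=NAME:PATH:MODE shape seen in this post.

```shell
# Read the "docker/cli" argument string on stdin and print the volume name.
# Assumption: the volume flag has the form --volume=NAME:PATH[:MODE].
nvidia_volume_name() {
  tr ' ' '\n' | sed -n 's/^--volume=\([^:]*\):.*/\1/p' | head -n 1
}

# Read the same string on stdin and print the volume driver.
nvidia_volume_driver() {
  tr ' ' '\n' | sed -n 's/^--volume-driver=//p' | head -n 1
}

# Usage (needs the nvidia-docker v1 REST API and the docker CLI on this host):
#   name=$(curl -s localhost:3476/v1.0/docker/cli | nvidia_volume_name)
#   docker volume inspect "$name"
```

Inspecting the volume this way does not change it, so it can be run before deciding whether to delete anything by hand.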
09-18-2018
03:00 AM
I'm trying to get GPU scheduling working with a YARN service app with docker. I have nvidia-docker v2 installed.
In the service description, I've configured resources as follows:
"resource": {
  "cpus": 2,
  "memory": "4096",
  "additional": {
    "yarn.io/gpu": {
      "value": 1
    }
  }
}
The app fails with the following exception in the NodeManager, which indicates that the nvidia-docker v1 REST API is required:
2018-09-17 14:19:43,490 WARN gpu.NvidiaDockerV1CommandPlugin (NvidiaDockerV1CommandPlugin.java:init(145)) - IOException of NvidiaDockerV1CommandPlugin init:
java.net.ConnectException: Connection refused (Connection refused)
What is the recommended way of getting GPU scheduling working in HDP 3.0?
Do I have to downgrade to the deprecated nvidia-docker v1, or is there another workaround?
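A "Connection refused" from the plugin suggests nothing is listening at the endpoint it queries. A minimal way to confirm that from the NodeManager host is sketched below; the function name is hypothetical, and the default URL is an assumption based on the localhost:3476/v1.0/docker/cli address used elsewhere in this thread.

```shell
# Return 0 if an HTTP GET against the nvidia-docker v1 REST endpoint succeeds.
# Assumption: the endpoint is http://localhost:3476/v1.0/docker/cli unless a URL is passed.
nvidia_docker_v1_reachable() {
  url="${1:-http://localhost:3476/v1.0/docker/cli}"
  if curl -sf --max-time 5 "$url" >/dev/null; then
    echo "reachable: $url"
  else
    echo "not reachable: $url"
    return 1
  fi
}
```

With only nvidia-docker v2 installed, this check would be expected to report the endpoint as not reachable, which matches the exception above.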
Labels: Apache Hadoop, Apache YARN, Docker
08-27-2018
03:50 PM
I'm trying out a YARN service application with Docker on HDP 3.0. I execute a script inside the Docker container, and it completes with status 0 after a while. I want the application to end when this happens, but YARN keeps spawning a new container whenever a container completes. Is this the intended behavior? I can see this log in serviceam.log:
2018-08-26 04:29:30,193 [Component dispatcher] INFO component.Component - [COMPONENT mycomp] Transitioned from STABLE to FLEXING on CONTAINER_COMPLETED event.
How do I tell YARN that my application has successfully completed?
Labels: Apache YARN
08-15-2018
08:00 AM
@amarnath reddy pappu, @Jay Kumar SenSharma Thanks for the answers. Yes, I knew that Hadoop 3.1.0 is the version included in HDP 3.0. How can I expect these Hadoop 3.1.1 fixes to reach HDP in the future? a) With a new Ambari version? b) Through updated repos in the Ambari installation wizard? c) Or with a new HDP release (like HDP 3.0.1)?
08-14-2018
11:13 PM
The Hadoop 3.1.1 release includes 435 fixed JIRAs since 3.1.0. I would like to know whether these fixes are already included in HDP 3.0 when I install it via Ambari-2.7.0.0. If not, is there an updated Ambari installation URL, or are there updated repository URLs that I can enter in the cluster installation wizard in Ambari?
Labels: