Member since: 07-06-2018
Posts: 24
Kudos Received: 4
Solutions: 4
My Accepted Solutions
Title | Views | Posted
---|---|---
 | 3093 | 11-06-2019 04:11 AM
 | 2181 | 10-31-2018 04:56 AM
 | 6508 | 10-14-2018 11:04 AM
 | 7667 | 07-18-2018 12:26 AM
11-06-2019
04:11 AM
After we configured these additional firewall rules, I was able to build the custom image:

- http://archive.cloudera.com/* (port 80)
- http://ppa.launchpad.net/* (port 80)

A hint on the Docker company registry: at first I was a bit confused about the company registry, because we never set up a public Docker registry of our own. This matters because the build and push commands expect the company registry as an image prefix. I could find it simply with the command:

```
docker images
```

This lists the images registered on the machine, and from the existing image names I could see that the company registry in my case was docker.repository.cloudera.com/cdsw/. So I just built the Docker image with my company registry; in my case there was no need to push the image anywhere (see the CDSW GPU Guide):

```bash
docker build --network host -t docker.repository.cloudera.com/cdsw/cdsw-cuda:8 . -f cuda.Dockerfile
```

After that, the image was built successfully and can be listed with docker images; one should see the newly created Docker image in the repository. For the last step, the site admin has to add this image to CDSW (see the CDSW GPU Guide). The Repository:Tag in CDSW is the same as shown above, including the company registry. In this example it would be:

Repository:Tag = docker.repository.cloudera.com/cdsw/cdsw-cuda:8

To summarize: for air-gapped installations you need these firewall rules to be able to build the NVIDIA image for CDSW (all on port 80):

- http://developer.download.nvidia.com/*
- http://archive.ubuntu.com/ubuntu/*
- http://security.ubuntu.com/ubuntu/*
- http://archive.cloudera.com/*
- http://ppa.launchpad.net/*

Regards.
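PS: Putting the steps above together, here is a minimal shell sketch of the whole sequence. The registry prefix and the cdsw-cuda:8 tag are the ones from my environment and will differ on yours:

```bash
# Find the company-registry prefix from images already on the host
docker images

# Build the CUDA engine image with that prefix (cuda.Dockerfile from
# the CDSW GPU Guide must be in the current directory)
docker build --network host \
  -t docker.repository.cloudera.com/cdsw/cdsw-cuda:8 . \
  -f cuda.Dockerfile

# Confirm the new image exists before the site admin registers it in CDSW
docker images | grep cdsw-cuda
```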
10-29-2019
06:38 AM
I couldn't figure the problem out myself. Can someone please help? Many thanks!
10-23-2019
02:27 PM
I'm trying to create a custom CUDA-capable engine image following the CDSW GPU Guide on our air-gapped CDSW cluster. We have a CDSW cluster with 1 Master node and 2 Worker nodes (one of the Worker nodes is equipped with NVIDIA GPUs).

I used the following command on the GPU host, as given in the CDSW GPU Guide, to build the CUDA Docker image:

```bash
docker build --network host -t <company-registry>/cdsw-cuda:8 . -f cuda.Dockerfile
```

Beforehand we configured these firewall rules on our GPU host:

- http://developer.download.nvidia.com/*
- http://archive.ubuntu.com/ubuntu/*
- http://security.ubuntu.com/ubuntu/*

Unfortunately I get these error messages (the full stdout output is attached):

```
Reading package lists...
W: The repository 'http://archive.cloudera.com/kudu/ubuntu/xenial/amd64/kudu xenial-kudu5 Release' does not have a Release file.
W: The repository 'http://ppa.launchpad.net/deadsnakes/ppa/ubuntu xenial Release' does not have a Release file.
E: Failed to fetch http://archive.cloudera.com/kudu/ubuntu/xenial/amd64/kudu/dists/xenial-kudu5/contrib/source/Sources 403 Forbidden
E: Failed to fetch http://ppa.launchpad.net/deadsnakes/ppa/ubuntu/dists/xenial/main/binary-amd64/Packages 403 Forbidden [IP: 91.189.95.83 80]
E: Some index files failed to download. They have been ignored, or old ones used instead.
The command '/bin/sh -c apt-get update && apt-get install -y --no-install-recommends cuda-cudart-$CUDA_PKG_VERSION && ln -s cuda-10.0 /usr/local/cuda && rm -rf /var/lib/apt/lists/*' returned a non-zero code: 100
```

At this point I have several questions:

1. Is it correct to build the Docker image on the GPU host, or should it be done on the CDSW Master?
2. I couldn't find some of the paths given in cuda.Dockerfile anywhere on my CDSW cluster, for instance /etc/apt/sources.list.d/, /usr/local/nvidia/lib, or /var/lib/apt/lists/*. Is this OK?
3. Do air-gapped installations need more firewall rules to be configured? (See the quick reachability sketch below.)

I would be grateful for some solution or feedback on this. Thanks!
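To narrow question 3 down, one thing I plan to check is whether each repository the Dockerfile pulls from is actually reachable from the GPU host. This is just a quick curl sketch of my own, not something from the GPU Guide:

```bash
# A 403 here reproduces the apt-get failure without running the whole
# docker build; the last two URLs are the ones failing in the log above.
for url in \
  http://developer.download.nvidia.com/ \
  http://archive.ubuntu.com/ubuntu/ \
  http://security.ubuntu.com/ubuntu/ \
  http://archive.cloudera.com/ \
  http://ppa.launchpad.net/
do
  printf '%-45s ' "$url"
  curl -s -o /dev/null --max-time 10 -w '%{http_code}\n' "$url"
done
```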
10-17-2019
08:19 AM
Hello @GangWar, thank you, this was the solution! Yes, we have an air-gapped environment. After I ran the command you suggested on my GPU host, CDSW detected the GPUs:

```bash
docker tag docker-registry.infra.cloudera.com/cdsw/third-party/nvidia/k8s-device-plugin:1.11 nvidia/k8s-device-plugin:1.11
```

A reboot or a restart of the CDSW roles was not necessary in my case. Maybe the Cloudera docs could be updated with this information. Thank you again, Sir! Regards.
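For anyone hitting the same symptom later, this is how I would check whether the re-tag is needed on a host. Just a sketch based on my environment, where the device-plugin image ships under the long registry path inside the CDSW parcel:

```bash
# If only the long registry path shows up in the output, the short
# name is missing on this host and the docker tag below is needed.
docker images | grep k8s-device-plugin

# Create the short alias locally (no pull required on an air-gapped host)
docker tag \
  docker-registry.infra.cloudera.com/cdsw/third-party/nvidia/k8s-device-plugin:1.11 \
  nvidia/k8s-device-plugin:1.11
```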
10-16-2019
05:09 PM
By restarting the CDSW role I ran into another issue, but I could solve it by myself. The Docker daemon on the GPU host couldn't be started and gave these error messages (included here for the sake of completeness):

```
+ result=0
+ shift
+ err_msg='Unable to copy [/run/cloudera-scm-agent/process/5807-cdsw-CDSW_DOCKER/cdsw.conf] => [/etc/cdsw/scratch/dockerd.conf].'
+ '[' 0 -eq 0 ']'
+ return
+ SERVICE_PID_FILE=/etc/cdsw/scratch/dockerd.pid
+ curr_pid=15571
+ echo 15571
+ dockerd_opts=()
+ dockerd_opts+=(--log-driver=journald)
+ dockerd_opts+=(--log-opt labels=io.kubernetes.pod.namespace,io.kubernetes.container.name,io.kubernetes.pod.name)
+ dockerd_opts+=(--iptables=false)
+ '[' devicemapper == devicemapper ']'
+ dockerd_opts+=(-s devicemapper)
+ dockerd_opts+=(--storage-opt dm.basesize=100G)
+ dockerd_opts+=(--storage-opt dm.thinpooldev=/dev/mapper/docker-thinpool)
+ dockerd_opts+=(--storage-opt dm.use_deferred_removal=true)
+ /usr/bin/nvidia-smi
+ '[' 0 -eq 0 ']'
+ '[' true == true ']'
+ dockerd_opts+=(--add-runtime=nvidia=${CDSW_ROOT}/nvidia/bin/nvidia-container-runtime)
+ dockerd_opts+=(--default-runtime=nvidia)
+ mkdir -p /var/lib/cdsw/docker-tmp
+ die_on_error 0 'Unable to create directory [/var/lib/cdsw/docker-tmp].'
+ result=0
+ shift
+ err_msg='Unable to create directory [/var/lib/cdsw/docker-tmp].'
+ '[' 0 -eq 0 ']'
+ return
+ HTTP_PROXY=
+ HTTPS_PROXY=
+ NO_PROXY=
+ ALL_PROXY=
+ DOCKER_TMPDIR=/var/lib/cdsw/docker-tmp
+ exec /opt/cloudera/parcels/CDSW-1.6.0.p1.1294376/docker/bin/dockerd --log-driver=journald --log-opt labels=io.kubernetes.pod.namespace,io.kubernetes.container.name,io.kubernetes.pod.name --iptables=false -s devicemapper --storage-opt dm.basesize=100G --storage-opt dm.thinpooldev=/dev/mapper/docker-thinpool --storage-opt dm.use_deferred_removal=true --add-runtime=nvidia=/opt/cloudera/parcels/CDSW-1.6.0.p1.1294376/nvidia/bin/nvidia-container-runtime --default-runtime=nvidia
time="2019-10-17T00:54:34.267458369+02:00" level=info msg="libcontainerd: new containerd process, pid: 15799"
Error starting daemon: error initializing graphdriver: devmapper: Base Device UUID and Filesystem verification failed: devicemapper: Error running deviceCreate (ActivateDevice) dm_task_run failed
```

I could solve it, as written here, by removing /var/lib/docker. After this, restarting the CDSW role was possible. However, the reboot of the GPU host and the restart of the role unfortunately changed nothing. How can I send you the logs, since I can't attach them here? Thank you!
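In case it helps someone else, here is roughly what my cleanup amounts to. This is a sketch of my own steps, not from the docs; removing /var/lib/docker throws away all local images and containers, so they must be rebuilt or re-pulled afterwards:

```bash
# Stop the CDSW Docker Daemon role for this host in Cloudera Manager first.

# Wipe the corrupted devicemapper state (destroys all local images!)
rm -rf /var/lib/docker

# Then start the CDSW Docker Daemon role again from Cloudera Manager;
# dockerd recreates its state on the thinpool at startup.
```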
10-15-2019
04:34 PM
Hello @GangWar, thank you for the fast reply. Here is a screenshot from our CDSW home page. And this is my cdsw.conf file:

```
JAVA_HOME="/usr/java/jdk1.8.0_181"
KUBE_TOKEN="34434e.107e63749c742556"
DISTRO="CDH"
DISTRO_DIR="/opt/cloudera/parcels"
AUXILIARY_NODES="ADDRESS_OF_GPU-HOST"
CDSW_CLUSTER_SECRET="CLUSTER_SECRET_KEY"
DOMAIN="workbench.company.com"
LOGS_STAGING_DIR="/var/lib/cdsw/tmp"
MASTER_IP="ADDRESS_OF_CDSW-MASTER"
NO_PROXY=""
NVIDIA_GPU_ENABLE="true"
RESERVE_MASTER="false"
```

Regards!
10-15-2019
04:21 PM
Thank you very much for the reply and answer @Bhuv
10-15-2019
08:44 AM
I'm struggling with enabling GPU support on our CDSW 1.6. We have a CDSW cluster with 1 Master node and 2 Worker nodes (one of the Worker nodes is equipped with NVIDIA GPUs).

What I did so far:

- Successfully upgraded our CDSW cluster to CDSW 1.6.
- Prepared the GPU host according to the CDSW NVIDIA GPU Guide.
- Disabled Nouveau on the GPU host.
- Successfully installed the NVIDIA driver with the matching kernel header version (this NVIDIA guide was very useful).
- Checked that the NVIDIA driver is correctly installed with the command nvidia-smi; the output shows the correct number of NVIDIA GPUs installed on the GPU host.
- Added the GPU host to the CDSW cluster with Cloudera Manager and added the Worker and Docker Daemon role instances to the GPU host.
- Enabled GPU support for CDSW in Cloudera Manager.
- Restarted the CDSW service with Cloudera Manager successfully.

When I log into the CDSW web GUI, I don't see any available GPUs as shown in the CDSW NVIDIA GPU Guide. What I can see is that my CDSW cluster now has more CPU and memory available since I added the GPU host, so CDSW does recognize the new host, but unfortunately not my GPUs. Also, when I run cdsw status on the GPU host, I can see that CDSW does not recognize the GPUs: 0 GPUs available.

Output of cdsw status

Can someone please help me out? Thanks!
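For completeness, the checks I ran on the GPU host look roughly like this; my own sketch, where the grep is just there to filter the relevant lines of the status output:

```bash
# Driver level: should list every GPU installed in the host
nvidia-smi -L

# CDSW level: cdsw status reports how many GPUs the cluster sees
# (currently "0 GPUs available" in my case)
cdsw status | grep -i gpu
```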
10-15-2019
07:45 AM
I successfully installed CDSW 1.6 on 1 Master node and 2 Worker nodes. Unfortunately, on both Worker nodes I get these error messages when I check the nodes with cdsw status (on the Master node there are no error messages):

```
Failed to run CDSW secrets check.
Failed to run CDSW persistent volumes check.
Failed to run CDSW persistent volumes claims check.
Failed to run CDSW Ingresses check.
Checking web at url: http://workbench.company.com
OK: HTTP port check
Cloudera Data Science Workbench is not ready yet
```

I haven't seen any drawbacks so far and everything seems to be working quite nicely, but I'm interested in why I get these error messages and what they mean. Can someone help please? Thanks!
04-05-2019
01:03 AM
1 Kudo
Thank you for the great explanation @AutoIN. This solved my problem. On our CDSW cluster we have 2 nodes, a master and a slave. As described, I was able to figure out that the available CPU and memory on the two hosts are unevenly distributed: for example, I'm able to spin up an engine with a lot of vCPUs but only a little memory, and vice versa. I was just not aware that a session can't share resources across nodes. Thank you very much!
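In case it helps others, the uneven split can be inspected per node. A sketch, assuming kubectl access on the CDSW master (CDSW runs on Kubernetes underneath):

```bash
# A session must fit on a single node, so the per-node allocatable
# CPU/memory matter, not the cluster-wide totals shown on the dashboard.
kubectl describe nodes | grep -A 6 Allocatable
```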