09-20-2024
07:45 AM
2 Kudos
FYI, we solved it. CDSW 1.10.5 ships with old nvidia-container-tools that are incompatible with newer NVIDIA drivers.
1. Completely purge the system of any cuda-drivers, nvidia-drivers, and the cuda-toolkit.
2. Install old nvidia-drivers (tested with 470.103.01) and DO NOT install CUDA on the host!
3. ...
4. Profit
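In case it helps anyone checking their own hosts: below is a minimal Python sketch that compares the installed driver version against the one version reported as tested here (470.103.01). It assumes `nvidia-smi` is on the PATH; the helper names (`parse_version`, `compare_to_tested`, `installed_driver_versions`) are made up for illustration. It only reports whether a driver is older, newer, or the same as the tested one. It does not know which other versions are compatible with the bundled nvidia-container-tools.

```python
import subprocess

# The only driver version reported as tested in this thread (assumption: others are unverified).
TESTED_DRIVER = "470.103.01"


def parse_version(text):
    """Turn '470.103.01' into (470, 103, 1) so versions compare numerically."""
    return tuple(int(part) for part in text.strip().split("."))


def compare_to_tested(installed, tested=TESTED_DRIVER):
    """Return 'same', 'older', or 'newer' relative to the tested driver."""
    a, b = parse_version(installed), parse_version(tested)
    if a == b:
        return "same"
    return "older" if a < b else "newer"


def installed_driver_versions():
    """Ask nvidia-smi for the driver version (it prints one line per GPU)."""
    out = subprocess.run(
        ["nvidia-smi", "--query-gpu=driver_version", "--format=csv,noheader"],
        capture_output=True, text=True, check=True,
    ).stdout
    # Deduplicate, since every GPU on a host normally reports the same driver.
    return sorted({line.strip() for line in out.splitlines() if line.strip()})
```

On a GPU host, something like `for v in installed_driver_versions(): print(v, compare_to_tested(v))` would show whether the host still runs a driver newer than the tested one.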
09-20-2024
12:53 AM
1 Kudo
Update: I think we have narrowed the issue down. One of the CDSW pods is constantly crashing, specifically "nvidia-device-plugin-daemonset-rpkqm". Judging by the logs from one of the crashed containers, k8s is not able to invoke CUDA:
container_linux.go:247: starting container process caused "process_linux.go:337: running prestart hook 0 caused \"error running hook: exit status 1, stdout: , stderr: exec command: [/opt/cloudera/parcels/CDSW/nvidia/bin/nvidia-container-cli --load-kmods configure --ldconfig=@/sbin/ldconfig --device=all --utility --pid=807 /var/lib/docker/devicemapper/mnt/f713fe306e9d51cc24017a76d61b1a491533c1f626a3a89737a16f10aefe4015/rootfs]\\nnvidia-container-cli: initialization error: cuda error: os call failed or operation not supported on this os\\n\""
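The interesting part of that message is buried at the very end, after the escaped newline. As a rough sketch (not an official tool, and the function name `extract_cli_error` is made up), this Python snippet pulls the `nvidia-container-cli` error chain out of such a log line. It assumes the error text always follows the literal prefix `nvidia-container-cli: ` and ends at the next backslash or quote, which holds for the line above but may not hold for other runtime versions.

```python
import re

# Capture everything after "nvidia-container-cli: " up to the next backslash or quote.
CLI_ERROR = re.compile(r'nvidia-container-cli: ([^\\"]+)')

# Tail of the log line quoted in this post, kept verbatim (escaped newlines and quotes included).
SAMPLE = r'rootfs]\\nnvidia-container-cli: initialization error: cuda error: os call failed or operation not supported on this os\\n\""'


def extract_cli_error(log_line):
    """Return the error chain as a list, outermost cause first; empty if none found."""
    match = CLI_ERROR.search(log_line)
    if match is None:
        return []
    return [part.strip() for part in match.group(1).split(": ")]


print(extract_cli_error(SAMPLE))
```

For the sample this yields three parts: the initialization error, the CUDA error, and the underlying OS-level message, which makes it easier to search for the innermost cause.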
09-19-2024
04:59 AM
1 Kudo
Hi all, I'm trying to get our CDSW 1.10.5 instance to recognize the GPUs, following this documentation: https://docs.cloudera.com/cdsw/1.10.5/gpu/topics/cdsw-gpu.html Basically: nvidia-smi shows that the GPUs are present on the host, but after activating GPU support in Cloudera Manager the GPUs do not show up in the CDSW interface. We tried to follow this advice, since we also run an air-gapped setup: https://community.cloudera.com/t5/Support-Questions/CDSW-1-6-does-not-recognize-NVIDA-GPUs/td-p/280207 Unfortunately, retagging the image did not solve the issue. How can we obtain additional info to diagnose the issue?
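One generic way to get more information is to look for pods that keep restarting. The following is a minimal Python sketch of that idea, assuming a working `kubectl` on the PATH that can talk to the cluster CDSW runs on (which may not match how your installation exposes it). The function names are hypothetical; the JSON fields used (`containerStatuses`, `restartCount`, `state.waiting.reason`) are standard Kubernetes pod status fields.

```python
import json
import subprocess


def unhealthy_containers(pod_list):
    """Yield (namespace, pod, container, restarts, waiting_reason) for suspicious containers."""
    for pod in pod_list.get("items", []):
        meta = pod.get("metadata", {})
        for cs in pod.get("status", {}).get("containerStatuses", []):
            waiting = cs.get("state", {}).get("waiting") or {}
            restarts = cs.get("restartCount", 0)
            # Flag anything that has restarted or is stuck waiting (e.g. CrashLoopBackOff).
            if restarts > 0 or waiting:
                yield (meta.get("namespace"), meta.get("name"),
                       cs.get("name"), restarts, waiting.get("reason"))


def fetch_pods():
    """Fetch all pods in all namespaces as parsed JSON."""
    out = subprocess.run(
        ["kubectl", "get", "pods", "--all-namespaces", "-o", "json"],
        capture_output=True, text=True, check=True,
    ).stdout
    return json.loads(out)
```

Running `for row in unhealthy_containers(fetch_pods()): print(row)` lists the candidates; `kubectl logs <pod> -n <namespace> --previous` then shows the output of the last crashed container for any pod it reports.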