Created 09-19-2024 04:59 AM
Hi all,
I'm trying to get our CDSW 1.10.5 instance to recognize the GPUs, following this documentation: https://docs.cloudera.com/cdsw/1.10.5/gpu/topics/cdsw-gpu.html
In short: nvidia-smi shows that the GPUs are present on the host, but after enabling GPU support in Cloudera Manager they do not show up in the CDSW interface.
Since we also run an air-gapped setup, we tried to follow this advice: https://community.cloudera.com/t5/Support-Questions/CDSW-1-6-does-not-recognize-NVIDA-GPUs/td-p/2802...
but unfortunately retagging the image did not solve the issue.
How can we obtain additional info to diagnose the issue?
Created 09-20-2024 07:45 AM
FYI, we solved it. CDSW 1.10.5 ships with old nvidia-container-tools that are incompatible with new nvidia-drivers.
1. completely purge the system of any cuda-drivers, nvidia-drivers and the cuda-toolkit
2. install old nvidia-drivers (tested with 470.103.01) and DO NOT install cuda on the host!
3. ...
4. profit
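For anyone who wants to check a host before purging and reinstalling, here is a minimal Python sketch that compares the installed driver version against the one tested above (470.103.01). The thread does not say at which version the incompatibility starts, so "newer than tested" is only a hint to look closer, not proof of a problem. The helper names are made up for illustration; the only external call is `nvidia-smi --query-gpu=driver_version --format=csv,noheader`.

```python
import subprocess

# Driver version reported in this thread as working with CDSW 1.10.5.
TESTED_DRIVER = "470.103.01"


def parse_version(text):
    """Turn a version string such as '470.103.01' into a tuple of ints."""
    return tuple(int(part) for part in text.strip().split("."))


def is_newer_than_tested(installed, tested=TESTED_DRIVER):
    """Return True if the installed driver is newer than the tested one.

    Assumption: 'newer than tested' is only a heuristic warning sign; the
    exact version where the incompatibility begins is not known here.
    """
    return parse_version(installed) > parse_version(tested)


def installed_driver_versions():
    """Ask nvidia-smi for the driver version reported for each GPU.

    Returns an empty list if nvidia-smi is missing or exits with an error.
    """
    try:
        result = subprocess.run(
            ["nvidia-smi", "--query-gpu=driver_version", "--format=csv,noheader"],
            capture_output=True,
            text=True,
            check=True,
        )
    except (FileNotFoundError, subprocess.CalledProcessError):
        return []
    return [line.strip() for line in result.stdout.splitlines() if line.strip()]
```

On a GPU host, calling `installed_driver_versions()` and passing each entry to `is_newer_than_tested()` shows whether the driver is ahead of the tested version.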
Created 09-19-2024 09:03 AM
@MBockhacker, Welcome to our community! To help you get the best possible answer, I have tagged in our CDSW experts @ywu @Gopinath @ZsoltH who may be able to assist you further.
Please feel free to provide any additional information or details about your query. We hope that you will find a satisfactory solution to your question.
Regards,
Vidya Sargur,
Created 09-20-2024 12:53 AM
Update:
I think we have narrowed the issue down further. It seems that one of the CDSW pods is constantly crashing, specifically "nvidia-device-plugin-daemonset-rpkqm".
Looking at the logs from one of the crashed containers, it seems k8s is not able to invoke CUDA:
container_linux.go:247: starting container process caused "process_linux.go:337: running prestart hook 0 caused \"error running hook: exit status 1, stdout: , stderr: exec command: [/opt/cloudera/parcels/CDSW/nvidia/bin/nvidia-container-cli --load-kmods configure --ldconfig=@/sbin/ldconfig --device=all --utility --pid=807 /var/lib/docker/devicemapper/mnt/f713fe306e9d51cc24017a76d61b1a491533c1f626a3a89737a16f10aefe4015/rootfs]\\nnvidia-container-cli: initialization error: cuda error: os call failed or operation not supported on this os\\n\""
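The useful part of that message is buried behind the escaped hook command, so here is a small Python sketch that pulls out just the nvidia-container-cli error text. It assumes the error is introduced by `nvidia-container-cli: ` (with a colon) and ends at the next backslash, quote, or newline, which matches the line above but may not hold for every runtime version. The function name `extract_cli_error` is made up for illustration.

```python
import re

# Matches the text after "nvidia-container-cli: " up to the next backslash,
# double quote, or newline. The colon keeps it from matching the binary path
# inside the "exec command: [...]" part of the message.
CLI_ERROR = re.compile(r'nvidia-container-cli: ([^\\"\n]+)')


def extract_cli_error(log_line):
    """Return the nvidia-container-cli error text, or None if absent."""
    match = CLI_ERROR.search(log_line)
    return match.group(1).strip() if match else None
```

Applied to the log line above, this returns the "initialization error: cuda error: ..." part, which is easier to search for than the full hook failure.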