CDSW 1.10.5 does not Recognize NVIDIA GPUs

New Contributor

Hi all,

I'm trying to get our CDSW 1.10.5 instance to recognize its GPUs, following this documentation: https://docs.cloudera.com/cdsw/1.10.5/gpu/topics/cdsw-gpu.html

In short: nvidia-smi shows that the GPUs are present on the host, but after enabling GPU support in Cloudera Manager they do not show up in the CDSW interface.


Since we also run an air-gapped setup, we tried to follow this advice: https://community.cloudera.com/t5/Support-Questions/CDSW-1-6-does-not-recognize-NVIDA-GPUs/td-p/2802...

Unfortunately, retagging the image did not solve the issue.

How can we obtain additional info to diagnose the issue?

1 ACCEPTED SOLUTION

New Contributor

FYI, we solved it. CDSW 1.10.5 ships with old nvidia-container-tools that are incompatible with newer NVIDIA drivers.

1. Completely purge the system of any CUDA drivers, NVIDIA drivers and the CUDA toolkit.
2. Install an older NVIDIA driver (tested with 470.103.01) and DO NOT install CUDA on the host!
3. ...
4. profit
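
For anyone checking their own host against step 2, here is a minimal Python sketch that compares the installed driver version with the one reported as working above. It assumes `nvidia-smi` is on the PATH; the helper names (`compare_to_tested`, `installed_driver_version`) are made up for illustration. A "newer" result does not prove incompatibility, it only tells you that you are past the version that was tested in this thread.

```python
import subprocess

# Driver version reported as working with CDSW 1.10.5 in this thread.
TESTED_DRIVER = "470.103.01"


def parse_version(version):
    """Turn '470.103.01' into (470, 103, 1) so versions compare numerically."""
    return tuple(int(part) for part in version.strip().split("."))


def compare_to_tested(installed, tested=TESTED_DRIVER):
    """Return 'same', 'newer' or 'older' relative to the tested driver."""
    a, b = parse_version(installed), parse_version(tested)
    if a == b:
        return "same"
    return "newer" if a > b else "older"


def installed_driver_version():
    """Ask nvidia-smi for the driver version (first GPU only)."""
    out = subprocess.run(
        ["nvidia-smi", "--query-gpu=driver_version", "--format=csv,noheader"],
        capture_output=True, text=True, check=True,
    ).stdout
    return out.splitlines()[0].strip()
```

On the GPU host you would call `compare_to_tested(installed_driver_version())`. If it prints "newer", downgrading to the tested version as described in the steps above is the thing to try first.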



Community Manager

@MBockhacker, welcome to our community! To help you get the best possible answer, I have tagged our CDSW experts @ywu, @Gopinath and @ZsoltH, who may be able to assist you further.

Please feel free to provide any additional information or details about your query. We hope that you will find a satisfactory solution to your question.



Regards,

Vidya Sargur,
Community Manager



New Contributor

Update:

We were able to narrow the issue down further. One of the CDSW pods is constantly crashing, specifically "nvidia-device-plugin-daemonset-rpkqm".
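
In case it helps others spot the same symptom, this is a rough Python sketch of how a crashing pod like this one can be picked out of the pod list. It assumes `kubectl` works on the CDSW master host and relies only on the standard pod status fields (`containerStatuses`, `restartCount`, `state.waiting.reason`). The function names and the restart threshold are arbitrary choices for illustration.

```python
import json
import subprocess


def crashing_pods(pod_list_json, min_restarts=3):
    """Return (namespace, pod, container, restarts, reason) for every
    container that is in CrashLoopBackOff or has restarted often."""
    results = []
    for pod in json.loads(pod_list_json).get("items", []):
        meta = pod.get("metadata", {})
        statuses = pod.get("status", {}).get("containerStatuses", [])
        for cs in statuses:
            waiting = cs.get("state", {}).get("waiting") or {}
            reason = waiting.get("reason", "")
            restarts = cs.get("restartCount", 0)
            if reason == "CrashLoopBackOff" or restarts >= min_restarts:
                results.append((meta.get("namespace"), meta.get("name"),
                                cs.get("name"), restarts, reason))
    return results


def fetch_pod_list():
    """Fetch all pods as JSON via kubectl (needs cluster access)."""
    return subprocess.run(
        ["kubectl", "get", "pods", "--all-namespaces", "-o", "json"],
        capture_output=True, text=True, check=True,
    ).stdout
```

Calling `crashing_pods(fetch_pod_list())` on the host lists the candidates. For each one, `kubectl logs <pod> -n <namespace> --previous` should show the output of the last crashed container.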



Looking at the logs from one of the crashed containers, it seems k8s is not able to initialize CUDA:

 

container_linux.go:247: starting container process caused "process_linux.go:337: running prestart hook 0 caused \"error running hook: exit status 1, stdout: , stderr: exec command: [/opt/cloudera/parcels/CDSW/nvidia/bin/nvidia-container-cli --load-kmods configure --ldconfig=@/sbin/ldconfig --device=all --utility --pid=807 /var/lib/docker/devicemapper/mnt/f713fe306e9d51cc24017a76d61b1a491533c1f626a3a89737a16f10aefe4015/rootfs]\\nnvidia-container-cli: initialization error: cuda error: os call failed or operation not supported on this os\\n\""
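
The interesting part of that line is buried under several layers of escaping. As a small convenience, the Python sketch below pulls out the hook binary and the actual nvidia-container-cli message. It is only a regex over the text format seen in this one log line, so treat the patterns as assumptions rather than a stable log format; `summarize_hook_error` is a hypothetical helper name.

```python
import re


def summarize_hook_error(log_line):
    """Extract the hook binary and the nvidia-container-cli error text
    from a runc 'prestart hook' failure line like the one above."""
    # First token after 'exec command: [' is the binary that was run.
    binary = re.search(r"exec command: \[(\S+)", log_line)
    # The error message follows 'nvidia-container-cli: ' and runs until
    # the next backslash, quote or newline.
    message = re.search(r'nvidia-container-cli: ([^\\"\n]+)', log_line)
    return {
        "binary": binary.group(1) if binary else None,
        "error": message.group(1).strip() if message else None,
    }
```

For the line above this yields the binary under /opt/cloudera/parcels/CDSW/nvidia/bin/ and the message "initialization error: cuda error: os call failed or operation not supported on this os", which shows the failure comes from the container tooling bundled in the CDSW parcel and not from the device plugin itself.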

 
