
Unable to create CML Experiment - how to debug

Explorer

Hi,

I am unable to create a support case for this, as my support license does not appear to cover ML Experiments.

I am evaluating Cloudera CML on my company's CDP instance. I created several Experiments in a project to test out the feature (a simple script generating some random numbers for metrics), and the experiments were created and ran fine. However, when I tried to create the same experiment later, the UI just timed out. Using the browser's developer tools to inspect the request, I saw the "runs" call fail with a 504 timeout error. I am unable to troubleshoot the issue further in the UI, and am uncertain whether ECS would have any useful information.

I could get it to run successfully a few hours later, and to my knowledge nobody touched the CDP cluster in between. During the period when I could not create experiments, I looked at my Kubernetes namespace <CML workspace name - userXX> inside ECS and did not see a pod starting; no events were being created either.

We would like to understand the product better before productionizing it. Could anyone advise where I might look to debug the issue? Thank you.

 

The version we're using is CML 1.4 on CDP, on an intranet network.
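For reference, this is roughly how I have been checking the namespace in ECS (a sketch; the namespace name below is just an example, substitute your own CML workspace namespace):

# List the pods in the CML workspace namespace (example name)
kubectl get pods -n workspace-userXX

# Show recent events in the namespace, newest last
kubectl get events -n workspace-userXX --sort-by=.lastTimestamp

# Inspect a specific pod for scheduling or image-pull errors
kubectl describe pod <pod-name> -n workspace-userXX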


Explorer

Just an update: In ECS, I see that the pods are being assigned to a particular node and they're "stuck" with a message:

"failed to create pod sandbox: rpc error:code= unknown desc = failed to get sandbox image "index.docker.io/rancher/pause:3.6" failed to pull image "index.docker.rancherpause:3.6": failed to pull and unpack image "docker.io/rancher/pause:3.6".....

 

Explorer

I've raised a support case instead, so I won't be following up on this thread.

New Contributor

Hello,
I have the same issue. Have you got a solution?

Explorer

We're not certain of the cause, but we used kubectl to cordon off the node and rebooted it. The issue happens sporadically. We're also running CML 1.5 now.
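For reference, this is roughly what we ran (the node name is just an example):

# Mark the affected node unschedulable so no new pods land on it
kubectl cordon <node-name>

# Reboot the node out-of-band (SSH/console), then allow scheduling again
kubectl uncordon <node-name>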