Created on 02-20-2023 06:33 AM - edited 02-20-2023 06:48 AM
Hi,
I am unable to create a support case for this, as my support license does not appear to cover ML Experiments.
I am evaluating Cloudera CML on my company's CDP instance. I created several Experiments in a project to test the feature (a simple script that generates random numbers as metrics), and the experiments were created and ran fine. However, when I tried to create the same experiment later, the UI just timed out. In the browser's developer tools I could see that the "runs" request failed with a 504 timeout error. I am unable to troubleshoot the issue further in the UI, and I am not sure whether ECS would have any useful information. The experiment ran successfully when I tried again a few hours later. To my knowledge, nobody touched the CDP cluster in between.
While I could not create experiments, I looked at my Kubernetes namespace <CML workspace name - userXX> in ECS and did not see a pod starting. No events were being created either. We would like to better understand the product before productionizing it. Could anyone advise where I might look to debug the issue? Thank you.
The version we're using is CML 1.4 on CDP, in an intranet network.
Created 02-22-2023 10:17 PM
Just an update: In ECS, I see that the pods are being assigned to a particular node and they're "stuck" with a message:
"failed to create pod sandbox: rpc error: code = unknown desc = failed to get sandbox image "index.docker.io/rancher/pause:3.6" failed to pull image "index.docker.io/rancher/pause:3.6": failed to pull and unpack image "docker.io/rancher/pause:3.6".....
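For anyone who wants to check whether the sandbox failures are concentrated on one node, here is a rough Python sketch. It assumes you have saved the output of `kubectl get events -n <namespace> -o json` and that the kubelet reports these failures with the reason `FailedCreatePodSandBox` and fills in `source.host` with the node name. The function name `sandbox_failures_by_node` is made up for illustration; this is not a Cloudera tool, and the field layout may differ on your cluster.

```python
from collections import defaultdict


def sandbox_failures_by_node(events):
    """Group pods with sandbox-creation failures by the reporting node.

    `events` is the parsed JSON of `kubectl get events -n <namespace> -o json`
    (assumption: core/v1 Event objects with `reason`, `source.host`,
    and `involvedObject.name` populated).
    """
    by_node = defaultdict(set)
    for item in events.get("items", []):
        # Only keep events where the pod sandbox could not be created.
        if item.get("reason") != "FailedCreatePodSandBox":
            continue
        node = item.get("source", {}).get("host", "<unknown>")
        pod = item.get("involvedObject", {}).get("name", "<unknown>")
        by_node[node].add(pod)
    # Sort pod names so the result is stable and easy to read.
    return {node: sorted(pods) for node, pods in by_node.items()}
```

To use it, load the saved file with `json.load` and pass the resulting dict to the function. If all affected pods map to a single node, that would be consistent with the image pull problem being local to that node rather than cluster-wide.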
Created 02-23-2023 07:35 PM
I've raised a support case instead and will not be following up on this thread.
Created 07-24-2023 07:09 AM
Hello,
I have the same issue. Have you got a solution?
Created on 07-25-2023 06:49 PM - edited 07-25-2023 06:50 PM
We're not certain of the cause, but we cordoned off the node with kubectl cordon and rebooted it. The issue happens sporadically. We are also running CML 1.5 now.
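In case it helps, the cordon step can be scripted roughly like this. It is only a sketch: the helper names and the node name `worker-3` are hypothetical, the reboot itself is done outside the script, and `kubectl uncordon` is included on the assumption that you want the node schedulable again afterwards. By default it only prints the commands instead of running them.

```python
import subprocess


def node_maintenance_commands(node_name):
    """Return the kubectl commands to run before and after rebooting a node."""
    return {
        "cordon": ["kubectl", "cordon", node_name],      # stop new pods landing on the node
        "uncordon": ["kubectl", "uncordon", node_name],  # allow scheduling again after reboot
    }


def run(cmd, dry_run=True):
    """Print the command in dry-run mode, otherwise execute it."""
    if dry_run:
        print(" ".join(cmd))
        return 0
    # check=True raises CalledProcessError if kubectl exits with an error.
    return subprocess.run(cmd, check=True).returncode


cmds = node_maintenance_commands("worker-3")  # hypothetical node name
run(cmds["cordon"])    # dry run: prints "kubectl cordon worker-3"
# ... reboot the node out of band, then:
run(cmds["uncordon"])  # dry run: prints "kubectl uncordon worker-3"
```

Pass `dry_run=False` once you are happy with the commands. Note that cordoning only blocks new scheduling; pods already on the node stay there until the reboot.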