I am unable to create a support case for this, as my support license does not appear to cover this ML Experiments.
I am evaluating Cloudera CML on my company's CDP instance. I created several Experiments on a project to test out the feature (it was a simple script generating some random numbers for metrics) and the experiment created and ran ok. However, I tried to create the same experiment later and the UI just timed out. I used the developer's tools in the browser to see the request and "runs" failed with 504 timeout error. I am unable to troubleshoot the issue further in the UI, and am uncertain if ECS would have any useful information. I could get it to run successfully a few hours later. To my knowledge, nobody touched the CDP cluster during in between this period. During the period I cannot create experiments, I looked at my kubernetes namespace <CML workspace name - userXX> inside ECS, and did not see a pod starting. There was also no events being created. We would like to better understand the product before productionizing it. Could anyone advise where I might look at to debug the issue? Thank you.
The version we're using is CML 1.4 on CDP, in an intranet network.