Support Questions

Find answers, ask questions, and share your expertise

Job status "Scheduling" for a manual job

Explorer

I need help understanding the root cause of a job failure in Cloudera Data Science Workbench.

What does it mean when a job is stuck in the "Scheduling" phase and terminates as "Failed" after the timeout period? The log file, which usually shows the Python code followed by output from print statements and/or runtime errors, contains only a single line: "Engine exited with status 34". I was able to re-run the job successfully after a while, so it may have been a temporary glitch, but I would still like to understand the root cause. Any insight into what happens during job scheduling is much appreciated.

1 ACCEPTED SOLUTION

Cloudera Employee

I agree with Smarak; the error code typically means that there were not enough resources available to start the job. You could use the Grafana dashboard (available on the Admin page) to look at the cluster resources and load around the time the issue occurred. Is it happening consistently?

 

For Jobs, I usually see this at the start or end of the month, when many people schedule periodic jobs and they all trigger at the same time.
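If the failure really is transient resource contention, one pragmatic workaround is to retry the submission with exponential backoff and jitter, so retries drift away from the burst of simultaneously triggered jobs. A minimal sketch, assuming a hypothetical `start_job` callable that raises `RuntimeError` when the engine cannot be scheduled (this is not the CDSW Jobs API, just an illustration):

```python
import random
import time

def run_with_backoff(start_job, max_attempts=5, base_delay=60.0):
    """Retry a job submission with exponential backoff plus jitter.

    `start_job` is a hypothetical callable that raises RuntimeError when
    the engine cannot be scheduled (e.g. "Engine exited with status 34").
    """
    for attempt in range(max_attempts):
        try:
            return start_job()
        except RuntimeError:
            if attempt == max_attempts - 1:
                raise  # out of attempts; surface the failure
            # Wait longer after each failure, with random jitter, so
            # retries spread out instead of re-colliding with the burst
            # of jobs that all triggered at the same time.
            delay = base_delay * (2 ** attempt) + random.uniform(0, base_delay)
            time.sleep(delay)
```

Staggering the schedules themselves (e.g. offsetting cron minutes across jobs) avoids the collision in the first place; the retry wrapper only softens it.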


6 REPLIES

Master Collaborator

@MadhuNP ,

 

It's hard to say without looking at the logs. If the job failed, the log should show some exception messages.

If not, more investigation would be required to get to the root cause, IMO.

 

André

 

--
Was your question answered? Please take some time to click on "Accept as Solution" below this post.
If you find a reply useful, say thanks by clicking on the thumbs up button.

Explorer

The log file had just a single line stating "Engine exited with status 34".

Super Collaborator

Hello @MadhuNP 

 

Thanks for using the Cloudera Community. We see your team is working with our Support team on this issue. Based on the Support engagement, we will update this post accordingly.

 

Regards, Smarak

New Contributor

Hello, @smdas ,

Do you know of a cause or solution for this issue? We faced the same problem.

Thank you for your answer.

 

Regards, Ellen

Super Collaborator

Hello @Ellengogo 

 

I am not able to find the exact case reference as this post is a few months old. Typically, such issues are caused by a resource constraint: the engine pod (created when a Workbench session is started) gets terminated because the cluster cannot satisfy its resource request. If the issue is persistent, a Support case would be ideal, as we would need to review the Kubernetes output for the engine pod along with the resource profile and other related artefacts.
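To illustrate the kind of Kubernetes output that gets reviewed: the Events section of `kubectl describe pod <engine-pod>` will usually name the resource pressure directly (e.g. `FailedScheduling ... Insufficient cpu`). A minimal sketch that scans saved event text for such signals; the `find_resource_constraints` helper and the sample event text are hypothetical, not part of any Cloudera tooling:

```python
import re

# Event messages that typically indicate the engine pod could not be
# scheduled or was killed due to resource constraints.
RESOURCE_PATTERNS = [
    r"FailedScheduling",
    r"Insufficient cpu",
    r"Insufficient memory",
    r"OOMKilled",
]

def find_resource_constraints(describe_output):
    """Return lines from `kubectl describe pod` output that mention
    resource-related scheduling failures."""
    hits = []
    for line in describe_output.splitlines():
        if any(re.search(p, line) for p in RESOURCE_PATTERNS):
            hits.append(line.strip())
    return hits

# Illustrative event text in the shape kubectl emits:
sample = """\
Events:
  Type     Reason            Age   From               Message
  ----     ------            ----  ----               -------
  Warning  FailedScheduling  2m    default-scheduler  0/4 nodes are available: 4 Insufficient cpu.
"""

print(find_resource_constraints(sample))
```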

 

Regards, Smarak
