Support Questions

Find answers, ask questions, and share your expertise

Job status "Scheduling" for a manual job

Explorer

I need help understanding the root cause of a job failure in Cloudera Data Science Workbench.

What does it mean when a job is stuck in the "Scheduling" phase and then terminates as "Failed" after the timeout period? The log file, which usually shows the Python code followed by output from print statements and/or runtime errors, contains only a single line reading "Engine exited with status 34". I was able to re-run the job successfully after a while, so it may have been a temporary glitch in the system, but I am still trying to understand the root cause. Any insight into what happens during the "Scheduling" phase is much appreciated.

1 ACCEPTED SOLUTION

Contributor

I agree with Smarak; the error code typically means that there were not enough resources available to start the job. You could use the Grafana dashboard (available on the Admin page) to look at the cluster resources and load around the time you had this issue. Is it happening consistently?

 

For Jobs, I usually see this at the start or end of the month, when a lot of people schedule periodic jobs and they all trigger at the same time.
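If the failure really is transient contention (which would fit the original report, since the re-run succeeded after a while), one practical workaround is to retry with a growing delay. Below is a minimal sketch in Python; launch_job and JobFailedError are hypothetical placeholders for however you trigger the job and detect the failure, not part of CDSW itself.

import time

class JobFailedError(Exception):
    """Hypothetical error raised when the engine exits abnormally
    (e.g. "Engine exited with status 34")."""

def run_with_retries(launch_job, max_attempts=3, base_delay_s=300):
    """Retry `launch_job` with exponential backoff so a transiently
    overloaded cluster has time to free resources between attempts."""
    for attempt in range(1, max_attempts + 1):
        try:
            return launch_job()
        except JobFailedError:
            if attempt == max_attempts:
                raise  # still failing after the last attempt; give up
            # Back off: 5 minutes, then 10, then 20, ...
            time.sleep(base_delay_s * 2 ** (attempt - 1))

This only papers over the symptom, of course; if the Grafana dashboard shows sustained saturation around those month-boundary spikes, spreading out the job schedules or adding resources is the real fix.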


6 REPLIES

Super Guru

@MadhuNP,

 

It's hard to say without looking at the logs. If the job failed, the log should show some exception messages.

If not, more investigation would be required to get to the root cause, IMO.

 

André

 


Explorer

The log file just had a single line stating "Engine exited with status 34".

Super Collaborator

Hello @MadhuNP 

 

Thanks for using the Cloudera Community. We see your team is working with our Support team on this issue; we shall update the post based on the outcome of that engagement.

 

Regards, Smarak

New Contributor

Hello @smdas,

Do you know of any cause or solution for this issue? We have also faced the same problem.

Thank you for your answer.

 

Regards, Ellen

Super Collaborator

Hello @Ellengogo 

 

I am not able to find the exact case reference, as this post is a few months old. Typically, such issues are caused by a resource constraint, wherein the engine pod (created when a Workbench session is started) is terminated because the cluster cannot supply the requested resources. If the issue is persistent, a Support case would be ideal, as we would need to review the Kubernetes output for the engine pod along with the resource profile and other related artefacts.
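For a first-pass check before opening a case, you can pull the Kubernetes events for the engine pod yourself. A minimal sketch, assuming kubectl access to the cluster backing CDSW; the pod name and namespace below are hypothetical placeholders to replace with your own:

import subprocess

POD_NAME = "engine-abc123"    # hypothetical engine pod name
NAMESPACE = "default-user-1"  # hypothetical namespace

def engine_pod_events(pod=POD_NAME, namespace=NAMESPACE):
    """Fetch the Kubernetes events recorded for the pod. Look for entries
    such as "FailedScheduling ... Insufficient cpu/memory" or "OOMKilled",
    which would confirm a resource constraint."""
    result = subprocess.run(
        ["kubectl", "get", "events",
         "--namespace", namespace,
         "--field-selector", f"involvedObject.name={pod}"],
        capture_output=True, text=True, check=True,
    )
    return result.stdout

if __name__ == "__main__":
    print(engine_pod_events())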

 

Regards, Smarak
